Pipeline

A Pipeline is a node graph that describes the structure of an ML workflow.

Roles

Every point in the graph has one of three roles. Stage and Head are real nodes; DataSource is implicit:

| Role | Class | Purpose |
| --- | --- | --- |
| DataSource | (implicit) | The original input data. Not a real node; represented as `None` in edges. |
| Stage | `TransformProcessor` | Transforms data and passes it downstream. Stays alive to supply data to child nodes. |
| Head | `PredictProcessor` | Consumes transformed data and produces predictions. Terminal node. |

Nodes and Groups

A node is the unit of execution. Each node has:

- processor: the class that does the work (e.g. StandardScaler, LGBMClassifier)
- edges: which upstream nodes supply which variables
- method: the name of the method to call on the processor
- adapter: optional wrapper that translates data and params to framework conventions
- params: constructor parameters for the processor
- desc: optional free-text description (not inherited from the group)

A group (PipelineGroup) lets multiple nodes share configuration. Node attributes override group attributes; group attributes override parent group attributes.

Note: desc is the only attribute that is not inherited; each group and node holds its own description independently. Changing only desc does not mark any node as affected or in need of rebuilding.
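The override order described above can be pictured as a lookup that starts at the node and walks up through its groups. The following is a minimal sketch, not the library's implementation: `Group`, `Node`, `resolve`, and `INHERITED` are hypothetical names, and it assumes that an attribute set at a lower level replaces the inherited value as a whole rather than being merged with it. edges is left out of the inherited set because child edges extend the parent's instead of replacing them.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Attributes a node may take from its enclosing groups (assumed set).
# `desc` is deliberately absent: it is never inherited.
INHERITED = ("processor", "method", "adapter", "params")


@dataclass
class Group:
    """Hypothetical stand-in for PipelineGroup; not the library's class."""
    name: str
    attrs: dict = field(default_factory=dict)
    parent: Optional["Group"] = None


@dataclass
class Node:
    """Hypothetical stand-in for a pipeline node."""
    name: str
    attrs: dict = field(default_factory=dict)
    group: Optional[Group] = None


def resolve(node: Node, key: str) -> Any:
    """Return the effective value of `key` for `node`, or None if unset."""
    if key in node.attrs:            # the node's own value wins
        return node.attrs[key]
    if key not in INHERITED:         # e.g. desc: never looked up in groups
        return None
    group = node.group
    while group is not None:         # nearest group first, then its parents
        if key in group.attrs:
            return group.attrs[key]
        group = group.parent
    return None


root = Group("root", {"method": "transform", "desc": "all preprocessing"})
scalers = Group("scalers", {"processor": "StandardScaler"}, parent=root)
node = Node("scale_age", {"params": {"with_mean": False}}, group=scalers)

print(resolve(node, "processor"))  # StandardScaler (from the nearest group)
print(resolve(node, "method"))     # transform (from the parent group)
print(resolve(node, "params"))     # {'with_mean': False} (set on the node)
print(resolve(node, "desc"))       # None (the root's desc is not inherited)
```

Keeping desc out of the lookup also matches the note above: editing a description changes no resolved attribute, so no node would need rebuilding on that account.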

edges

edges defines what data a node receives and from where.

edges = {
    'X': [(None, ['feature1', 'feature2']),   # from DataSource
           ('stage1', None)],                  # all columns from stage1
    'y': [(None, 'target')],
}

- Keys name variable sets ('X', 'y', 'sample_weight', …)
- Each value is a list of (node_name, var_spec) pairs
- node_name=None means DataSource
- var_spec is None (all columns), a str (one column), a list (several columns), or a callable
- Multiple entries for the same key are concatenated column-wise
- Child nodes inherit the edges of their parent group and extend them
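To make these rules concrete, here is a rough sketch of how one edges dictionary could be turned into a node's inputs. It is an illustration under stated assumptions, not the library's code: tables are plain dicts mapping column names to value lists, `select`, `gather`, and `extend_edges` are hypothetical helpers, a callable var_spec is assumed to take the list of available column names and return the ones to keep, and "inherit and extend" is assumed to mean that the parent's entries come first with the child's appended per key.

```python
from typing import Callable, Dict, List, Optional, Tuple, Union

Table = Dict[str, list]  # column name -> values; stand-in for a data frame
VarSpec = Union[None, str, List[str], Callable[[List[str]], List[str]]]
Edge = Tuple[Optional[str], VarSpec]


def select(table: Table, spec: VarSpec) -> Table:
    """Pick columns from `table` according to `spec`."""
    if spec is None:                 # all columns
        names = list(table)
    elif isinstance(spec, str):      # a single column
        names = [spec]
    elif callable(spec):             # assumed: column names in, column names out
        names = list(spec(list(table)))
    else:                            # explicit list of columns
        names = list(spec)
    return {name: table[name] for name in names}


def gather(entries: List[Edge], source: Table, outputs: Dict[str, Table]) -> Table:
    """Concatenate the selected columns of every entry, left to right."""
    result: Table = {}
    for node_name, spec in entries:
        # node_name=None refers to the DataSource
        upstream = source if node_name is None else outputs[node_name]
        # Simplification: a repeated column name overwrites the earlier one.
        result.update(select(upstream, spec))
    return result


def extend_edges(parent: Dict[str, List[Edge]],
                 child: Dict[str, List[Edge]]) -> Dict[str, List[Edge]]:
    """Assumed merge rule: parent entries first, then the child's, per key."""
    merged = {key: list(entries) for key, entries in parent.items()}
    for key, entries in child.items():
        merged.setdefault(key, []).extend(entries)
    return merged


source = {"feature1": [1, 2], "feature2": [3, 4], "target": [0, 1]}
outputs = {"stage1": {"pca0": [0.1, 0.2]}}  # made-up output of an upstream Stage

group_edges = {"y": [(None, "target")]}
node_edges = {"X": [(None, ["feature1", "feature2"]), ("stage1", None)]}
edges = extend_edges(group_edges, node_edges)

inputs = {key: gather(entries, source, outputs) for key, entries in edges.items()}
print(list(inputs["X"]))  # ['feature1', 'feature2', 'pca0']
print(list(inputs["y"]))  # ['target']
```

The column name `pca0` and the split of edges between group and node are invented for the example; only the shape of the edges dictionary comes from the text above.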