Data Flow
Overview
Data moves through the pipeline from DataSource → Stage → Head, assembled at each node according to its edges definition.
For example, when a Head node C runs, its edges are resolved by pulling columns from the DataSource and/or from the outputs of the upstream Stage nodes A and B.
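The resolution step above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the function `resolve_edges`, the `(origin, column)` edge format, and all column names are assumptions made for the example.

```python
# Hypothetical sketch: names and structures are illustrative, not the library's API.

def resolve_edges(edges, source, stage_outputs):
    """Build a data_dict for one node from its edges definition.

    edges: maps an edge key (e.g. "X", "y") to a list of (origin, column) pairs,
           where origin is None for the DataSource or the name of an upstream Stage.
    source: dict of column name -> list of values (stands in for the DataSource).
    stage_outputs: dict of stage name -> dict of column name -> list of values.
    """
    data_dict = {}
    for key, refs in edges.items():
        columns = {}
        for origin, column in refs:
            # Pull the column either from the DataSource or from a Stage output.
            table = source if origin is None else stage_outputs[origin]
            columns[column] = table[column]
        data_dict[key] = columns
    return data_dict


source = {"age": [31, 45, 27], "target": [0, 1, 0]}
stage_outputs = {
    "stage_a": {"age_scaled": [-0.4, 1.4, -1.0]},
    "stage_b": {"age_bin": [1, 2, 0]},
}
# Head C takes its X from Stage A and Stage B, and its y from the DataSource.
head_c_edges = {
    "X": [("stage_a", "age_scaled"), ("stage_b", "age_bin")],
    "y": [(None, "target")],
}
head_c_input = resolve_edges(head_c_edges, source, stage_outputs)
```

Here `head_c_input` has one entry per edge key, each holding only the columns that the edges definition names.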
data_dict Structure
Each node receives a data_dict — a dict mapping edge keys to data.
In Experimenter
- train: full training fold data
- train_v: training fold filtered by output_var (for inner validation)
- valid: validation fold data
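As a rough illustration of these three type keys, the sketch below builds them for a single fold. The helper `build_fold_data` and the row format are hypothetical, and the filter rule is an assumption: the source only says that train_v is filtered by output_var, so the example guesses that rows with a missing output_var are dropped.

```python
# Hypothetical sketch: the helper name, row format, and filter rule are assumptions.

def build_fold_data(rows, train_idx, valid_idx, output_var):
    """Return the train / train_v / valid views for one fold."""
    train = [rows[i] for i in train_idx]
    # Assumption: "filtered by output_var" means keeping rows where it is present.
    train_v = [row for row in train if row.get(output_var) is not None]
    valid = [rows[i] for i in valid_idx]
    return {"train": train, "train_v": train_v, "valid": valid}


rows = [
    {"x": 1, "label": "a"},
    {"x": 2, "label": None},
    {"x": 3, "label": "b"},
    {"x": 4, "label": "a"},
]
fold = build_fold_data(rows, train_idx=[0, 1, 2], valid_idx=[3], output_var="label")
```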
In Trainer
Trainer has no inner fold: the data is split once by the splitter.
Cache
Experimenter uses an LRU cache (capacity-based, default 4 GB) to store Stage outputs. When a Stage node's output is requested by multiple downstream nodes, it is computed once and reused from cache.
Trainer shares the same cache instance with its parent Experimenter, using "train_all" as the type key to avoid collisions.
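The caching behaviour described above can be approximated with a small byte-budgeted LRU cache. This is a sketch under assumptions, not the project's cache: the class `ByteLRUCache`, the `get_or_compute` method, and the `(node, type key, fold)` key layout are all hypothetical, and the default treats 4 GB as 4 * 1024**3 bytes.

```python
from collections import OrderedDict
import sys


class ByteLRUCache:
    """Hypothetical capacity-based LRU cache; evicts least recently used entries."""

    def __init__(self, capacity_bytes=4 * 1024**3):  # assumed reading of "4 GB"
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # key -> (value, size in bytes)

    def get_or_compute(self, key, compute, size_of=sys.getsizeof):
        if key in self._items:
            # Cache hit: mark the entry as most recently used and reuse it.
            self._items.move_to_end(key)
            return self._items[key][0]
        value = compute()
        size = size_of(value)
        if size <= self.capacity_bytes:
            self._items[key] = (value, size)
            self.used_bytes += size
            # Evict from the least recently used end until the budget is met.
            while self.used_bytes > self.capacity_bytes:
                _, (_, evicted_size) = self._items.popitem(last=False)
                self.used_bytes -= evicted_size
        return value


calls = []

def compute_stage_a():
    calls.append("stage_a")
    return [1.0, 2.0, 3.0]

def size_of(value):
    return 8 * len(value)  # crude size estimate used only for this example

cache = ByteLRUCache(capacity_bytes=100)
# Two downstream nodes request the same Stage output: it is computed once.
first = cache.get_or_compute(("stage_a", "train", 0), compute_stage_a, size_of)
second = cache.get_or_compute(("stage_a", "train", 0), compute_stage_a, size_of)
# A different type key ("train_all") gets its own entry, so nothing collides.
third = cache.get_or_compute(("stage_a", "train_all", 0), compute_stage_a, size_of)
```

Putting the type key inside the cache key is one simple way to let a Trainer and an Experimenter share one cache instance without overwriting each other's entries.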
X-less Nodes
If a node's edges contains only 'y' and no 'X' (e.g. LabelEncoder), the 'y' data is used as the primary input. The processor receives the y array directly, and its output becomes the new y columns.
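A minimal sketch of this fallback, assuming a hypothetical `run_node` helper and a tiny stand-in encoder (not scikit-learn's LabelEncoder), might look like this. Storing the output back under the key it came from is also an assumption of the example.

```python
# Hypothetical sketch: run_node and SimpleLabelEncoder are illustrative names.

class SimpleLabelEncoder:
    """Tiny stand-in for a label encoder: maps sorted unique labels to integers."""

    def fit(self, values):
        self.classes_ = sorted(set(values))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, values):
        return [self._index[v] for v in values]


def run_node(processor, data_dict):
    # With no "X" edge, the "y" data becomes the primary input.
    primary_key = "X" if "X" in data_dict else "y"
    primary = data_dict[primary_key]
    output = processor.fit(primary).transform(primary)
    # Assumption: the output replaces the data under the key it was read from.
    return {**data_dict, primary_key: output}


result = run_node(SimpleLabelEncoder(), {"y": ["cat", "dog", "cat", "bird"]})
# result["y"] is [1, 2, 1, 0] because the sorted classes are bird, cat, dog.
```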