Skip to content

Data Flow

Overview

Data moves through the pipeline from DataSource → Stage → Head, assembled at each node according to its edges definition.

DataSource
Stage A  ──►  Stage B
              Head C

When Head C runs, its edges are resolved by pulling columns from DataSource and/or the outputs of Stage A, Stage B.

data_dict Structure

Each node receives a data_dict — a dict mapping edge keys to data.

In Experimenter

data_dict = {
    'X': ((X_train, X_train_v), X_valid),
    'y': ((y_train, y_train_v), y_valid),
}
  • train: full training fold data
  • train_v: training fold filtered by output_var (for inner validation)
  • valid: validation fold data

In Trainer

data_dict = {
    'X': (X_train, X_valid),
    'y': (y_train, y_valid),
}

No inner fold — data is split once by the splitter.

Cache

Experimenter uses an LRU cache (capacity-based, default 4 GB) to store Stage outputs. When a Stage node's output is requested by multiple downstream nodes, it is computed once and reused from cache.

Trainer shares the same cache instance with its parent Experimenter, using "train_all" as the type key to avoid collisions.

X-less Nodes

If a node's edges contains only 'y' and no 'X' (e.g. LabelEncoder), the 'y' data is used as the primary input. The processor receives the y array directly, and its output becomes the new y columns.