Processors¶
ml-labs provides a set of built-in sklearn-compatible processors for common preprocessing tasks.
Built-in Processors¶
CatConverter¶
Converts specified columns to a categorical dtype (pandas category / polars Categorical).
from mllabs.processor import CatConverter
CatConverter(columns=None) # None → convert all columns
CatConverter(columns=['city', 'gender'])
Output columns are the same as input columns with dtype changed. Supports pandas, polars, and numpy inputs.
CatOOVFilter¶
Filters out-of-vocabulary (OOV) values in categorical columns. Values not seen during fit() (or seen fewer than min_frequency times) are replaced with missing_value.
from mllabs.processor import CatOOVFilter
CatOOVFilter(
columns=None, # None → all columns
min_frequency=0, # values with count <= this are treated as OOV
missing_value=None, # replacement for OOV/missing values
treat_empty_string_as_missing=True,
)
Fitted attribute: categories_ — dict of {column: [allowed_values]}.
CatPairCombiner¶
Combines groups of categorical columns into new interaction columns by concatenating their string representations. Supports N-way (2 or more columns per group).
from mllabs.processor import CatPairCombiner
# pair (2-way)
CatPairCombiner(
pairs=[('city', 'gender'), ('age_bin', 'city')],
sep='__',
treat_empty_string_as_missing=True,
new_col_names=None, # None → 'city__gender', 'age_bin__city'
)
# N-way (3 or more columns)
CatPairCombiner(
pairs=[('city', 'gender', 'age_bin')],
sep='__',
)
# output: 'city__gender__age_bin'
Each group element can be a column name (str) or integer index (numpy). If any value in a group is missing, the output is None. Supports pandas, polars, and numpy inputs.
TypeConverter¶
Converts all columns to a target type. Useful for normalizing mixed-type columns before feeding into processors that require a specific dtype.
from mllabs.processor import TypeConverter
TypeConverter(to='str') # str, int, or float
TypeConverter(to='float')
| Backend | Behavior |
|---|---|
| pandas | df.astype(to) |
| polars | cast to pl.Utf8 / pl.Int64 / pl.Float64 |
| numpy | array.astype(to) |
get_feature_names_out is supported — column names are preserved after conversion.
FrequencyEncoder¶
Replaces each categorical value with its frequency (proportion or count) observed during fit().
from mllabs.processor import FrequencyEncoder
FrequencyEncoder(normalize=True) # True → proportion, False → count
Output column names are {col}_freq. Unseen values at transform time receive 0.
Polars-Optional Processors¶
Available only when polars is installed (pip install ml-labs[polars]).
PolarsLoader¶
Reads one or more CSV files into a Polars DataFrame with automatically optimized dtypes. Uses get_type_df internally to analyze data ranges.
from mllabs.processor import PolarsLoader
PolarsLoader(
predefined_types={'id': pl.Int64}, # override specific column types
read_method='read_csv', # any pl.read_* method name
)
# fit: accepts a file path (str) or list of file paths
# transform: returns pl.DataFrame with optimized schema
ExprProcessor¶
Applies a dict of Polars expressions to a DataFrame via with_columns or select.
from mllabs.processor import ExprProcessor
import polars as pl
ExprProcessor(
dict_expr={
'log_income': pl.col('income').log1p(),
'age_sq': pl.col('age') ** 2,
},
with_columns=True, # True → add/replace columns; False → select only these columns
)
When with_columns=True, output includes all original columns plus the new/replaced ones. When False, output contains only the expressions defined in dict_expr.
PandasConverter¶
Converts a Polars DataFrame to pandas. Useful as the final stage in a Polars-based pipeline.
from mllabs.processor import PandasConverter
PandasConverter(index_col=None) # optionally set a column as the DataFrame index
Type Utilities¶
Utilities for analyzing column types and generating optimal dtype mappings. Useful for preprocessing pipelines that need to minimize memory usage.
get_type_df(df)¶
Analyzes a Polars DataFrame (or LazyFrame) and returns a pd.DataFrame with per-column statistics and dtype-fit flags.
df_type = pl.scan_csv('train.csv').pipe(get_type_df)
# Columns: min, max, na, count, n_unique, dtype, f32, i32, i16, i8
# f32/i32/i16/i8: True if numeric values fit within that type's range
get_type_pl(df_type, ...) / get_type_pd(df_type, ...)¶
Convert the df_type analysis into a type mapping dict for Polars or pandas loading.
pl_types = get_type_pl(
df_type,
predefine={'id': pl.Int64}, # always override these
f32=True, # use Float32 where possible
i64=False, # use smallest int type that fits
cat_max=50, # columns with ≤50 unique values → Categorical
txt_cols=['description'], # force these columns to String
)
pd_types = get_type_pd(df_type, predefine={'id': 'int64'}, cat_max=50)
merge_type_df(dfs)¶
Merges df_type results from multiple files (e.g., sharded datasets) into a single summary. Takes the global min/max across files to determine safe dtype ranges.