SparkLinearFeatures

class sparklightautoml.pipelines.features.linear_pipeline.SparkLinearFeatures(feats_imp=None, top_intersections=5, max_bin_count=10, max_intersection_depth=3, subsample=None, sparse_ohe='auto', auto_unique_co=50, output_categories=True, multiclass_te_co=3, **_)

Bases: SparkFeaturesPipeline, SparkTabularDataFeatures

Creates a feature pipeline for linear models and neural networks.

Includes:

Create categorical intersections.

One-hot encoding (OHE) or embedding-index encoding for categories.

Other category-to-number encodings, if defined in the role parameters.

Standardization and NaN handling for numeric features.

Discretization of numeric features, if needed.

Dates handling.

Handling of probabilities (outputs of lower-level models).
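
The categorical intersection step can be pictured with a small plain-Python sketch. It is an illustration only and does not use the library's code: the helper names `intersection_candidates` and `cross_values`, the `"__"` separator, and the assumption that columns arrive ordered by importance are all hypothetical.

```python
from itertools import combinations


def intersection_candidates(cat_columns, top_intersections=5, max_intersection_depth=3):
    """List column combinations that could be crossed into new categorical features.

    Assumption: cat_columns is already ordered from most to least important.
    """
    top = list(cat_columns)[:top_intersections]
    candidates = []
    # Depth 2 means pairs, depth 3 means triples, and so on.
    for depth in range(2, max_intersection_depth + 1):
        candidates.extend(combinations(top, depth))
    return candidates


def cross_values(row, columns, sep="__"):
    """Build one intersection value by joining the row's category values."""
    return sep.join(str(row[c]) for c in columns)


row = {"city": "Oslo", "device": "mobile", "plan": "free"}
pairs = intersection_candidates(["city", "device", "plan"], max_intersection_depth=2)
crossed = [cross_values(row, cols) for cols in pairs]
```

Limiting both the number of source columns and the depth keeps the number of generated features small, since the count of combinations grows quickly with either value.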

__init__(feats_imp=None, top_intersections=5, max_bin_count=10, max_intersection_depth=3, subsample=None, sparse_ohe='auto', auto_unique_co=50, output_categories=True, multiclass_te_co=3, **_)

Parameters:

feats_imp (Optional[ImportanceEstimator]) – Feature importances mapping.
top_intersections (int) – Maximum number of categorical features used to generate intersections.
max_bin_count (int) – Maximum number of bins used to discretize numeric features.
max_intersection_depth (int) – Maximum depth of categorical intersections.
subsample (Union[int, float, None]) – Subsample used to calculate data statistics.
sparse_ohe (Union[str, bool]) – Whether to output a sparse representation when OHE is used for categorical handling.
auto_unique_co (int) – Cardinality cutoff; switch to target encoding for high-cardinality categories.
output_categories (bool) – Whether to output encoded categories or embedding indices.
multiclass_te_co (int) – Cutoff on the number of classes that governs whether target encoding is used in categorical handling on multiclass tasks.
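
As a rough sketch of how the two cutoffs might interact, the function below routes a categorical column to an encoding. This is not the library's logic: the function name `choose_category_encoding`, the strict `>` comparison, the `<=` class-count check, and the fallback to OHE are all assumptions made for illustration.

```python
def choose_category_encoding(n_unique, task="binary", n_classes=2,
                             auto_unique_co=50, multiclass_te_co=3):
    """Pick an encoding name for one categorical column (hypothetical helper).

    Assumptions: target encoding is considered only above the cardinality
    cutoff, and on multiclass tasks only when the class count does not
    exceed multiclass_te_co. Everything else falls back to OHE.
    """
    te_allowed = task != "multiclass" or n_classes <= multiclass_te_co
    if n_unique > auto_unique_co and te_allowed:
        return "target_encoding"
    return "ohe"


low_card = choose_category_encoding(n_unique=10)
high_card = choose_category_encoding(n_unique=500)
many_classes = choose_category_encoding(n_unique=500, task="multiclass", n_classes=20)
```

The motivation for a cardinality cutoff is that one-hot encoding a column with hundreds of distinct values produces a very wide feature matrix, whereas target encoding yields a fixed, small number of columns.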

create_pipeline(train)

Create linear pipeline.

Parameters: train (SparkDataset) – Dataset with train features.
Return type: Union[SparkBaseEstimator, SparkBaseTransformer, SparkUnionTransformer, SparkSequentialTransformer]
Returns: Transformer.
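
The return type suggests that the pipeline is composed from sequential and union building blocks. The toy classes below sketch that composition pattern in plain Python on a dict of columns. They are hypothetical stand-ins, not SparkSequentialTransformer or SparkUnionTransformer, and unlike a real fitted pipeline they compute statistics from whatever data they are given.

```python
from statistics import mean, pstdev


class FillNaN:
    """Replace missing values (None) with the column mean."""

    def transform(self, columns):
        out = {}
        for name, values in columns.items():
            fill = mean(v for v in values if v is not None)
            out[name] = [fill if v is None else v for v in values]
        return out


class Standardize:
    """Shift each column to zero mean and scale to unit variance."""

    def transform(self, columns):
        out = {}
        for name, values in columns.items():
            mu, sigma = mean(values), pstdev(values)
            out[name] = [(v - mu) / sigma if sigma else 0.0 for v in values]
        return out


class NaNFlag:
    """Emit a 0/1 indicator column marking missing values."""

    def transform(self, columns):
        return {f"{name}__isnan": [int(v is None) for v in values]
                for name, values in columns.items()}


class SequentialStage:
    """Apply stages one after another, feeding each output to the next."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, columns):
        for stage in self.stages:
            columns = stage.transform(columns)
        return columns


class UnionStage:
    """Apply every branch to the same input and merge the output columns."""

    def __init__(self, branches):
        self.branches = branches

    def transform(self, columns):
        merged = {}
        for branch in self.branches:
            merged.update(branch.transform(columns))
        return merged


pipeline = UnionStage([
    SequentialStage([FillNaN(), Standardize()]),
    NaNFlag(),
])
features = pipeline.transform({"age": [20.0, None, 40.0]})
```

Here the sequential branch fills the missing age with the mean and then standardizes it, while the union adds the indicator column alongside the scaled one.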