SparkTabularAutoML

class sparklightautoml.automl.presets.tabular_presets.SparkTabularAutoML(spark, task, timeout=3600, memory_limit=16, cpu_limit=4, gpu_ids='all', timing_params=None, config_path=None, general_params=None, reader_params=None, read_csv_params=None, nested_cv_params=None, tuning_params=None, selection_params=None, lgb_params=None, cb_params=None, linear_l2_params=None, gbm_pipeline_params=None, linear_pipeline_params=None, persistence_manager=None, computation_settings=('no_parallelism', -1))[source]

Bases: SparkAutoMLPreset

Spark version of TabularAutoML. Represents the high-level entity of SparkLightAutoML. Use this class to create an AutoML instance.

Example

>>> automl = SparkTabularAutoML(
...     spark=spark,
...     task=SparkTask('binary'),
...     general_params={"use_algos": [["lgb"]]},
...     lgb_params={'use_single_dataset_mode': True},
...     reader_params={"cv": cv, "advanced_roles": False}
... )
>>> oof_predictions = automl.fit_predict(
...     train_data,
...     roles=roles
... )
create_automl(**fit_args)[source]

Create basic automl instance.

Parameters:

**fit_args – Contains all the information needed to create the AutoML instance.

fit_predict(train_data, roles=None, train_features=None, cv_iter=None, valid_data=None, valid_features=None, log_file=None, verbose=0, persistence_manager=None)[source]

Fit and get predictions on the validation dataset.

Almost the same as lightautoml.automl.base.AutoML.fit_predict.

Additional feature: works with different data formats. Currently supported:

  • Path to .csv, .parquet, .feather files.

  • ndarray, or dict of ndarray. For example, {'data': X...}. In this case, roles are optional, but train_features and valid_features are required.

  • pandas.DataFrame.

Parameters:
  • train_data (Union[str, DataFrame]) – Dataset to train.

  • roles (Optional[dict]) – Roles dict.

  • train_features (Optional[Sequence[str]]) – Optional feature names, if they cannot be inferred from train_data.

  • cv_iter (Optional[Iterable]) – Custom cv-iterator. For example, TimeSeriesIterator.

  • valid_data (Union[str, DataFrame, None]) – Optional validation dataset.

  • valid_features (Optional[Sequence[str]]) – Optional validation dataset feature names, if they cannot be inferred from valid_data.

  • verbose (int) – Controls the verbosity: the higher, the more messages. <1: messages are not displayed; >=1: the computation process for layers is displayed; >=2: information about folds processing is also displayed; >=3: the hyperparameters optimization process is also displayed; >=4: the training process for every algorithm is displayed.

  • log_file (Optional[str]) – Filename for writing logging messages. If log_file is specified, the messages will be saved in the file. If the file exists, it will be overwritten.

Return type:

SparkDataset

Returns:

Dataset with predictions. Call .data to get predictions array.
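A minimal usage sketch (not taken from the library docs): the SparkSession spark, the train_data DataFrame, its "target" column and the log file name are assumptions for illustration.

>>> from sparklightautoml.tasks.base import SparkTask
>>> automl = SparkTabularAutoML(
...     spark=spark,
...     task=SparkTask('binary'),
... )
>>> oof_preds = automl.fit_predict(
...     train_data,
...     roles={"target": "target"},
...     verbose=2,
...     log_file="automl_fit.log",
... )
>>> oof_preds.data.show(5)

Here verbose=2 prints layer and fold progress, log_file is overwritten if it already exists, and .data is expected to expose the predictions as a Spark DataFrame (hence .show(5)).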

static get_pdp_data_numeric_feature(df, feature_name, model, prediction_col, n_bins, ice_fraction=1.0, ice_fraction_seed=42)[source]

Returns grid, ys and counts calculated on the input numeric column, for plotting a partial dependence plot (PDP).

Parameters:
  • df (SparkDataFrame) – Spark DataFrame with feature_name column

  • feature_name (str) – feature column name

  • model (PipelineModel) – Spark Pipeline Model

  • prediction_col (str) – prediction column to be created by the model

  • n_bins (int) – The number of bins to produce. Raises exception if n_bins < 2.

  • ice_fraction (float, optional) – What fraction of the input dataframe will be used to make predictions. Useful for very large dataframes. Defaults to 1.0.

  • ice_fraction_seed (int, optional) – Seed for ice_fraction. Defaults to 42.

Returns:

grid is a list of grid values over the feature range, ys is a list of predictions per grid value, counts is the number of rows per grid value

Return type:

Tuple[List, List, List]
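A hedged sketch of building a PDP curve from the returned values; test_df, the feature name "age", the fitted pipeline_model and the assumption that each element of ys is an array of predictions are illustrative, not part of the documented API.

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> grid, ys, counts = SparkTabularAutoML.get_pdp_data_numeric_feature(
...     df=test_df,
...     feature_name="age",
...     model=pipeline_model,
...     prediction_col="prediction",
...     n_bins=10,
...     ice_fraction=0.1,
... )
>>> plt.plot(grid, [float(np.mean(y)) for y in ys])
>>> plt.xlabel("age")
>>> plt.ylabel("mean prediction")
>>> plt.show()

Lowering ice_fraction trades plot fidelity for faster inference on large datasets.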

static get_pdp_data_categorical_feature(df, feature_name, model, prediction_col, n_top_cats, ice_fraction=1.0, ice_fraction_seed=42)[source]

Returns grid, ys and counts calculated on the input categorical column, for plotting a partial dependence plot (PDP).

Parameters:
  • df (SparkDataFrame) – Spark DataFrame with feature_name column

  • feature_name (str) – feature column name

  • model (PipelineModel) – Spark Pipeline Model

  • prediction_col (str) – prediction column to be created by the model

  • n_top_cats (int) – number of top categories to select

  • ice_fraction (float, optional) – What fraction of the input dataframe will be used to make predictions. Useful for very large dataframes. Defaults to 1.0.

  • ice_fraction_seed (int, optional) – Seed for ice_fraction. Defaults to 42.

Returns:

grid is a list of categories, ys is a list of predictions per category, counts is the number of rows per category

Return type:

Tuple[List, List, List]
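An illustrative call for a categorical column; test_df, the feature name "occupation" and pipeline_model are assumptions, and the aggregation assumes each element of ys is array-like.

>>> import numpy as np
>>> grid, ys, counts = SparkTabularAutoML.get_pdp_data_categorical_feature(
...     df=test_df,
...     feature_name="occupation",
...     model=pipeline_model,
...     prediction_col="prediction",
...     n_top_cats=10,
...     ice_fraction=0.5,
...     ice_fraction_seed=42,
... )
>>> for cat, y, cnt in zip(grid, ys, counts):
...     print(cat, float(np.mean(y)), cnt)

A bar chart of the mean prediction per category, annotated with counts, is a common way to visualize the result.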

static get_pdp_data_datetime_feature(df, feature_name, model, prediction_col, datetime_level, reader, ice_fraction=1.0, ice_fraction_seed=42)[source]

Returns grid, ys and counts calculated on the input datetime column, for plotting a partial dependence plot (PDP).

Parameters:
  • df (SparkDataFrame) – Spark DataFrame with feature_name column

  • feature_name (str) – feature column name

  • model (PipelineModel) – Spark Pipeline Model

  • prediction_col (str) – prediction column to be created by the model

  • datetime_level (str) – Unit of time that will be modified to calculate dependence: “year”, “month” or “dayofweek”

  • reader – AutoML reader used to transform the input dataframe before model inference.

  • ice_fraction (float, optional) – What fraction of the input dataframe will be used to make predictions. Useful for very large dataframes. Defaults to 1.0.

  • ice_fraction_seed (int, optional) – Seed for ice_fraction. Defaults to 42.

Returns:

grid is a list of values of the chosen datetime level (e.g., months), ys is a list of predictions per value, counts is the number of rows per value

Return type:

Tuple[List, List, List]
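A hedged sketch for a datetime column; test_df, the feature name "signup_date", pipeline_model and the use of automl.reader (the reader fitted during fit_predict) are assumptions for illustration.

>>> grid, ys, counts = SparkTabularAutoML.get_pdp_data_datetime_feature(
...     df=test_df,
...     feature_name="signup_date",
...     model=pipeline_model,
...     prediction_col="prediction",
...     datetime_level="month",
...     reader=automl.reader,
...     ice_fraction=0.25,
... )

With datetime_level="month", grid enumerates months, and ys/counts are the corresponding predictions and row counts.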