SparkDataset
- class sparklightautoml.dataset.base.SparkDataset(data, roles, persistence_manager=None, task=None, bucketized=False, dependencies=None, name=None, target=None, folds=None, **kwargs)[source]
Bases: LAMLDataset, Unpersistable
Implements a dataset that uses a pyspark.sql.DataFrame internally, stores some internal state (features, roles, …), and provides methods to work with the dataset.
- classmethod concatenate(datasets, name=None, extra_dependencies=None)[source]
Concatenate multiple datasets by joining their internal pyspark.sql.DataFrame objects with an inner join on the special hidden '_id' column.
- Parameters:
datasets (Sequence[SparkDataset]) – Spark datasets to be joined.
- Return type:
SparkDataset
- Returns:
a joined dataset containing the features (and columns) of all input datasets, with only a single '_id' column
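The join semantics can be sketched in plain Python (an illustrative stand-in, not the library's implementation, which joins pyspark.sql.DataFrame objects): each dataset carries a hidden '_id' column, concatenation is an inner join on it, so only rows whose '_id' appears in every dataset survive, and the result keeps a single '_id' column.

```python
def concatenate_rows(datasets):
    """Inner-join lists of row dicts on their shared '_id' key.

    Illustrative analogue of SparkDataset.concatenate; the real method
    operates on Spark DataFrames, not Python lists.
    """
    # Index every dataset by its '_id' value.
    indexed = [{row["_id"]: row for row in ds} for ds in datasets]
    # Inner join: keep only ids present in every dataset.
    common_ids = set(indexed[0])
    for idx in indexed[1:]:
        common_ids &= set(idx)
    joined = []
    for _id in sorted(common_ids):
        merged = {"_id": _id}  # a single '_id' column in the result
        for idx in indexed:
            merged.update({k: v for k, v in idx[_id].items() if k != "_id"})
        joined.append(merged)
    return joined


left = [{"_id": 1, "f1": 0.5}, {"_id": 2, "f1": 0.7}, {"_id": 3, "f1": 0.9}]
right = [{"_id": 1, "f2": "a"}, {"_id": 2, "f2": "b"}]
print(concatenate_rows([left, right]))
```

Note that the row with `_id` 3 is dropped: an inner join keeps only ids present in all inputs.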
- property features
Get list of features.
- Returns:
list of features.
- property roles
Roles dict.
- set_data(data, features, roles=None, persistence_manager=None, dependencies=None, uid=None, name=None, frozen=False)[source]
Set data, features, and roles in place on an empty dataset.
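The in-place contract can be sketched as a minimal plain-Python analogue (attribute names here are illustrative; the real method also wires up the persistence manager, dependencies, uid, and name):

```python
class _DatasetSketch:
    """Minimal analogue of an empty dataset that set_data fills in place."""

    def __init__(self):
        self.data = None   # in the library this would be a pyspark.sql.DataFrame
        self.features = []
        self.roles = {}

    def set_data(self, data, features, roles=None):
        # Mutates this instance rather than returning a new dataset.
        self.data = data
        self.features = list(features)
        self.roles = dict(roles or {})


ds = _DatasetSketch()
ds.set_data(data=[(1, 0.5)], features=["f1"], roles={"f1": "numeric"})
print(ds.features)  # ['f1']
```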
- persist(level=None, force=False)[source]
Materializes the current Spark DataFrame and unpersists all of its dependencies.
- Parameters:
level (Optional[PersistenceLevel]) – persistence level to use
- Return type:
SparkDataset
- Returns:
a new SparkDataset that is persisted and materialized
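The persist-then-release pattern can be sketched like this (plain Python with illustrative names; in the library, materialization triggers a Spark action, the dependencies are other datasets, and persist returns a new SparkDataset rather than mutating in place):

```python
class _PersistableSketch:
    """Minimal analogue of persist(): materialize this node, then
    unpersist every upstream dependency so intermediate caches are freed."""

    def __init__(self, name, dependencies=()):
        self.name = name
        self.dependencies = list(dependencies)
        self.persisted = False

    def persist(self):
        self.persisted = True            # materialize this dataset
        for dep in self.dependencies:    # release intermediate results
            dep.unpersist()
        return self

    def unpersist(self):
        self.persisted = False


raw = _PersistableSketch("raw")
features = _PersistableSketch("features", dependencies=[raw])
features.persist()
print(features.persisted, raw.persisted)  # True False
```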