
I am trying to create an automatic pipeline builder functionality that takes into account a large set of conditions such as the existence of missing values, the scale of numerical features, etc., and automatically creates a scikit-learn pipeline instead of having to manually create them every time.

I'm aware of the pipeline.steps.append() functionality that allows new pipeline steps to be assigned dynamically. However, it does not seem possible to initialize an empty pipeline to start appending to; the following yields an error:

    from sklearn.pipeline import Pipeline

    pipe = Pipeline([])

This returns ValueError: not enough values to unpack (expected 2, got 0).
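
For reference, appending to a pipeline that already contains at least one step does work; a minimal example (the step names are just placeholders):

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    # Start from a pipeline with one step, then append another one dynamically;
    # pipe.steps is a plain Python list, so append() works here
    pipe = Pipeline([('numerical_scaler', StandardScaler())])
    pipe.steps.append(('categorical_encoder', OneHotEncoder()))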

I also tried passing if conditions directly to the pipeline steps in the following way, again without success:

    pipe = Pipeline([
        ('numerical_scaler', StandardScaler(), num_columns_to_scale) if num_columns_to_scale,
        ('categorical_encoder', OneHotEncoder(), cat_columns_to_encode) if cat_columns_to_encode
    ])

This returns SyntaxError: invalid syntax.

What would be the best way to create such auto-pipelining functionality? As a dirty workaround I could obviously write a huge collection of if-else conditions to build the pipelines that way, but that is particularly error prone and difficult to maintain.

Edit: The point of this auto-pipeline functionality is to speed up the creation of custom pipelines. Ideally I want to input a dataset, specify the target(s), and let the algorithm create a custom pipeline for the given dataset.
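
For illustration, this is roughly the kind of interface I have in mind (the auto_pipeline function and the column name are hypothetical):

    # auto_pipeline would inspect the data and assemble the steps automatically
    pipe = auto_pipeline(df, target="price")
    pipe.fit(df.drop(columns=["price"]), df["price"])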


    2 Answers


    It is still unclear what you want to do and why; if you add more context, I will try to help.

    You could solve the second point with a ColumnTransformer (built here via make_column_transformer):

        from sklearn.pipeline import Pipeline
        from sklearn.compose import make_column_transformer, make_column_selector as selector
        from sklearn.impute import SimpleImputer
        from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder
        from sklearn.linear_model import LogisticRegression

        # Numerical columns: impute with the median, then bin into one-hot encoded intervals
        numeric_transformer = Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("binning", KBinsDiscretizer(encode="onehot-dense", strategy="kmeans")),
        ])

        # Categorical columns: impute with a constant, then one-hot encode
        categorical_transformer = Pipeline([
            ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
            ("encoding", OneHotEncoder(handle_unknown="ignore")),
        ])

        # Route columns to the right transformer based on their dtype
        preprocessor = make_column_transformer(
            (numeric_transformer, selector(dtype_exclude="object")),
            (categorical_transformer, selector(dtype_include="object")),
        )

        pipeline = Pipeline([
            ("preprocessing", preprocessor),
            ("model", LogisticRegression()),
        ])
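
    Assuming X_train, y_train and X_test hold your data, the assembled pipeline is then fitted and used like any other estimator:

        pipeline.fit(X_train, y_train)
        predictions = pipeline.predict(X_test)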
    • I know about make_column_selector but this solution doesn't allow the same flexibility. I would like to have the option to specify the columns to apply OneHotEncoding on, not just blindly assume that all object-type features are to be one-hot encoded. I added a bit of context to the post as well.
      – lazarea, Feb 25, 2022 at 16:04

    Pipeline's input is a list of steps, so you can dynamically create this list and then feed it into the Pipeline object. Since steps that specify columns are (name, transformer, columns) triples, they belong in a ColumnTransformer, which then becomes a single step of the Pipeline:

        from sklearn.pipeline import Pipeline
        from sklearn.compose import ColumnTransformer
        from sklearn.preprocessing import StandardScaler, OneHotEncoder

        names = ['numerical_scaler', 'categorical_encoder']
        transformers = [StandardScaler(), OneHotEncoder()]
        columns = [num_columns_to_scale, cat_columns_to_encode]

        # Keep only the transformers whose column list is non-empty
        selected = []
        for name, transformer, cols in zip(names, transformers, columns):
            if len(cols) > 0:
                selected.append((name, transformer, cols))

        pipe = Pipeline([('preprocessing', ColumnTransformer(selected))])
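
    Going one step further, the same idea can be wrapped into a small builder function that inspects the DataFrame and assembles the steps automatically. A minimal sketch, assuming pandas dtypes are a good enough proxy for numerical vs. categorical columns (the function name and the heuristics are illustrative assumptions, not a fixed recipe):

        import pandas as pd
        from sklearn.pipeline import Pipeline
        from sklearn.compose import ColumnTransformer
        from sklearn.impute import SimpleImputer
        from sklearn.preprocessing import StandardScaler, OneHotEncoder

        def build_auto_pipeline(df: pd.DataFrame, target: str, estimator) -> Pipeline:
            # Split features from the target and detect column types from dtypes
            X = df.drop(columns=[target])
            num_cols = X.select_dtypes(include="number").columns.tolist()
            cat_cols = X.select_dtypes(exclude="number").columns.tolist()

            transformers = []
            if num_cols:
                num_steps = []
                # Only add an imputer if the numerical columns actually contain missing values
                if X[num_cols].isna().any().any():
                    num_steps.append(("imputer", SimpleImputer(strategy="median")))
                num_steps.append(("scaler", StandardScaler()))
                transformers.append(("numerical", Pipeline(num_steps), num_cols))
            if cat_cols:
                transformers.append(("categorical", OneHotEncoder(handle_unknown="ignore"), cat_cols))

            return Pipeline([
                ("preprocessing", ColumnTransformer(transformers)),
                ("model", estimator),
            ])

    You would then call something like build_auto_pipeline(df, target="price", estimator=LogisticRegression()) and fit the returned pipeline as usual; the column lists could just as well be passed in explicitly instead of derived from dtypes, which addresses the flexibility concern from the comment above.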
