Slep007 - feature names, their generation and the API #17

Merged
merged 21 commits into from
Feb 14, 2020
Changes from 3 commits
57 changes: 57 additions & 0 deletions slep007/proposal.rst
@@ -0,0 +1,57 @@
.. _slep_007:

=============
Feature Names
=============

`scikit-learn/#13307 <https://github.com/scikit-learn/scikit-learn/pull/13307>`_
proposes a solution to support and propagate feature names through a pipeline.

However, the generated feature names can become long. This SLEP is meant to
settle how those generated names should behave. The current proposal is::


    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.compose import make_column_transformer
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_selection import SelectKBest
    import pandas as pd

    iris = load_iris()
    X = pd.DataFrame(iris.data, columns=iris.feature_names)
    pipe = make_pipeline(StandardScaler(), PCA(), SelectKBest(k=2),
                         LogisticRegression())
    pipe.fit(X, iris.target)
    pipe[-1].input_features_
    > array(['pca0', 'pca1'], dtype='<U4')


    # I have to duplicate StandardScaler if I want to be able to
    # use the pandas column names in the column transformer. Yuk!
Member

Couldn't you put the standard scaler in front of the column transformer here?

Member Author

I think the idea is that "the user could do this", just to showcase what happens if they do. Not that it's the best way to do it.

Member

No, you cannot. As the code comment suggests, this proposal is about creating feature names, not about "pandas in, pandas out". StandardScaler will therefore output a NumPy array, so ColumnTransformer cannot select columns by name. A workaround would be to use column indices or a boolean mask in the column transformer (explicitly relying on the knowledge that StandardScaler preserves the columns).
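
For illustration, the index-based workaround could look roughly like the sketch below. It is not part of the proposal and makes no claim about which feature names would be generated. It only uses existing scikit-learn calls: integer column lists and the `"passthrough"` option of `ColumnTransformer`. Feeding the raw `iris.data` array instead of a DataFrame, and setting `max_iter=1000`, are choices made for this sketch.

```python
from sklearn.compose import make_column_transformer
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data  # plain array; by default StandardScaler returns an array anyway

# Scale once up front, then address columns by position. This relies on
# StandardScaler keeping the number and order of columns unchanged.
pipe = make_pipeline(
    StandardScaler(),
    make_column_transformer(
        (PCA(), [0, 1, 2, 3]),      # PCA over all four scaled columns
        ("passthrough", [0, 1]),    # keep the first two scaled columns as-is
    ),
    SelectKBest(k=2),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X, iris.target)
```

The trade-off is the one noted above: positions replace the pandas column names, so the pipeline silently depends on the column order of the input.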

    # also there's no easy way to pass through the original columns with
    # ColumnTransformer - FunctionTransformer would remove feature names!
    pipe = make_pipeline(
        make_column_transformer(
            (make_pipeline(StandardScaler(), PCA()), X.columns),
            (StandardScaler(), X.columns[:2])),
        SelectKBest(k=2), LogisticRegression())
    pipe.fit(X, iris.target)
    pipe[-1].input_features_
    > array(['pipeline__pca0', 'standardscaler__sepal length (cm)'], dtype='<U33')

    pipe = make_pipeline(
        make_column_transformer(
            (PCA(), X.columns),
            (StandardScaler(), X.columns[:2])),
        SelectKBest(k=2), LogisticRegression())
    pipe.fit(X, iris.target)
    pipe[-1].input_features_
    > array(['pca__pca0', 'standardscaler__sepal length (cm)'], dtype='<U33')

Is that what we want (apart from changing the result to object dtype)?
The first result seems good to me; the other two seem a bit long, but I am not
sure how to do better.
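
The outputs above are consistent with a simple naming rule: PCA emits ``pca0``, ``pca1``, ..., a pipeline passes names through unchanged, and the column transformer prefixes every name with ``<transformer name>__``. The pure-Python sketch below mimics that rule to show where the length comes from. The helper names ``pca_names`` and ``column_transformer_names`` are hypothetical, exist only for this illustration, and are not scikit-learn API.

```python
def pca_names(n_components):
    """Names following the 'pca0', 'pca1' pattern seen in the outputs above."""
    return [f"pca{i}" for i in range(n_components)]


def column_transformer_names(named_outputs):
    """Prefix each sub-transformer's feature names with '<name>__'."""
    return [f"{name}__{feature}"
            for name, features in named_outputs
            for feature in features]


iris_columns = ["sepal length (cm)", "sepal width (cm)",
                "petal length (cm)", "petal width (cm)"]

# Second example above: a (StandardScaler, PCA) pipeline on all columns plus
# a StandardScaler on the first two. Only the column transformer adds a
# prefix; the inner pipeline contributes no 'pca__' level of its own.
names = column_transformer_names([
    ("pipeline", pca_names(len(iris_columns))),
    ("standardscaler", iris_columns[:2]),
])
longest = max(names, key=len)
print(names)
print(longest, len(longest))
```

Under this rule the longest name is ``standardscaler__sepal length (cm)`` at 33 characters, which matches the ``<U33`` dtype shown above. Each additional level of prefixing would add the transformer name plus two underscores.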