-
-
Notifications
You must be signed in to change notification settings - Fork 34
Slep007 - feature names, their generation and the API #17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 3 commits
Commits
Show all changes
21 commits
Select commit
Hold shift + click to select a range
51fc476
initial 007
adrinjalali 28d7b84
fix typo
adrinjalali 3fcbefa
fix code block
adrinjalali e13bc47
rewrite the SLEP
adrinjalali e9ca87b
clarify the flexibility on metaestimators
adrinjalali c56dbe9
add motivation and clarifications
adrinjalali 6cfe3c8
apply more comments
adrinjalali 3e77b4f
address andy's comments
adrinjalali ff5a991
add examples
adrinjalali 9d380da
add redundant prefix example, clarify O(1) issue
adrinjalali c5659bf
Merge remote-tracking branch 'upstream/master' into slep007
adrinjalali d0bb0e6
put slep under review
adrinjalali 9b24545
address Nicolas's suggestions
adrinjalali 081ed93
Update slep007/proposal.rst
adrinjalali 1baea78
change the title
adrinjalali 51de1f7
Merge branch 'slep007' of github.com:adrinjalali/enhancement_proposal…
adrinjalali 50c6538
shorted example
adrinjalali c37d9d6
address Nicolas's comments, remove onetoone mapping
adrinjalali 17fe3d7
address Nicolas's comments
adrinjalali 6b5533c
trying to address Guillaume's comments
adrinjalali 4ed249c
imagine -> include
adrinjalali File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
.. _slep_007: | ||
|
||
============= | ||
Feature Names | ||
============= | ||
|
||
`scikit-learn/#13307 <https://github.com/scikit-learn/scikit-learn/pull/13307>`_ | ||
proposes a solution to support and propagate feature names through a pipeline. | ||
|
||
However, the generated feature names can become long. This is to confirm how | ||
we want those generated feature names to behave, and this is the proposal:: | ||
|
||
|
||
from sklearn.pipeline import make_pipeline | ||
from sklearn.preprocessing import StandardScaler | ||
from sklearn.decomposition import PCA | ||
from sklearn.compose import make_column_transformer | ||
from sklearn.datasets import load_iris | ||
from sklearn.linear_model import LogisticRegression | ||
from sklearn.feature_selection import SelectKBest | ||
import pandas as pd | ||
|
||
iris = load_iris() | ||
X = pd.DataFrame(iris.data, columns=iris.feature_names) | ||
pipe = make_pipeline(StandardScaler(), PCA(), SelectKBest(k=2), | ||
LogisticRegression()) | ||
pipe.fit(X, iris.target) | ||
pipe[-1].input_features_ | ||
> array(['pca0', 'pca1'], dtype='<U4') | ||
|
||
|
||
# I have to duplicate StandardScaler if I want to be able to | ||
# use the pandas column names in the column transformer. Yuk! | ||
# also there's no easy way to pass through the original columns with | ||
# ColumnTransformer - FunctionTransformer would remove feature names! | ||
pipe = make_pipeline(make_column_transformer((make_pipeline(StandardScaler(), | ||
PCA()), X.columns), | ||
(StandardScaler(), | ||
X.columns[:2])), | ||
SelectKBest(k=2), LogisticRegression()) | ||
pipe.fit(X, iris.target) | ||
pipe[-1].input_features_ | ||
``` | ||
> array(['pipeline__pca0', 'standardscaler__sepal length (cm)'], dtype='<U33') | ||
|
||
pipe = make_pipeline(make_column_transformer((PCA(), X.columns), | ||
(StandardScaler(), | ||
X.columns[:2])), | ||
SelectKBest(k=2), LogisticRegression()) | ||
pipe.fit(X, iris.target) | ||
pipe[-1].input_features_ | ||
|
||
> array(['pca__pca0', 'standardscaler__sepal length (cm)'], dtype='<U33') | ||
|
||
Is that what we want? (apart from changing to object dtype lol) | ||
The first one seems good to me, the others seem a bit long? Not sure how to do | ||
better though. |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't you put the standard scaler in front of the column transformer here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea is that "the user could do this", just to showcase what happens if they do. Not that it's the best way to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you can not, as the comment suggests: this proposal is about creating feature names, not doing "pandas in, pandas out". StandardScaler will therefore create a numpy array as output, so ColumnTransformer can not use column names. A work-around would be to use column indices in the column transformer, or a boolean mask (explicitly using knowledge that StandardScaler will preserve columns).