-
-
Notifications
You must be signed in to change notification settings - Fork 34
Slep007 - feature names, their generation and the API #17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
My biggest concern right now is what the feature names should actually be, and I think my doubts are mostly around creating new feature names within pipelines and column transformer. |
I don't like the Maybe explaining better the mechanism for feature names would make it clearer? There I talked about three cases: a) The transformer does a non-trivial column transformation, like OneHotEncoder or feature selection I guess a slight confusion I have right now is because there is another case, which is the My question right now is: do we always prepend the estimator name in ColumnTransformer, even if it generates things like |
regarding the redundancy of the feature names, we could
|
Not sure I really get the 2nd. |
I've been playing with dask and pandas on very large datasets with many inconsistencies. The experience has really reminded me that code that does not have very consistent behavior leads to nasty bugs. |
slep007/proposal.rst
Outdated
|
||
|
||
# I have to duplicate StandardScaler if I want to be able to | ||
# use the pandas column names in the column transformer. Yuk! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't you put the standard scaler in front of the column transformer here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea is that "the user could do this", just to showcase what happens if they do. Not that it's the best way to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you can not, as the comment suggests: this proposal is about creating feature names, not doing "pandas in, pandas out". StandardScaler will therefore create a numpy array as output, so ColumnTransformer can not use column names. A work-around would be to use column indices in the column transformer, or a boolean mask (explicitly using knowledge that StandardScaler will preserve columns).
more like: feature_names = []
for each transformer:
feature_names.append(transformer.get_names())
if len(set(feature_names)) == len(feature_names) or not enforce_unique_names:
return feature_names
else:
append_prefixes_to_feature_names()
something along the lines of: feature_names = {}
for each transformer:
feature_names[transformer] = transformer.get_names()
if fname_trasformer is not None:
return fname_transformer(feature_names) |
@GaelVaroquaux I'm not sure I understand what your comment about inconsistent behavior is about. Was that re @adrinjalali's suggestion of sometimes not appending the name? |
@adrinjalali The second one doesn't really make sense the way you wrote it, I think. If you want to replicate the prefix behavior, I don't like 1) because it means that a feature name of an existing feature can change when I add additional features. That's quite counter-intuitive. |
@GaelVaroquaux I'm not sure I understand what your comment about inconsistent behavior is about. Was that re @adrinjalali's suggestion of sometimes not appending the name?
Yes.
|
No, you can not, as the comment suggests: this proposal is about creating feature names, not doing "pandas in, pandas out". StandardScaler will therefore create a numpy array as output, so ColumnTransformer can not use column names.
Point taken. Thank you.
|
I'm a bit annoyed at the duplication of the standard scaler, but I feel it will be hard to make this work better without blowing up the scope of the proposal, and it highlights a shortcoming of the current design; the shortcoming is unrelated to the feature name proposal, though, it's "only" related to dealing with pandas data frames. |
Those are the most reasonable options I can think of.
Should we then have a parameter like
Yes, exactly, we can even get it from the |
parameter to what? each estimator? just the ColumnTransformer? global configuration (errrr probably not if column names will be used in actual behavior). |
A parameter used by What we can do is to expose a function |
ok so you mean adding a parameter to the constructor. I don't understand your argument about having to pass a dict. It's a constructor argument for a single estimator, right? So I know exactly which estimator I'm applying it to. |
If we accept the extra parameter to the constructor, then we don't need the dict. The dict was the alternative: if we reject the proposal to have the constructor argument, and want to have the policy argument only available to a few objects such as ColumnTransformer, Pipeline, etc, then we need the dict for a more granular control over the policies of each underlying object. To be clear, this is not what I'm proposing and I personally dislike it. |
A random thought: with this approach the following will have the same feature names SelectKBest before PCAThree of the original features were passed into the PCA. pipe = make_pipeline(StandardScaler(), SelectKBest(k=3), PCA(n_components=2),
LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].input_features_
# array(['pca0', 'pca1'], dtype='<U4') SelectKBest after PCAAll the original features are passed into PCA, and two of the pca features are selected. pipe = make_pipeline(StandardScaler(), PCA(), SelectKBest(k=2),
LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].input_features_
# array(['pca0', 'pca1'], dtype='<U4') |
I don't think that they have the same feature names is an issue. I think the issue is that information is lost in the first case. |
I think that's fine, the combination of feature names with the new pipeline visualizer should be enough for people to follow what happens inside. |
@adrinjalali but in the first case there's no way for you to know what the selection did. |
|
@adrinjalali That depends a lot on what we consider enough, I think. It means that if the user plots the feature names in the "standard" way by looking at what goes into LogisticRegression, they'll miss part of the story. |
@thomasjpfan you had a PR where you provided full input-output correspondences, right (or was it a comment somewhere)? That would allow tracking "everything" but might become very verbose. |
Co-Authored-By: Andreas Mueller <[email protected]>
The discussion has become pretty long, and I'm getting lost. Could we please merge this PR, and have the other required changes in different PRs, maybe by you @amueller ? ;) |
slep007/proposal.rst
Outdated
illustrates that very well:: | ||
|
||
|
||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we trim this down? I know I'm the one who suggested to add the example instead of link to it but I we mostly need the pipeline building, not the full example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we merge it and then you send a PR shortening the example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not.
Merging PRs with a restricted scope is fine, but merging PRs that are not fully addressed is a slippery slope that I think we should avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we had a chat with @glemaitre and @ogrisel I believe in Paris about this. And to us it seems the SLEP process needs some improvements.
One of the improvements we can think of, is that some of the discussions can happen in separate PRs once the first one is merged. Merging this PR doesn't mean it's accepted or goes for a vote or anything, it just means we can now talk about it in a more structured way. It's not like we have to have only one PR per SLEP.
Also, doing that would make it easier to follow the discussion and the changes. Right now it's really hard to follow the changes.
Another point is that merging them faster would make it easier to distribute the work between us for each SLEP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to idea when it's about discussing complex idea (in a sense, that's what you're already doing by creating multiple SLEPs around feature names, and that's fine)
Less-so for simple changes like this one that only take a few minutes to be addressed. Creating a new PR for this will mostly create more work for all of us in general since it will still require a +2.
Please update the title to something less generic than "Feature Names" since other SLEPs are dealing with feature names. The PR title is a great candidate. Also Please add an abstract section at the top like the previous SLEPs or the template (Sorry if this is a duplicate, I'm certain to have made this comment yesterday but I don't see it anymore) |
I rather have that in a separate PR by you maybe? ;) I can't think of any good text for that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments.
I agree with @amueller that the LogTransformer is no different from the StandardScaler and they should be treated in the same section.
@adrinjalali concern seems to be that different values of the "verbose" parameter may have different defaults depending on the estimator. This sounds reasonable to me and that's not an issue IMO: the SLEP already states that the names will be generated on a case-by-case basis, and that the examples are merely general guidelines.
BTW, I would advocate for a verbose_feature_names
parameter instead of prefix_feature_names
. It would be applicable to any estimator for e.g. log(x0)
, not just the meta estimators.
@adrinjalali here's an attempt for the abstract.
This SLEP proposes the introduction of the
feature_names_in_
attribute for all estimators, and thefeature_names_out_
attribute for all transformers. We here discuss the generation of such attributes and their propagation through pipelines.
The next section could be "Motivations" (no need to change anything to the current text that you have written, it would perfectly fit already)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will catch up with all the discussion and then merge such that we can open subsequent PR on some specific points of the SLEP. |
Why? We can address the remaining points here, if any. The SLEP is reasonably small (and in good shape now) and I'd rather be reviewing one PR than many |
I have to go across 133 messages and unstack. For me, this is exactly the point raised by @adrinjalali there #17 (comment) which make it difficult to catch up with SLEP that I did not look at since a while. |
In the case of this PR there's only one discussion that hasn't been resolved in the diff.
I agree this is a problem. Personally, I try to provide timely feedback when I'm reviewing, and also try to address comments fast when I'm the author. It doesn't always work out though. Ideally, we would prioritize PRs better and allocate time, e.g. during the monthly meetings. I'm being grumpy but I think opening new PRs will mostly scatter discussions, making it even more difficult to catch up. And it may add more work for everyone. But OK, let's try for the SLEPs. I don't think we should be doing that for main repo though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have only 2 main remarks (without solutions :))
- a ``callable`` which would give further flexibility to the user to generate | ||
user defined feature names. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simpler than a callable would be to accept a format string with a specific language, i.e.
users may pass "wrapping_with_parenthesis({input_name})"
and on our side we would do verbose_feature_names.format(input_name=whatever_the_column_name_is)
so that the final name would be wrapping_with_parenthesis(f0)
A given string could actually be the default, to examplify the use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which would change the parameter name to names_format
whose default could be e.g. "log({input_name})"
for the LogTransformer or "pca_{input_name}"
for PCA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we can talk about alternatives, but this SLEP says we're not implementing any of them for now anyway.
slep007/proposal.rst
Outdated
********************* | ||
|
||
``verbose_feature_names`` controls the verbosity of the generated feature names | ||
and it can be ``True`` or ``False``. Alternative solutions could imagine: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and it can be ``True`` or ``False``. Alternative solutions could imagine: | |
and it can be ``True`` or ``False``. Alternative solutions could include: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it, but this is the language I liked and enjoyed, and I don't understand why you'd change it you understand what the text means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not trying to be an ass @adrinjalali .
I'm not an English native speaker and my understanding is influenced by my mother tongue: French. In French, only a living being can "imagine" something. The current sentence says that a "solution" may "imagine" something.
I believe it's incorrect in English, so I'm suggesting an alternative. I might be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure the previous version was not grammatical, I thought it was a typo.
Seems we're mostly happy with the current state of this SLEP. I'll merge to have it on the website, then wait a bit for more feedback, and hopefully soon call a vote on it :) |
This PR was merged with only one approval (mine), and with an unaddressed comment from @amueller https://github.com/scikit-learn/enhancement_proposals/pull/17/files#r373808675. How and when will this comment be addressed? I don't believe it's less worthy than other comments. |
|
Thanks. Just a tip, it might seem obvious to some but I know that it's not always the case (for having discussed with other devs): looking at the diff (https://github.com/scikit-learn/enhancement_proposals/pull/17/files) is a great way to see which comments are still unaddressed, compared to the main conversation page. There might still be comments in the conversation page but at least those in the diff are hard to miss. |
Since it seems the discussion in scikit-learn/scikit-learn#13307 requires a SLEP to decide on the style of feature names, and we talked about lowering the bar for a submitting a slep, I took the liberty to open one for us to discuss the styles and hopefully soon come to a consensus/conclusion.