
Slep007 - feature names, their generation and the API #17


Merged
adrinjalali merged 21 commits into scikit-learn:master from the slep007 branch on Feb 14, 2020

Conversation

adrinjalali
Member

Since it seems the discussion in scikit-learn/scikit-learn#13307 requires a SLEP to decide on the style of feature names, and since we talked about lowering the bar for submitting a SLEP, I took the liberty of opening one for us to discuss the styles and hopefully soon come to a consensus/conclusion.

@amueller
Member

amueller commented Apr 5, 2019

My biggest concern right now is what the feature names should actually be, and I think my doubts are mostly around creating new feature names within pipelines and column transformer.

@amueller
Member

amueller commented Apr 5, 2019

I don't like the pca__pca0 name because it's redundant and seems like it would be a common occurrence. But maybe I'm too worried about it?

Maybe explaining the mechanism for feature names in more detail would make it clearer?
I had posted some discussion in scikit-learn/scikit-learn#12627

There I talked about three cases:

a) The transformer does a non-trivial column transformation, like OneHotEncoder or feature selection
b) The transformer does "nothing" to the columns, like StandardScaler.
c) The transformer does a "too complex" operation to the columns, like PCA.

I guess a slight confusion I have right now is that there is another case: the ColumnTransformer, which also creates new feature names, first prepending the transformer name and then passing through the column names. That's different from pipelines, which don't prepend the estimator name.

My question right now is: do we always prepend the estimator name in ColumnTransformer, even if it generates things like pca__pca0 or standardscaler__sepallength?
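To make the rule concrete, here is a toy sketch of the unconditional-prefix behavior in question; the helper and the names it prints are illustrative only, not ColumnTransformer's actual code.

# Toy model of the naming rule under discussion: the ColumnTransformer
# prepends each transformer's name to every feature name it produces.
def column_transformer_names(named_outputs):
    """named_outputs: list of (transformer_name, output_feature_names)."""
    out = []
    for t_name, f_names in named_outputs:
        out.extend(f"{t_name}__{f}" for f in f_names)
    return out

print(column_transformer_names([
    ("pca", ["pca0", "pca1"]),            # redundant: 'pca__pca0', 'pca__pca1'
    ("standardscaler", ["sepallength"]),  # 'standardscaler__sepallength'
]))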

@adrinjalali
Member Author

regarding the redundancy of the feature names, we could

  • only add the prefix if there are duplicate names (which can also be problematic in some edge cases)
  • allow the user to set a function somewhere which would do the transformation based on the users' preferences.

@amueller
Member

Not sure I really get the 2nd.
The 1st is basically an if in the column transformer that checks if a column is used multiple times. And then what?

@GaelVaroquaux
Member

I've been playing with dask and pandas on very large datasets with many inconsistencies. The experience has really reminded me that code that does not have very consistent behavior leads to nasty bugs.



# I have to duplicate StandardScaler if I want to be able to
# use the pandas column names in the column transformer. Yuk!
Member

Couldn't you put the standard scaler in front of the column transformer here?

Member Author

I think the idea is that "the user could do this", just to showcase what happens if they do. Not that it's the best way to do it.

Member

No, you cannot, as the comment suggests: this proposal is about creating feature names, not about "pandas in, pandas out". StandardScaler will therefore create a numpy array as output, so ColumnTransformer cannot use column names. A work-around would be to use column indices in the ColumnTransformer, or a boolean mask (explicitly using the knowledge that StandardScaler preserves columns).
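For concreteness, a minimal sketch of the duplication being complained about (the column names and estimator choices are made up): scaling inside each ColumnTransformer branch keeps the DataFrame, and hence its column names, available for selection, at the cost of repeating the scaler.

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                  "b": [4.0, 5.0, 6.0],
                  "c": [7.0, 8.0, 9.0]})

# Scaling once, up front, would turn X into a numpy array, so the
# ColumnTransformer below could no longer select "a"/"b" by name.
ct = make_column_transformer(
    (make_pipeline(StandardScaler(), PCA(n_components=1)), ["a", "b"]),
    (StandardScaler(), ["c"]),  # the scaler gets duplicated per branch
)
ct.fit(X)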

@adrinjalali
Member Author

The 1st is basically an if in the column transformer that checks if a column is used multiple times. And then what?

more like:

# sketch: only prefix when the collected names collide
feature_names = []
for name, transformer in transformers:
    # extend (not append) so we collect one flat list of string names
    feature_names.extend(transformer.get_names())

if len(set(feature_names)) == len(feature_names) or not enforce_unique_names:
    return feature_names
else:
    return append_prefixes_to_feature_names(feature_names)

Not sure I really get the 2nd.

something along the lines of:

# sketch: hand all the collected names to a user-supplied callable
feature_names = {}
for name, transformer in transformers:
    feature_names[name] = transformer.get_names()
if fname_transformer is not None:
    return fname_transformer(feature_names)

@amueller
Member

@GaelVaroquaux I'm not sure I understand what your comment about inconsistent behavior is about. Was that re @adrinjalali's suggestion of sometimes not appending the name?

@amueller
Member

@adrinjalali The second one doesn't really make sense the way you wrote it, I think. If you want to replicate the prefix behavior, fname_transformer also needs to get the transformer name at least. I guess that's what you meant?
Are there other functions that would be sensible here apart from lambda t_name, f_name: t_name + f_name and lambda t_name, f_name: f_name?

I don't like 1) because it means that a feature name of an existing feature can change when I add additional features. That's quite counter-intuitive.
Though I guess you could argue the same is true for make_column_transformer names?

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 12, 2019 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 12, 2019 via email

@amueller
Member

I'm a bit annoyed at the duplication of the standard scaler, but I feel it will be hard to make this work better without blowing up the scope of the proposal. It does highlight a shortcoming of the current design, but that shortcoming is unrelated to the feature name proposal; it's "only" related to dealing with pandas data frames.

@adrinjalali
Member Author

Are there other functions that would be sensible here apart from lambda t_name, f_name: t_name + f_name and lambda t_name, f_name: f_name?

Those are the most reasonable options I can think of.

There's no discussion of what the vectorizers do with their input feature names, if anything. Is that even allowed?

Should we then have a parameter like output_feature_name_policy or something, which accepts a few values and based on that we generate the output feature names? It would be consistent throughout the package, so @GaelVaroquaux's concern would be addressed, I guess, and I suppose it would also fix your concerns with the verbose feature names, @amueller? Of course by default they'd be pretty verbose, but by passing an appropriate value for the parameter to some of the transformers/pipelines, the generated feature names would make sense.

@adrinjalali The second one doesn't really make sense the way you wrote it, I think. If you want to replicate the prefix behavior, fname_transformer also needs to get the transformer name at least. I guess that's what you meant?

Yes, exactly. We can even get it from self if we add a Mixin? Shouldn't be too much change.

@amueller
Member

Should we then have a parameter like output_feature_name_policy or something, which accepts a few values and based on that we generate the output feature names?

parameter to what? each estimator? just the ColumnTransformer? global configuration (errrr probably not if column names will be used in actual behavior).

@adrinjalali
Member Author

parameter to what? each estimator? just the ColumnTransformer? global configuration (errrr probably not if column names will be used in actual behavior).

A parameter used by feature_names_out(). I was thinking of having it in each estimator; it'd be easier for the user to tune the parameters. Otherwise, if we have it only for Pipeline and ColumnTransformer, we'd need to provide a way for the user to specify maybe an {estimator: policy} dict, which doesn't seem easier from the user's side to me.

What we can do is expose a function set_ofnames_policy() in the Mixin, so that it can easily be discovered by the user. Then we can start by adding the parameter to estimators' __init__, starting with the ones which would need it most frequently. It doesn't have to be in all of them to begin with.
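A rough sketch of what such a Mixin could look like; the names set_ofnames_policy and feature_names_out come from the comment above, and the policy values and default are assumptions, not an implemented API.

class FeatureNamesPolicyMixin:
    """Hypothetical sketch of the discoverable-policy idea; not scikit-learn code."""

    _ofnames_policy = "prefix"  # assumed default

    def set_ofnames_policy(self, policy):
        # policy: "prefix" (e.g. 'pca__pca0') or "plain" (e.g. 'pca0')
        self._ofnames_policy = policy
        return self

    def feature_names_out(self, input_features):
        prefix = type(self).__name__.lower()
        if self._ofnames_policy == "prefix":
            return [f"{prefix}__{name}" for name in input_features]
        return list(input_features)


class ToyPCA(FeatureNamesPolicyMixin):
    pass


print(ToyPCA().feature_names_out(["pca0"]))
# ['toypca__pca0']
print(ToyPCA().set_ofnames_policy("plain").feature_names_out(["pca0"]))
# ['pca0']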

@amueller
Member

ok so you mean adding a parameter to the constructor.

I don't understand your argument about having to pass a dict. It's a constructor argument for a single estimator, right? So I know exactly which estimator I'm applying it to.

@adrinjalali
Member Author

I don't understand your argument about having to pass a dict. It's a constructor argument for a single estimator, right? So I know exactly which estimator I'm applying it to.

If we accept the extra parameter to the constructor, then we don't need the dict. The dict was the alternative: if we reject the proposal to have the constructor argument and want the policy argument available only on a few objects such as ColumnTransformer, Pipeline, etc., then we need the dict for more granular control over the policies of each underlying object. To be clear, this is not what I'm proposing, and I personally dislike it.

@thomasjpfan
Member

A random thought: with this approach the following will have the same feature names

SelectKBest before PCA

Three of the original features were passed into the PCA.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data

pipe = make_pipeline(StandardScaler(), SelectKBest(k=3), PCA(n_components=2),
                     LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].input_features_
# array(['pca0', 'pca1'], dtype='<U4')

SelectKBest after PCA

All the original features are passed into PCA, and two of the pca features are selected.

pipe = make_pipeline(StandardScaler(), PCA(), SelectKBest(k=2),
                     LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].input_features_
# array(['pca0', 'pca1'], dtype='<U4')

@amueller
Member

I don't think the fact that they have the same feature names is an issue. I think the issue is that information is lost in the first case.

@adrinjalali
Member Author

A random thought: with this approach the following will have the same feature names

I think that's fine, the combination of feature names with the new pipeline visualizer should be enough for people to follow what happens inside.

@amueller
Member

@adrinjalali but in the first case there's no way for you to know what the selection did.

@adrinjalali
Member Author

@adrinjalali but in the first case there's no way for you to know what the selection did.

pipe[1].feature_names_out_ will be the original features selected by that transformer. Is that not enough?

@amueller
Member

@adrinjalali That depends a lot on what we consider enough, I think. It means that if the user plots the feature names in the "standard" way, by looking at what goes into LogisticRegression, they'll miss part of the story.
But to provide the full information we would need much more elaborate/verbose feature names (like pca1(x_1, x_2)). If there are many features this will quickly get out of hand, though...
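To illustrate why, here is a hypothetical helper building such fully verbose names; with many inputs, every output name drags all of them along.

def verbose_name(out_name, input_names):
    return f"{out_name}({', '.join(input_names)})"

print(verbose_name("pca1", ["x_1", "x_2"]))
# pca1(x_1, x_2)
print(len(verbose_name("pca0", [f"x_{i}" for i in range(100)])))
# 594: a single output name that records 100 inputs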

@amueller
Member

@thomasjpfan you had a PR where you provided full input-output correspondences, right (or was it a comment somewhere)? That would allow tracking "everything" but might become very verbose.

@adrinjalali changed the title from "Slep007 - the style of feature names" to "Slep007 - feature names, their generation and the API" on Feb 10, 2020
@adrinjalali
Member Author

The discussion has become pretty long, and I'm getting lost. Could we please merge this PR, and have the other required changes in different PRs, maybe by you @amueller ? ;)

illustrates that very well::


"""
Member

Can we trim this down? I know I'm the one who suggested adding the example instead of linking to it, but we mostly need the pipeline building, not the full example.

Member Author

Could we merge it and then you send a PR shortening the example?

Member

I'd rather not.
Merging PRs with a restricted scope is fine, but merging PRs that are not fully addressed is a slippery slope that I think we should avoid.

Member Author

I had a chat with @glemaitre and @ogrisel about this, in Paris I believe, and to us it seems the SLEP process needs some improvements.

One of the improvements we can think of, is that some of the discussions can happen in separate PRs once the first one is merged. Merging this PR doesn't mean it's accepted or goes for a vote or anything, it just means we can now talk about it in a more structured way. It's not like we have to have only one PR per SLEP.

Also, doing that would make it easier to follow the discussion and the changes. Right now it's really hard to follow the changes.

Another point is that merging them faster would make it easier to distribute the work between us for each SLEP.

Member

I'm open to the idea when it's about discussing a complex topic (in a sense, that's what you're already doing by creating multiple SLEPs around feature names, and that's fine).

Less so for simple changes like this one that only take a few minutes to address. Creating a new PR for this will mostly create more work for all of us in general, since it will still require a +2.

@NicolasHug
Member

Please update the title to something less generic than "Feature Names" since other SLEPs are dealing with feature names. The PR title is a great candidate.

Also, please add an abstract section at the top, like the previous SLEPs or the template.

(Sorry if this is a duplicate; I'm certain I made this comment yesterday but I don't see it anymore.)

@adrinjalali
Member Author

Also, please add an abstract section at the top, like the previous SLEPs or the template.

(Sorry if this is a duplicate; I'm certain I made this comment yesterday but I don't see it anymore.)

I'd rather have that in a separate PR, by you maybe? ;) I can't think of any good text for that.

@NicolasHug left a comment
Member

A few more comments.

I agree with @amueller that the LogTransformer is no different from the StandardScaler and they should be treated in the same section.
@adrinjalali's concern seems to be that different values of the "verbose" parameter may have different defaults depending on the estimator. This sounds reasonable to me and isn't an issue IMO: the SLEP already states that the names will be generated on a case-by-case basis, and that the examples are merely general guidelines.

BTW, I would advocate for a verbose_feature_names parameter instead of prefix_feature_names. It would be applicable to any estimator, e.g. producing log(x0), not just the meta-estimators.

@adrinjalali here's an attempt for the abstract.

This SLEP proposes the introduction of the feature_names_in_ attribute for all estimators, and the feature_names_out_ attribute for all transformers. We here discuss the generation of such attributes and their propagation through pipelines.

The next section could be "Motivations" (no need to change anything to the current text that you have written, it would perfectly fit already)
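As an illustration of the abstract, here is a toy transformer carrying the two proposed attributes; the fit signature and default name generation are assumptions made for the sketch, not scikit-learn's implementation.

import numpy as np

class ToyScaler:
    """Toy transformer carrying the two proposed attributes."""

    def fit(self, X, feature_names=None):
        n_features = X.shape[1]
        self.feature_names_in_ = (list(feature_names) if feature_names is not None
                                  else [f"x{i}" for i in range(n_features)])
        # a scaler keeps columns in place, so names pass through unchanged
        self.feature_names_out_ = list(self.feature_names_in_)
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return X - self.mean_

scaler = ToyScaler().fit(np.array([[1.0, 2.0], [3.0, 4.0]]))
print(scaler.feature_names_in_)   # ['x0', 'x1']
print(scaler.feature_names_out_)  # ['x0', 'x1']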

@NicolasHug left a comment
Member

Some last nitpicks.

I'm happy with the SLEP.

Thanks a lot @adrinjalali for all the hard work.

@glemaitre
Member

OK, I will catch up with all the discussion and then merge, such that we can open subsequent PRs on specific points of the SLEP.

@NicolasHug
Member

Why? We can address the remaining points here, if any. The SLEP is reasonably small (and in good shape now) and I'd rather be reviewing one PR than many

@glemaitre
Member

The SLEP is reasonably small (and in good shape now) and I'd rather be reviewing one PR than many.

I have to go through 133 messages and unstack them. For me, this is exactly the point raised by @adrinjalali in #17 (comment), which makes it difficult to catch up with a SLEP that I haven't looked at in a while.

@NicolasHug
Member

NicolasHug commented Feb 12, 2020

In the case of this PR there's only one discussion that hasn't been resolved in the diff.

I have to go through 133 messages and unstack them

I agree this is a problem. Personally, I try to provide timely feedback when I'm reviewing, and also try to address comments fast when I'm the author. It doesn't always work out though. Ideally, we would prioritize PRs better and allocate time, e.g. during the monthly meetings.

I'm being grumpy, but I think opening new PRs will mostly scatter discussions, making it even more difficult to catch up. And it may add more work for everyone. But OK, let's try it for the SLEPs. I don't think we should be doing that for the main repo, though.

@glemaitre left a comment
Member

I have only 2 main remarks (without solutions :))

Comment on lines +121 to +122
- a ``callable`` which would give further flexibility to the user to generate
user defined feature names.
@NicolasHug commented Feb 13, 2020
Member

simpler than a callable would be to accept a format string with a specific language, i.e.

users may pass "wrapping_with_parenthesis({input_name})"

and on our side we would do verbose_feature_names.format(input_name=whatever_the_column_name_is)

so that the final name would be wrapping_with_parenthesis(f0)

A given string could actually be the default, to exemplify the use.
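A quick sketch of this format-string approach; the parameter value and column names are hypothetical, mirroring the suggestion above rather than any real API.

verbose_feature_names = "wrapping_with_parenthesis({input_name})"

# on the library side, each input column name is substituted into the template
final_names = [verbose_feature_names.format(input_name=col)
               for col in ["f0", "f1"]]
print(final_names)
# ['wrapping_with_parenthesis(f0)', 'wrapping_with_parenthesis(f1)']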

Member

Which would change the parameter name to names_format, whose default could be e.g. "log({input_name})" for the LogTransformer or "pca_{input_name}" for PCA.

Member Author

yes, we can talk about alternatives, but this SLEP says we're not implementing any of them for now anyway.

*********************

``verbose_feature_names`` controls the verbosity of the generated feature names
and it can be ``True`` or ``False``. Alternative solutions could imagine:
Member

Suggested change
and it can be ``True`` or ``False``. Alternative solutions could imagine:
and it can be ``True`` or ``False``. Alternative solutions could include:

Member Author

I changed it, but this is the language I liked and enjoyed, and I don't understand why you'd change it if you understand what the text means.

Member

I'm not trying to be an ass, @adrinjalali.

I'm not an English native speaker and my understanding is influenced by my mother tongue: French. In French, only a living being can "imagine" something. The current sentence says that a "solution" may "imagine" something.

I believe it's incorrect in English, so I'm suggesting an alternative. I might be wrong.

Member

I'm pretty sure the previous version was not grammatical, I thought it was a typo.

@adrinjalali
Member Author

Seems we're mostly happy with the current state of this SLEP. I'll merge to have it on the website, then wait a bit for more feedback, and hopefully soon call a vote on it :)

@adrinjalali merged commit 34a82fd into scikit-learn:master on Feb 14, 2020
@adrinjalali deleted the slep007 branch on February 14, 2020 09:49
@NicolasHug
Member

This PR was merged with only one approval (mine), and with an unaddressed comment from @amueller: https://github.com/scikit-learn/enhancement_proposals/pull/17/files#r373808675. How and when will this comment be addressed? I don't believe it's less worthy than other comments.

@adrinjalali
Member Author

  1. Forgetting about that is one of the reasons I'd like to merge SLEPs much faster: it keeps those discussions contained in their own issue/PR.
  2. This is now merged, not accepted; we can continue the discussion in separate issues. I'll open an issue with that question.

@NicolasHug
Member

Thanks.

Just a tip; it might seem obvious to some, but I know that it's not always the case (from having discussed it with other devs): looking at the diff (https://github.com/scikit-learn/enhancement_proposals/pull/17/files) is a great way to see which comments are still unaddressed, compared to the main conversation page. There might still be comments on the conversation page, but at least those in the diff are hard to miss.
