
Slep007 - feature names, their generation and the API #17


Merged
adrinjalali merged 21 commits into scikit-learn:master from the slep007 branch on Feb 14, 2020

Conversation

adrinjalali
Member

Since it seems the discussion in scikit-learn/scikit-learn#13307 requires a SLEP to decide on the style of feature names, and since we talked about lowering the bar for submitting a SLEP, I took the liberty of opening one for us to discuss the styles and hopefully soon come to a consensus/conclusion.

@amueller
Member

amueller commented Apr 5, 2019

My biggest concern right now is what the feature names should actually be, and I think my doubts are mostly around creating new feature names within pipelines and column transformer.

@amueller
Member

amueller commented Apr 5, 2019

I don't like the pca__pca0 name because it's redundant and seems like it would be a common occurrence. But maybe I'm too worried about it?

Maybe explaining the mechanism for feature names in more detail would make it clearer?
I had posted some discussion in scikit-learn/scikit-learn#12627

There I talked about three cases:

a) The transformer does a non-trivial column transformation, like OneHotEncoder or feature selection
b) The transformer does "nothing" to the columns, like StandardScaler.
c) The transformer does a "too complex" operation to the columns, like PCA.

I guess a slight confusion I have right now is that there is another case: the ColumnTransformer, which also creates new feature names, first prepending the transformer name and then passing through the column names. That's different from pipelines, which don't prepend the estimator name.

My question right now is: do we always prepend the estimator name in ColumnTransformer, even if it generates things like pca__pca0 or standardscaler__sepallength?
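To make the rule concrete, here is a toy sketch of the unconditional-prefix behavior in question; the helper and the names it prints are illustrative only, not ColumnTransformer's actual code.

# Toy model of the naming rule under discussion: the ColumnTransformer
# prepends each transformer's name to every feature name it produces.
def column_transformer_names(named_outputs):
    """named_outputs: list of (transformer_name, output_feature_names)."""
    out = []
    for t_name, f_names in named_outputs:
        out.extend(f"{t_name}__{f}" for f in f_names)
    return out

print(column_transformer_names([
    ("pca", ["pca0", "pca1"]),            # redundant: 'pca__pca0', 'pca__pca1'
    ("standardscaler", ["sepallength"]),  # 'standardscaler__sepallength'
]))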

@adrinjalali
Member Author

regarding the redundancy of the feature names, we could

  • only add the prefix if there are duplicate names (which can also be problematic in some edge cases)
  • allow the user to set a function somewhere which would do the transformation based on the users' preferences.

@amueller
Member

Not sure I really get the 2nd.
The 1st is basically an if in the column transformer that checks if a column is used multiple times. And then what?

@GaelVaroquaux
Member

I've been playing with dask and pandas on very large datasets with many inconsistencies. The experience has really reminded me that code that does not have very consistent behavior leads to nasty bugs.



# I have to duplicate StandardScaler if I want to be able to
# use the pandas column names in the column transformer. Yuk!
Member

Couldn't you put the standard scaler in front of the column transformer here?

Member Author

I think the idea is that "the user could do this", just to showcase what happens if they do. Not that it's the best way to do it.

Member

No, you cannot, as the comment suggests: this proposal is about creating feature names, not about "pandas in, pandas out". StandardScaler will therefore create a numpy array as output, so ColumnTransformer cannot use column names. A work-around would be to use column indices in the ColumnTransformer, or a boolean mask (explicitly using the knowledge that StandardScaler preserves columns).
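For concreteness, a minimal sketch of the duplication being complained about (the column names and estimator choices are made up): scaling inside each ColumnTransformer branch keeps the DataFrame, and hence its column names, available for selection, at the cost of repeating the scaler.

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                  "b": [4.0, 5.0, 6.0],
                  "c": [7.0, 8.0, 9.0]})

# Scaling once, up front, would turn X into a numpy array, so the
# ColumnTransformer below could no longer select "a"/"b" by name.
ct = make_column_transformer(
    (make_pipeline(StandardScaler(), PCA(n_components=1)), ["a", "b"]),
    (StandardScaler(), ["c"]),  # the scaler gets duplicated per branch
)
ct.fit(X)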

@adrinjalali
Member Author

The 1st is basically an if in the column transformer that checks if a column is used multiple times. And then what?

more like:

# sketch: only prefix when the collected names collide
feature_names = []
for name, transformer in transformers:
    # extend (not append) so we collect one flat list of string names
    feature_names.extend(transformer.get_names())

if len(set(feature_names)) == len(feature_names) or not enforce_unique_names:
    return feature_names
else:
    return append_prefixes_to_feature_names(feature_names)

Not sure I really get the 2nd.

something along the lines of:

# sketch: hand all the collected names to a user-supplied callable
feature_names = {}
for name, transformer in transformers:
    feature_names[name] = transformer.get_names()
if fname_transformer is not None:
    return fname_transformer(feature_names)

@amueller
Member

@GaelVaroquaux I'm not sure I understand what your comment about inconsistent behavior is about. Was that re @adrinjalali's suggestion of sometimes not appending the name?

@amueller
Member

@adrinjalali The second one doesn't really make sense the way you wrote it, I think. If you want to replicate the prefix behavior, fname_transformer also needs to get the transformer name at least. I guess that's what you meant?
Are there other functions that would be sensible here apart from lambda t_name, f_name: t_name + f_name and lambda t_name, f_name: f_name?

I don't like 1) because it means that a feature name of an existing feature can change when I add additional features. That's quite counter-intuitive.
Though I guess you could argue the same is true for make_column_transformer names?

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 12, 2019 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 12, 2019 via email

@amueller
Member

I'm a bit annoyed at the duplication of the standard scaler, but I feel it will be hard to make this work better without blowing up the scope of the proposal. It does highlight a shortcoming of the current design, but that shortcoming is unrelated to the feature name proposal; it's "only" related to dealing with pandas data frames.

@adrinjalali
Member Author

Are there other functions that would be sensible here apart from lambda t_name, f_name: t_name + f_name and lambda t_name, f_name: f_name?

Those are the most reasonable options I can think of.

There's no discussion of what the vectorizers do with their input feature names, if anything. Is that even allowed?

Should we then have a parameter like output_feature_name_policy or something, which accepts a few values and based on that we generate the output feature names? It would be consistent throughout the package, so @GaelVaroquaux's concern would be addressed, I guess, and I suppose it would also fix your concerns with the verbose feature names, @amueller? Of course by default they'd be pretty verbose, but by passing an appropriate value for the parameter to some of the transformers/pipelines, the generated feature names would make sense.

@adrinjalali The second one doesn't really make sense the way you wrote it, I think. If you want to replicate the prefix behavior, fname_transformer also needs to get the transformer name at least. I guess that's what you meant?

Yes, exactly. We can even get it from self if we add a Mixin? Shouldn't be too much change.

@amueller
Member

Should we then have a parameter like output_feature_name_policy or something, which accepts a few values and based on that we generate the output feature names?

parameter to what? each estimator? just the ColumnTransformer? global configuration (errrr probably not if column names will be used in actual behavior).

@adrinjalali
Member Author

parameter to what? each estimator? just the ColumnTransformer? global configuration (errrr probably not if column names will be used in actual behavior).

A parameter used by feature_names_out(). I was thinking of having it in each estimator; it'd be easier for the user to tune the parameters. Otherwise, if we have it only for Pipeline and ColumnTransformer, we'd need to provide a way for the user to specify maybe an {estimator: policy} dict, which doesn't seem easier from the user's side to me.

What we can do is expose a function set_ofnames_policy() in the Mixin, so that it can easily be discovered by the user. Then we can start by adding the parameter to estimators' __init__, starting with the ones which would need it most frequently. It doesn't have to be in all of them to begin with.
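A rough sketch of what such a Mixin could look like; the names set_ofnames_policy and feature_names_out come from the comment above, and the policy values and default are assumptions, not an implemented API.

class FeatureNamesPolicyMixin:
    """Hypothetical sketch of the discoverable-policy idea; not scikit-learn code."""

    _ofnames_policy = "prefix"  # assumed default

    def set_ofnames_policy(self, policy):
        # policy: "prefix" (e.g. 'pca__pca0') or "plain" (e.g. 'pca0')
        self._ofnames_policy = policy
        return self

    def feature_names_out(self, input_features):
        prefix = type(self).__name__.lower()
        if self._ofnames_policy == "prefix":
            return [f"{prefix}__{name}" for name in input_features]
        return list(input_features)


class ToyPCA(FeatureNamesPolicyMixin):
    pass


print(ToyPCA().feature_names_out(["pca0"]))
# ['toypca__pca0']
print(ToyPCA().set_ofnames_policy("plain").feature_names_out(["pca0"]))
# ['pca0']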

@amueller
Member

ok so you mean adding a parameter to the constructor.

I don't understand your argument about having to pass a dict. It's a constructor argument for a single estimator, right? So I know exactly which estimator I'm applying it to.

@adrinjalali
Member Author

I don't understand your argument about having to pass a dict. It's a constructor argument for a single estimator, right? So I know exactly which estimator I'm applying it to.

If we accept the extra parameter to the constructor, then we don't need the dict. The dict was the alternative: if we reject the proposal to have the constructor argument and want the policy argument available only on a few objects such as ColumnTransformer, Pipeline, etc., then we need the dict for more granular control over the policies of each underlying object. To be clear, this is not what I'm proposing, and I personally dislike it.

@thomasjpfan
Member

A random thought: with this approach the following will have the same feature names

SelectKBest before PCA

Three of the original features were passed into the PCA.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data

pipe = make_pipeline(StandardScaler(), SelectKBest(k=3), PCA(n_components=2),
                     LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].input_features_
# array(['pca0', 'pca1'], dtype='<U4')

SelectKBest after PCA

All the original features are passed into PCA, and two of the pca features are selected.

pipe = make_pipeline(StandardScaler(), PCA(), SelectKBest(k=2),
                     LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].input_features_
# array(['pca0', 'pca1'], dtype='<U4')

@amueller
Member

I don't think the fact that they have the same feature names is an issue. I think the issue is that information is lost in the first case.

@adrinjalali
Member Author

A random thought: with this approach the following will have the same feature names

I think that's fine, the combination of feature names with the new pipeline visualizer should be enough for people to follow what happens inside.

@amueller
Member

@adrinjalali but in the first case there's no way for you to know what the selection did.

@adrinjalali
Member Author

@adrinjalali but in the first case there's no way for you to know what the selection did.

pipe[1].feature_names_out_ will be the original features selected by that transformer. Is that not enough?

@amueller
Member

@adrinjalali That depends a lot on what we consider enough, I think. It means that if the user plots the feature names in the "standard" way, by looking at what goes into LogisticRegression, they'll miss part of the story.
But to provide the full information we would need much more elaborate/verbose feature names (like pca1(x_1, x_2)). If there are many features this will quickly get out of hand, though...
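To illustrate why, here is a hypothetical helper building such fully verbose names; with many inputs, every output name drags all of them along.

def verbose_name(out_name, input_names):
    return f"{out_name}({', '.join(input_names)})"

print(verbose_name("pca1", ["x_1", "x_2"]))
# pca1(x_1, x_2)
print(len(verbose_name("pca0", [f"x_{i}" for i in range(100)])))
# 594: a single output name that records 100 inputs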

@amueller
Member

@thomasjpfan you had a PR where you provided full input-output correspondences, right (or was it a comment somewhere)? That would allow tracking "everything" but might become very verbose.

@adrinjalali changed the title from "Slep007 - the style of feature names" to "Slep007 - feature names, their generation and the API" on Feb 10, 2020
@adrinjalali
Member Author

The discussion has become pretty long, and I'm getting lost. Could we please merge this PR, and have the other required changes in different PRs, maybe by you @amueller ? ;)

illustrates that very well::


"""
Member

Can we trim this down? I know I'm the one who suggested adding the example instead of linking to it, but we mostly need the pipeline building, not the full example.

Member Author

Could we merge it and then you send a PR shortening the example?

Member

I'd rather not.
Merging PRs with a restricted scope is fine, but merging PRs that are not fully addressed is a slippery slope that I think we should avoid.

Member Author

I had a chat with @glemaitre and @ogrisel about this, in Paris I believe, and to us it seems the SLEP process needs some improvements.

One of the improvements we can think of, is that some of the discussions can happen in separate PRs once the first one is merged. Merging this PR doesn't mean it's accepted or goes for a vote or anything, it just means we can now talk about it in a more structured way. It's not like we have to have only one PR per SLEP.

Also, doing that would make it easier to follow the discussion and the changes. Right now it's really hard to follow the changes.

Another point is that merging them faster would make it easier to distribute the work between us for each SLEP.

Member

I'm open to the idea when it's about discussing a complex topic (in a sense, that's what you're already doing by creating multiple SLEPs around feature names, and that's fine).

Less so for simple changes like this one that only take a few minutes to address. Creating a new PR for this will mostly create more work for all of us in general, since it will still require a +2.

@NicolasHug
Member

Please update the title to something less generic than "Feature Names" since other SLEPs are dealing with feature names. The PR title is a great candidate.

Also, please add an abstract section at the top, like the previous SLEPs or the template.

(Sorry if this is a duplicate; I'm certain I made this comment yesterday but I don't see it anymore.)

@adrinjalali
Member Author

Also, please add an abstract section at the top, like the previous SLEPs or the template.

(Sorry if this is a duplicate; I'm certain I made this comment yesterday but I don't see it anymore.)

I'd rather have that in a separate PR, by you maybe? ;) I can't think of any good text for that.

@NicolasHug left a comment
Member

A few more comments.

I agree with @amueller that the LogTransformer is no different from the StandardScaler and they should be treated in the same section.
@adrinjalali's concern seems to be that different values of the "verbose" parameter may have different defaults depending on the estimator. This sounds reasonable to me and isn't an issue IMO: the SLEP already states that the names will be generated on a case-by-case basis, and that the examples are merely general guidelines.

BTW, I would advocate for a verbose_feature_names parameter instead of prefix_feature_names. It would be applicable to any estimator, e.g. producing log(x0), not just the meta-estimators.

@adrinjalali here's an attempt for the abstract.

This SLEP proposes the introduction of the feature_names_in_ attribute for all estimators, and the feature_names_out_ attribute for all transformers. We here discuss the generation of such attributes and their propagation through pipelines.

The next section could be "Motivations" (no need to change anything to the current text that you have written, it would perfectly fit already)
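As an illustration of the abstract, here is a toy transformer carrying the two proposed attributes; the fit signature and default name generation are assumptions made for the sketch, not scikit-learn's implementation.

import numpy as np

class ToyScaler:
    """Toy transformer carrying the two proposed attributes."""

    def fit(self, X, feature_names=None):
        n_features = X.shape[1]
        self.feature_names_in_ = (list(feature_names) if feature_names is not None
                                  else [f"x{i}" for i in range(n_features)])
        # a scaler keeps columns in place, so names pass through unchanged
        self.feature_names_out_ = list(self.feature_names_in_)
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return X - self.mean_

scaler = ToyScaler().fit(np.array([[1.0, 2.0], [3.0, 4.0]]))
print(scaler.feature_names_in_)   # ['x0', 'x1']
print(scaler.feature_names_out_)  # ['x0', 'x1']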

@NicolasHug left a comment
Member

Some last nitpicks.

I'm happy with the SLEP.

Thanks a lot @adrinjalali for all the hard work.

@glemaitre
Member

OK, I will catch up with all the discussion and then merge, such that we can open subsequent PRs on specific points of the SLEP.

@NicolasHug
Member

Why? We can address the remaining points here, if any. The SLEP is reasonably small (and in good shape now) and I'd rather be reviewing one PR than many

@glemaitre
Member

The SLEP is reasonably small (and in good shape now) and I'd rather be reviewing one PR than many.

I have to go through 133 messages and unstack them. For me, this is exactly the point raised by @adrinjalali in #17 (comment), which makes it difficult to catch up with a SLEP that I haven't looked at in a while.

@NicolasHug
Member

NicolasHug commented Feb 12, 2020

In the case of this PR there's only one discussion that hasn't been resolved in the diff.

I have to go through 133 messages and unstack them

I agree this is a problem. Personally, I try to provide timely feedback when I'm reviewing, and also try to address comments fast when I'm the author. It doesn't always work out though. Ideally, we would prioritize PRs better and allocate time, e.g. during the monthly meetings.

I'm being grumpy, but I think opening new PRs will mostly scatter discussions, making it even more difficult to catch up. And it may add more work for everyone. But OK, let's try it for the SLEPs. I don't think we should be doing that for the main repo, though.

@glemaitre left a comment
Member

I have only 2 main remarks (without solutions :))

Comment on lines +121 to +122
- a ``callable`` which would give further flexibility to the user to generate
user defined feature names.
@NicolasHug commented Feb 13, 2020
Member

simpler than a callable would be to accept a format string with a specific language, i.e.

users may pass "wrapping_with_parenthesis({input_name})"

and on our side we would do verbose_feature_names.format(input_name=whatever_the_column_name_is)

so that the final name would be wrapping_with_parenthesis(f0)

A given string could actually be the default, to exemplify the use.
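A quick sketch of this format-string approach; the parameter value and column names are hypothetical, mirroring the suggestion above rather than any real API.

verbose_feature_names = "wrapping_with_parenthesis({input_name})"

# on the library side, each input column name is substituted into the template
final_names = [verbose_feature_names.format(input_name=col)
               for col in ["f0", "f1"]]
print(final_names)
# ['wrapping_with_parenthesis(f0)', 'wrapping_with_parenthesis(f1)']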

Member

Which would change the parameter name to names_format, whose default could be e.g. "log({input_name})" for the LogTransformer or "pca_{input_name}" for PCA.

Member Author

yes, we can talk about alternatives, but this SLEP says we're not implementing any of them for now anyway.

*********************

``verbose_feature_names`` controls the verbosity of the generated feature names
and it can be ``True`` or ``False``. Alternative solutions could imagine:
Member

Suggested change
and it can be ``True`` or ``False``. Alternative solutions could imagine:
and it can be ``True`` or ``False``. Alternative solutions could include:

Member Author

I changed it, but this is the language I liked and enjoyed, and I don't understand why you'd change it if you understand what the text means.

Member

I'm not trying to be an ass, @adrinjalali.

I'm not an English native speaker and my understanding is influenced by my mother tongue: French. In French, only a living being can "imagine" something. The current sentence says that a "solution" may "imagine" something.

I believe it's incorrect in English, so I'm suggesting an alternative. I might be wrong.

Member

I'm pretty sure the previous version was not grammatical, I thought it was a typo.

@adrinjalali
Member Author

Seems we're mostly happy with the current state of this SLEP. I'll merge to have it on the website, then wait a bit for more feedback, and hopefully soon call a vote on it :)

@adrinjalali merged commit 34a82fd into scikit-learn:master on Feb 14, 2020
@adrinjalali deleted the slep007 branch on February 14, 2020 09:49
@NicolasHug
Member

This PR was merged with only one approval (mine), and with an unaddressed comment from @amueller: https://github.com/scikit-learn/enhancement_proposals/pull/17/files#r373808675. How and when will this comment be addressed? I don't believe it's less worthy than other comments.

@adrinjalali
Member Author

  1. Forgetting about that is one of the reasons I'd like to merge SLEPs much faster: it keeps those discussions contained in their own issue/PR.
  2. This is now merged, not accepted; we can continue the discussion in separate issues. I'll open an issue with that question.

@NicolasHug
Member

Thanks.

Just a tip; it might seem obvious to some, but I know that it's not always the case (from having discussed it with other devs): looking at the diff (https://github.com/scikit-learn/enhancement_proposals/pull/17/files) is a great way to see which comments are still unaddressed, compared to the main conversation page. There might still be comments on the conversation page, but at least those in the diff are hard to miss.
