SLEP 8: Propagating feature names #18

Conversation
> gets recursively called on each step of a ``Pipeline`` so that the feature
> names get propagated throughout the full ``Pipeline``. This will allow
> inspecting the input and output feature names in each step of a ``Pipeline``.
We talked about having it propagated through the pipeline as the pipeline goes, so that in each step of the pipeline the model could potentially use those names. That's slightly different than recursively calling it to get the names once the pipeline has been fit.
Yes, we should mention that, but maybe you can provide a suggestion for motivation and implementation?
We talked about having it propagated through the pipeline as the pipeline goes, so that in each step of the pipeline the model could potentially use those names.

That's maybe partly related to what I mentioned below in one of the questions about standalone estimators (not in a pipeline). If we want those to behave similarly, the `fit` method of the estimator needs to do something (at least, with the current proposal, calling the "update feature names" method). But if we actually let `fit` handle the actual feature name logic (needed for the above suggestion), that directly solves the issue of standalone vs. within-pipeline consistency.
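A minimal sketch of what letting `fit` handle the naming could look like; the attribute names (`input_feature_names_`, `output_feature_names_`) follow this discussion and are not an actual scikit-learn API:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SelectFirstK(TransformerMixin, BaseEstimator):
    """Toy transformer keeping the first k columns (illustration only)."""

    def __init__(self, k=2):
        self.k = k

    def fit(self, X, y=None):
        # fit itself records and derives the names, so a standalone
        # estimator behaves exactly like a step inside a Pipeline
        if hasattr(X, "columns"):  # e.g. a pandas DataFrame
            self.input_feature_names_ = list(X.columns)
            self.output_feature_names_ = self.input_feature_names_[: self.k]
        return self

    def transform(self, X):
        return np.asarray(X)[:, : self.k]
```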
> potentially removing the need to have an explicit output feature names *getter
> method*. The "update feature names" method would then mainly be used for
> setting the input features and making sure they get propagated.
+1 for having them [almost] everywhere.
> standing estimators and Pipelines. However, the clear downside of this
> consistency is that this would add one line to each ``fit`` method throughout
> scikit-learn.
There's also the option of having a `fit` which does some common tasks such as setting the feature names, and letting the child classes only implement `_fit`. It kinda goes along the lines of what's being done in scikit-learn/scikit-learn#13603.
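A rough sketch of that pattern, assuming a hypothetical `_fit` hook (names illustrative; see scikit-learn/scikit-learn#13603 for the real discussion):

```python
from sklearn.base import BaseEstimator


class FeatureNamesBase(BaseEstimator):
    """Hypothetical base class: fit() does the shared bookkeeping,
    subclasses only implement _fit()."""

    def fit(self, X, y=None, **fit_params):
        # common task shared by all estimators: record input names
        if hasattr(X, "columns"):
            self.input_feature_names_ = list(X.columns)
        return self._fit(X, y, **fit_params)


class MyEstimator(FeatureNamesBase):
    def _fit(self, X, y=None):
        # estimator-specific fitting logic goes here
        return self
```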
There was also the concern that the user may want to disable this propagation. (I think this SLEP hasn't addressed that case yet.)
Can you elaborate? I don't remember that part.
I think it was specifically in the context of NLP related usecases where the whole "dictionary" becomes the features and it may be very memory intensive to store them. IIRC @jnothman raised the concern.
> I think it was specifically in the context of NLP related usecases where
> the whole "dictionary" becomes the features and it may be very memory
> intensive to store them. IIRC @jnothman <https://github.com/jnothman> raised
> the concern.

Well, our CountVectorizer is currently quite naive about this w.r.t. ngrams,
but we would be forcing this idea of storing feature names in sparse spaces
on people who have been using other vectorization tools.
I think the SLEP needs a list of use-cases, in particular comparing the pandas and non-pandas ones and checking if there are other relevant cases.
Do we ever actually change feature names that have been set? Maybe to simplify them?
> The core idea of this proposal is that all transformers get a transformative
> *"update feature names"* method that can determine the output feature names,
Maybe say that the name is up for discussion?
> The ``Pipeline`` and all meta-estimators implement this method by calling it
> recursively on all its child steps / sub-estimators, and in this way the input
> feature names get propagated through the full pipeline. In addition, it sets
> the ``input_feature_names_`` attribute on each step of the pipeline.
Maybe explain why that is necessary?
> feature names get propagated through the full pipeline. In addition, it sets
> the ``input_feature_names_`` attribute on each step of the pipeline.
>
> A Pipeline calls the method at the end of ``fit()`` (using the DataFrame column
As @adrinjalali says, there is also the possibility to set them during fit.
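A sketch of the recursion described in the quoted text; `update_feature_names` is the placeholder name being debated here, not a real method:

```python
class Pipeline:
    """Stripped-down illustration; real Pipeline logic elided."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def update_feature_names(self, input_features):
        names = input_features
        for _, step in self.steps:
            # record the names this step receives ...
            step.input_feature_names_ = names
            # ... and compute the names it hands to the next step
            names = step.update_feature_names(names)
        return names

    def fit(self, X, y=None):
        # ... fit all steps as usual, then, at the end:
        if hasattr(X, "columns"):
            self.update_feature_names(list(X.columns))
        return self
```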
> - Transformers based on arbitrary functions
>
> Should all estimators (so including regressors, classifiers, ...) have an "update feature names" method?
I think this method is not properly motivated in the SLEP.

The use case is that X has no column names, right? We fitted a pipeline on a numpy array and we also have feature names, and now we want to get the output features.

It might make sense to distinguish the cases where X contains the feature names in the columns and where it doesn't, because in the first case everything can be automatic.
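To make the two cases concrete with the API of that era (`get_feature_names` taking an `input_features` argument, as e.g. `PolynomialFeatures` already supported):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)  # plain numpy array: no column names
poly = PolynomialFeatures(degree=2).fit(X)

# the user supplies the names that X itself cannot carry
poly.get_feature_names(input_features=["age", "height"])
# ['1', 'age', 'height', 'age^2', 'age height', 'height^2']
```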
> For consistency, we could also add them to *all* estimators.
>
> For a regressor or classifier, the method could set the ``input_feature_names_``
Why? You mean the output feature names, right?
Regressors don't have output features?
But in general, I am not fully sure anymore what I was thinking here for this section. It all depends on where we decide the responsibility lies to set the attributes (does the parent pipeline set the attributes and then the "update feature names" method looks first at the attribute, or does the parent pipeline pass the names to the "update feature names" method which then sets the attribute, or ...).
> This SLEP does not affect backward compatibility, as all described attributes
> and methods would be new ones, not affecting existing ones.
Well, if we reuse `get_feature_names` then we add a new parameter in some cases, but the old behavior still works.
> with a set of custom input feature names that are not identical to the original
> DataFrame column names, the stored column names to do validation and the stored
> column names to propagate the feature names would get out of sync. Or should
> calling ``get_feature_names`` also affect future validation in a ``predict()``
I would vote for this option.
There's no discussion of what the vectorizers do with their input feature names, if anything. Is that even allowed?
> transformative "update feature names" method and calling it recursively in the
> Pipeline setting the ``input_feature_names_`` attribute on each step.
>
> 1. Only implement the "update feature names" method and require the user to
I'm not sure if this is the correct distinction but the main point is to always just operate on output feature names, never on input feature names, right?
I don't fully understand this comment. What do you mean by "operating on output feature names"?
(This alternative of course depends on the idea of having such an "update feature names" method that does the work, but if we decide that it should actually happen in `fit`, that would change things.)
Coming back to this, I feel like I now favor a more invasive approach. a) requires a lot of code changes but is pretty nice otherwise, while b) requires no code changes to the fit methods, but creates a new custom class, which could be pretty confusing. If we always have the feature names in `fit`, I feel that it would be good if we can eliminate having a public method. That means if you need to change your feature names, you have to refit again. An expert might use private methods to prevent this, but I think it's not that important a use-case. Therefore my preferred solution right now:

The main question is then how to implement 1) in the case of numpy arrays, and I think the three options are setting it beforehand, passing it in as an argument, and passing it in as a custom class. Given that 2) requires touching every
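A sketch of the "pass it in as an argument" option; the `feature_names_in` parameter is hypothetical and only illustrates the shape of such an API:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MyTransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None, feature_names_in=None):
        # names travel alongside X, so plain numpy arrays work too;
        # note this breaks the convention that fit params are sample-aligned
        if feature_names_in is None and hasattr(X, "columns"):
            feature_names_in = list(X.columns)
        self.input_feature_names_ = feature_names_in
        return self

    def transform(self, X):
        return np.asarray(X)


MyTransformer().fit(np.zeros((10, 3)), feature_names_in=["a", "b", "c"])
```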
The downside of passing the feature names into
I don't think special-casing passthrough is as big a deal as changing the
current convention that things passed to fit are sample-aligned. That
affects a whole bunch of meta-estimators, including any defined outside of
sklearn, doesn't it?
Passing the names to fit is by far the most explicit design. I am curious
to know what subclassing looks like... But it makes me think that if it
weren't for sparse matrices, we'd be better off just using DataFrames for
interchange with column names, since wrapping and unwrapping should be fast
with homogeneous dtype...
@jnothman I thought there was a concern that pandas might become a column store and wrapping and unwrapping become non-trivial operations? @jorisvandenbossche surely knows more.

That would be kind of "pandas in"-"pandas out" then, but only applied to X, which would allow a lot of things already. Yes, the passthrough is not actually a big deal, I was just annoyed I actually have to think about the code ;)

Not having sample-aligned fit parameters certainly breaks with tradition. If non-sklearn meta-estimators handle them as kwargs, they should assume they are sample aligned, so this will break backward compatibility. Which might show that allowing kwargs wasn't / isn't a good way to do sample props? We could go even a step further and do

I haven't finished writing the subclassing thing, but I think it's pretty trivial. We add a function

BTW any of these designs get rid of the weird subestimator discovery I had to do, because now everything is done in
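For reference, a minimal sketch of the subclassing idea (an ndarray carrying feature names; class and attribute names are illustrative, not the prototype being written):

```python
import numpy as np


class NamedArray(np.ndarray):
    """Minimal ndarray subclass that carries feature names along."""

    def __new__(cls, input_array, feature_names=None):
        obj = np.asarray(input_array).view(cls)
        obj.feature_names = list(feature_names) if feature_names else None
        return obj

    def __array_finalize__(self, obj):
        # called on views and new-from-template arrays:
        # propagate the names (naively, without subsetting on slices)
        self.feature_names = getattr(obj, "feature_names", None)


X = NamedArray(np.eye(3), feature_names=["a", "b", "c"])
print(X.feature_names)        # ['a', 'b', 'c']
print((X * 2).feature_names)  # ['a', 'b', 'c'] — survives arithmetic
```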
Asked pandas: pandas-dev/pandas#27211
Also asked xarray:

I think the answer from pandas is as I remembered, which is they might change to using 1d slices, and while conversion to pandas might be possible, coming back is not possible without a copy.

It seems a bit unnatural to me if we'd produce DataArrays if someone passes in a DataFrame, but if DataFrame is not an option then we should consider it.
@jorisvandenbossche also brought up the option of natively supporting pandas in some estimators and not requiring conversion at all. That might be possible for many of the preprocessing transformers, but not for all. It's certainly desirable to avoid copies, but I don't think it'll provide a full solution to the feature names issue.

tldr; not casting pandas to numpy when we don't have to would be nice, but it probably won't solve the feature name issue.
If we go down the path of having a

@jnothman do you see major drawbacks to using an

@amueller are you already working on a
Stephan said in pydata/xarray#3077 that it's a core property of xarray to do zero copy for numeric dtypes, so I think it would be a feasible candidate.

@adrinjalali attaching sample props to X wouldn't solve the routing issues, right, as in where to use

I haven't started implementing a

Right now,
Alternative: create a scikit-learn 1.0 beta with feature names using duck arrays relying on numpy 1.7 ;)
Late to the party (catching up): no problem on the numpy 1.7 requirement.
I think the requirement is actually NumPy 1.17 for `__array_function__`?
> Late to the party (catching up): no problem on the numpy 1.7 requirement.
> I think the requirement is actually NumPy 1.17 for __array_function__?

Yes, at some point (after writing this), my brain reconnected and became
less dumb.

I'm very much not in favor of bumping the requirement to 1.17 for a
1.0 beta.

It's important to have in mind that a core contribution of scikit-learn
to the ecosystem is good numerical algorithms. In my view, it is the most
important one. Sugar on top that makes these good numerical algorithms
easier to use is good. However, it is no use without the good numerics.
Making the numerical algorithms harder to access (for instance with
hard-to-meet requirements) for sugar on top is narrowing their impact on
scientific and data applications.
I'm not sure how using
I don't think it is directly blocked by #22 (

I think the status is that we should update the SLEP with the alternative being discussed above (or, in case that is preferred, write an alternative SLEP for it, but personally I think it would be good to describe the possible options in this SLEP).
I think the alternative which we mostly agree with is the one proposed in scikit-learn/scikit-learn#14315. We either need a new SLEP, or to update this one to reflect the ideas there.
> People can still rely on older sklearn if they really want to use old numpy.
Telling users that they can either use old versions of sklearn or upgrade
everything makes it hard for users to have a buffer where they
progressively do upgrades. In an ideal world, upgrades should be minor
events that we do, like cleaning our room (I'm terrible at cleaning my
room). The problem is when upgrades become large endeavors. First, we
need to schedule significant time slots for them, second more things are
likely to go wrong together. For instance: if upgrading scikit-learn
triggers a significant upgrade in numpy, which itself triggers a
significant upgrade in pandas, which itself comes with behavior changes.
It's then harder for the user to audit the changes on his analysis or
production tasks. The user is more likely to delay the upgrade, and the
problem becomes worse.
> Besides, if we're talking about v1.0, we should be able to think about such changes.
It's all a question of leaving the buffer: if v1.0 happens in two years,
then yes.
I think that an ideal situation would be if we can gradually ramp up the
requirements of scikit-learn. In other terms: if our requirements are
versions that are old enough for users to have had time to adjust,
typically because they are installed by the rolling upgrade of company
policy (which is unfortunately something that varies widely).
@GaelVaroquaux My idea was to have a config option that explicitly enables the new behavior and that this config would fail with older numpy (i.e. the config option would have a soft dependency on numpy 1.17). If you have an idea how to implement something like named arrays without numpy 1.17, I think we're all ears. The alternative would be to implement feature names via a different mechanism.
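A toy sketch of such a soft dependency; the flag and helper are purely hypothetical, not scikit-learn API:

```python
import numpy as np

_config = {"feature_names": False}  # hypothetical global flag


def enable_feature_names():
    # the new behavior would rely on __array_function__ (NumPy >= 1.17),
    # so only this opt-in fails on older NumPy, not scikit-learn itself
    if np.lib.NumpyVersion(np.__version__) < "1.17.0":
        raise RuntimeError("feature name propagation requires NumPy >= 1.17")
    _config["feature_names"] = True
```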
@hermidalc I agree that more metadata would be nice, but actually we haven't even solved this problem with
I would suggest we write a separate new SLEP on the NamedArray. @adrinjalali do you want to do that? You could reuse parts of this one if appropriate.
> @GaelVaroquaux My idea was to have a config option that explicitly enables the new behavior and that this config would fail with older numpy

I think that this is a good suggestion!
To be clear, the current implementation of NamedArray only requires numpy 1.13, which is what we'll have once we bump our python support to 3.6 anyway. So that's not a big issue. However, if we want to include more features in the NamedArray as @thomasjpfan has also suggested, then we'll need 1.17.

@amueller yes, I'll write up a new SLEP.
This SLEP is about how to propagate feature names, whereas SLEP007 is about how to generate them. @thomasjpfan might have something which would supersede this.
I think SLEP007 covers both generation and propagation and supersedes this SLEP. In SLEP007's abstract it states:

Also, the feature name propagation described in SLEP007 is consistent with the implementation on
We still don't really have a way to propagate feature names during `fit`.
This SLEP does not propose propagating names during `fit`.
It's been discussed in the conversations, though maybe not addressed, e.g.: #18 (comment)
I thought I'd do some spring cleaning 🧹. From a quick read over the text, I did not notice a difference to the now (v1.1) implemented `get_feature_names_out`. Maybe we can discuss our SLEP plans during the next dev meeting?
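For reference, the v1.1 behavior referred to here (assuming scikit-learn >= 1.1):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"age": [20, 30, 40], "height": [1.6, 1.7, 1.8]})
pipe = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)

print(pipe[0].get_feature_names_out())  # ['age' 'height']
print(pipe.get_feature_names_out())     # ['pca0' 'pca1']
```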
SLEP 18 with pandas out will be a partial solution to the problem of propagating feature names in pipelines at fit time when all transformers output dense values that naturally fit in a pandas dataframe. How to propagate feature names for transformers that typically output sparse matrices (e.g.
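For comparison, a sketch of the SLEP018 `set_output` API (released in scikit-learn 1.2), which covers the dense case:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"age": [20, 30, 40], "height": [1.6, 1.7, 1.8]})

scaler = StandardScaler().set_output(transform="pandas")
X_out = scaler.fit_transform(X)
print(type(X_out).__name__)  # DataFrame
print(list(X_out.columns))   # ['age', 'height']
```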
Happy to have SLEP18 instead of this one then.
Apologies, I've been out of the loop for a while regarding SLEP enhancements; I thought this was implemented in the current release? So you cannot get feature names out of the end of a
Right now you can only access the feature names outside the
With much delay, I did a quick clean-up of the draft I wrote at the beginning of March, at the end of the sprint. So here is an initial version of the SLEP on propagating feature names through pipelines.

The PR implementing it is scikit-learn/scikit-learn#13307