SLEP 8: Propagating feature names #18

Conversation
> gets recursively called on each step of a ``Pipeline`` so that the feature
> names get propagated throughout the full ``Pipeline``. This will allow
> inspecting the input and output feature names in each step of a ``Pipeline``.
We talked about having it propagated through the pipeline as the pipeline goes, so that in each step of the pipeline the model could potentially use those names. That's slightly different than recursively calling it to get the names once the pipeline has been fit.
Yes, we should mention that, but maybe you can provide a suggestion for motivation and implementation?
We talked about having it propagated through the pipeline as the pipeline goes, so that in each step of the pipeline the model could potentially use those names.

That's maybe partly related to what I mentioned below in one of the questions about standalone estimators (not in a pipeline). If we want those to behave similarly, the `fit` method of the estimator needs to do something (at least, with the current proposal, calling the "update feature names" method). But if we actually let `fit` handle the actual feature name logic (needed for the above suggestion), that directly solves the issue of standalone vs. within-pipeline consistency.
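A minimal sketch of what letting `fit` handle the naming could look like; the attribute names (`input_feature_names_`, `output_feature_names_`) follow this discussion and are not an actual scikit-learn API:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SelectFirstK(TransformerMixin, BaseEstimator):
    """Toy transformer keeping the first k columns (illustration only)."""

    def __init__(self, k=2):
        self.k = k

    def fit(self, X, y=None):
        # fit itself records and derives the names, so a standalone
        # estimator behaves exactly like a step inside a Pipeline
        if hasattr(X, "columns"):  # e.g. a pandas DataFrame
            self.input_feature_names_ = list(X.columns)
            self.output_feature_names_ = self.input_feature_names_[: self.k]
        return self

    def transform(self, X):
        return np.asarray(X)[:, : self.k]
```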
> potentially removing the need to have an explicit output feature names *getter
> method*. The "update feature names" method would then mainly be used for
> setting the input features and making sure they get propagated.
+1 for having them [almost] everywhere.
> standing estimators and Pipelines. However, the clear downside of this
> consistency is that this would add one line to each ``fit`` method throughout
> scikit-learn.
There's also the option of having a `fit` which does some common tasks such as setting the feature names, and letting the child classes only implement `_fit`. It kinda goes along the lines of what's being done in scikit-learn/scikit-learn#13603.
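A rough sketch of that pattern, assuming a hypothetical `_fit` hook (names illustrative; see scikit-learn/scikit-learn#13603 for the real discussion):

```python
from sklearn.base import BaseEstimator


class FeatureNamesBase(BaseEstimator):
    """Hypothetical base class: fit() does the shared bookkeeping,
    subclasses only implement _fit()."""

    def fit(self, X, y=None, **fit_params):
        # common task shared by all estimators: record input names
        if hasattr(X, "columns"):
            self.input_feature_names_ = list(X.columns)
        return self._fit(X, y, **fit_params)


class MyEstimator(FeatureNamesBase):
    def _fit(self, X, y=None):
        # estimator-specific fitting logic goes here
        return self
```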
There was also the concern that the user may want to disable this propagation. (I think this SLEP hasn't addressed that case yet.)
Can you elaborate? I don't remember that part.
I think it was specifically in the context of NLP related usecases where the whole "dictionary" becomes the features and it may be very memory intensive to store them. IIRC @jnothman raised the concern.
> I think it was specifically in the context of NLP related usecases where
> the whole "dictionary" becomes the features and it may be very memory
> intensive to store them. IIRC @jnothman <https://github.com/jnothman> raised
> the concern.

Well, our CountVectorizer is currently quite naive about this w.r.t. ngrams,
but we would be forcing this idea of storing feature names in sparse spaces
on people who have been using other vectorization tools.
I think the SLEP needs a list of use-cases, in particular comparing the pandas and non-pandas ones and checking if there are other relevant cases.
Do we ever actually change feature names that have been set? Maybe to simplify them?
> The core idea of this proposal is that all transformers get a transformative
> *"update feature names"* method that can determine the output feature names,
Maybe say that the name is up for discussion?
> The ``Pipeline`` and all meta-estimators implement this method by calling it
> recursively on all its child steps / sub-estimators, and in this way the input
> feature names get propagated through the full pipeline. In addition, it sets
> the ``input_feature_names_`` attribute on each step of the pipeline.
Maybe explain why that is necessary?
> feature names get propagated through the full pipeline. In addition, it sets
> the ``input_feature_names_`` attribute on each step of the pipeline.
>
> A Pipeline calls the method at the end of ``fit()`` (using the DataFrame column
As @adrinjalali says, there is also the possibility to set them during fit.
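A sketch of the recursion described in the quoted text; `update_feature_names` is the placeholder name being debated here, not a real method:

```python
class Pipeline:
    """Stripped-down illustration; real Pipeline logic elided."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def update_feature_names(self, input_features):
        names = input_features
        for _, step in self.steps:
            # record the names this step receives ...
            step.input_feature_names_ = names
            # ... and compute the names it hands to the next step
            names = step.update_feature_names(names)
        return names

    def fit(self, X, y=None):
        # ... fit all steps as usual, then, at the end:
        if hasattr(X, "columns"):
            self.update_feature_names(list(X.columns))
        return self
```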
> - Transformers based on arbitrary functions
>
> Should all estimators (so including regressors, classifiers, ...) have an "update feature names" method?
I think this method is not properly motivated in the SLEP.

The use case is that X has no column names, right? We fitted a pipeline on a numpy array and we also have feature names, and now we want to get the output features.

It might make sense to distinguish the cases where X contains the feature names in the columns and where it doesn't, because in the first case everything can be automatic.
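To make the two cases concrete with the API of that era (`get_feature_names` taking an `input_features` argument, as e.g. `PolynomialFeatures` already supported):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)  # plain numpy array: no column names
poly = PolynomialFeatures(degree=2).fit(X)

# the user supplies the names that X itself cannot carry
poly.get_feature_names(input_features=["age", "height"])
# ['1', 'age', 'height', 'age^2', 'age height', 'height^2']
```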
> For consistency, we could also add them to *all* estimators.
>
> For a regressor or classifier, the method could set the ``input_feature_names_``
Why? You mean the output feature names, right?
Regressors don't have output features?
But in general, I am not fully sure anymore what I was thinking here for this section. It all depends on where we decide the responsibility lies to set the attributes (does the parent pipeline set the attributes and then the "update feature names" method looks first at the attribute, or does the parent pipeline pass the names to the "update feature names" method which then sets the attribute, or ...).
> This SLEP does not affect backward compatibility, as all described attributes
> and methods would be new ones, not affecting existing ones.
Well, if we reuse `get_feature_names` then we add a new parameter in some cases, but the old behavior still works.
> with a set of custom input feature names that are not identical to the original
> DataFrame column names, the stored column names to do validation and the stored
> column names to propagate the feature names would get out of sync. Or should
> calling ``get_feature_names`` also affect future validation in a ``predict()``
I would vote for this option.
There's no discussion of what the vectorizers do with their input feature names, if anything. Is that even allowed?
> transformative "update feature names" method and calling it recursively in the
> Pipeline setting the ``input_feature_names_`` attribute on each step.
>
> 1. Only implement the "update feature names" method and require the user to
I'm not sure if this is the correct distinction but the main point is to always just operate on output feature names, never on input feature names, right?
I don't fully understand this comment. What do you mean by "operating on output feature names"?
(This alternative of course depends on the idea of having such an "update feature names" method that does the work, but if we decide that it should actually happen in `fit`, that would change things.)
Coming back to this, I feel like I now favor a more invasive approach. a) requires a lot of code changes but is pretty nice otherwise, while b) requires no code changes to the fit methods, but creates a new custom class, which could be pretty confusing. If we always have the feature names in `fit`, I feel that it would be good if we can eliminate having a public method. That means if you need to change your feature names, you have to refit again. An expert might use private methods to prevent this, but I think it's not that important a use-case. Therefore my preferred solution right now:

The main question is then how to implement 1) in the case of numpy arrays, and I think the three options are setting it beforehand, passing it in as an argument, and passing it in as a custom class. Given that 2) requires touching every
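A sketch of the "pass it in as an argument" option; the `feature_names_in` parameter is hypothetical and only illustrates the shape of such an API:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MyTransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None, feature_names_in=None):
        # names travel alongside X, so plain numpy arrays work too;
        # note this breaks the convention that fit params are sample-aligned
        if feature_names_in is None and hasattr(X, "columns"):
            feature_names_in = list(X.columns)
        self.input_feature_names_ = feature_names_in
        return self

    def transform(self, X):
        return np.asarray(X)


MyTransformer().fit(np.zeros((10, 3)), feature_names_in=["a", "b", "c"])
```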
The downside of passing the feature names into
I don't think special-casing passthrough is as big a deal as changing the
current convention that things passed to fit are sample-aligned. That
affects a whole bunch of meta-estimators, including any defined outside of
sklearn, doesn't it?
Passing the names to fit is by far the most explicit design. I am curious
to know what subclassing looks like... But it makes me think that if it
weren't for sparse matrices, we'd be better off just using DataFrames for
interchange with column names, since wrapping and unwrapping should be fast
with homogeneous dtype...
@jnothman I thought there was a concern that pandas might become a column store and wrapping and unwrapping become non-trivial operations? @jorisvandenbossche surely knows more.

That would be kind of "pandas in"-"pandas out" then, but only applied to X, which would allow a lot of things already. Yes, the passthrough is not actually a big deal, I was just annoyed I actually have to think about the code ;)

Not having sample-aligned fit parameters certainly breaks with tradition. If non-sklearn meta-estimators handle them as kwargs, they should assume they are sample aligned, so this will break backward compatibility. Which might show that allowing kwargs wasn't / isn't a good way to do sample props? We could go even a step further and do

I haven't finished writing the subclassing thing, but I think it's pretty trivial. We add a function

BTW any of these designs get rid of the weird subestimator discovery I had to do, because now everything is done in
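For reference, a minimal sketch of the subclassing idea (an ndarray carrying feature names; class and attribute names are illustrative, not the prototype being written):

```python
import numpy as np


class NamedArray(np.ndarray):
    """Minimal ndarray subclass that carries feature names along."""

    def __new__(cls, input_array, feature_names=None):
        obj = np.asarray(input_array).view(cls)
        obj.feature_names = list(feature_names) if feature_names else None
        return obj

    def __array_finalize__(self, obj):
        # called on views and new-from-template arrays:
        # propagate the names (naively, without subsetting on slices)
        self.feature_names = getattr(obj, "feature_names", None)


X = NamedArray(np.eye(3), feature_names=["a", "b", "c"])
print(X.feature_names)        # ['a', 'b', 'c']
print((X * 2).feature_names)  # ['a', 'b', 'c'] — survives arithmetic
```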
Asked pandas: pandas-dev/pandas#27211
Also asked xarray:

I think the answer from pandas is as I remembered, which is they might change to using 1d slices, and while conversion to pandas might be possible, coming back is not possible without a copy.

It seems a bit unnatural to me if we'd produce DataArrays if someone passes in a DataFrame, but if DataFrame is not an option then we should consider it.
@jorisvandenbossche also brought up the option of natively supporting pandas in some estimators and not requiring conversion at all. That might be possible for many of the preprocessing transformers, but not for all. It's certainly desirable to avoid copies, but I don't think it'll provide a full solution to the feature names issue.

tldr; not casting pandas to numpy when we don't have to would be nice, but it probably won't solve the feature name issue.
If we go down the path of having a

@jnothman do you see major drawbacks to using an

@amueller are you already working on a
Stephan said in pydata/xarray#3077 that it's a core property of xarray to do zero copy for numeric dtypes, so I think it would be a feasible candidate.

@adrinjalali attaching sample props to X wouldn't solve the routing issues, right, as in where to use

I haven't started implementing a

Right now,
Alternative: create a scikit-learn 1.0 beta with feature names using duck arrays relying on numpy 1.7 ;)
Late to the party (catching up): no problem on the numpy 1.7 requirement.
I think the requirement is actually NumPy 1.17 for `__array_function__`?
> Late to the party (catching up): no problem on the numpy 1.7 requirement.
> I think the requirement is actually NumPy 1.17 for __array_function__?

Yes, at some point (after writing this), my brain reconnected and became
less dumb.

I'm very much not in favor of bumping the requirement to 1.17 for a
1.0 beta.

It's important to have in mind that a core contribution of scikit-learn
to the ecosystem is good numerical algorithms. In my view, it is the most
important one. Sugar on top that makes these good numerical algorithms
easier to use is good. However, it is no use without the good numerics.
Making the numerical algorithms harder to access (for instance with
hard-to-meet requirements) for sugar on top is narrowing their impact on
scientific and data applications.
I'm not sure how using
I don't think it is directly blocked by #22 (

I think the status is that we should update the SLEP with the alternative being discussed above (or, in case that is preferred, write an alternative SLEP for it, but personally I think it would be good to describe the possible options in this SLEP).
I think the alternative which we mostly agree with is the one proposed in scikit-learn/scikit-learn#14315. We either need a new SLEP, or to update this one to reflect the ideas there.
> People can still rely on older sklearn if they really want to use old numpy.
Telling users that they can either use old versions of sklearn or upgrade
everything makes it hard for users to have a buffer where they
progressively do upgrades. In an ideal world, upgrades should be minor
events that we do, like cleaning our room (I'm terrible at cleaning my
room). The problem is when upgrades become large endeavors. First, we
need to schedule significant time slots for them, second more things are
likely to go wrong together. For instance: if upgrading scikit-learn
triggers a significant upgrade in numpy, which itself triggers a
significant upgrade in pandas, which itself comes with behavior changes.
It's then harder for the user to audit the changes on his analysis or
production tasks. The user is more likely to delay the upgrade, and the
problem becomes worse.
> Besides, if we're talking about v1.0, we should be able to think about such changes.
It's all a question of leaving the buffer: if v1.0 happens in two years,
then yes.
I think that an ideal situation would be if we can gradually ramp up the
requirements of scikit-learn. In other terms: if our requirements are
versions that are old enough for users to have had time to adjust,
typically because they are installed by the rolling upgrade of company
policy (which is unfortunately something that varies widely).
@GaelVaroquaux My idea was to have a config option that explicitly enables the new behavior and that this config would fail with older numpy (i.e. the config option would have a soft dependency on numpy 1.17). If you have an idea how to implement something like named arrays without numpy 1.17, I think we're all ears. The alternative would be to implement feature names via a different mechanism.
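A toy sketch of such a soft dependency; the flag and helper are purely hypothetical, not scikit-learn API:

```python
import numpy as np

_config = {"feature_names": False}  # hypothetical global flag


def enable_feature_names():
    # the new behavior would rely on __array_function__ (NumPy >= 1.17),
    # so only this opt-in fails on older NumPy, not scikit-learn itself
    if np.lib.NumpyVersion(np.__version__) < "1.17.0":
        raise RuntimeError("feature name propagation requires NumPy >= 1.17")
    _config["feature_names"] = True
```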
@hermidalc I agree that more metadata would be nice, but actually we haven't even solved this problem with
I would suggest we write a separate new SLEP on the NamedArray. @adrinjalali do you want to do that? You could reuse parts of this one if appropriate.
> @GaelVaroquaux My idea was to have a config option that explicitly enables the new behavior and that this config would fail with older numpy

I think that this is a good suggestion!
To be clear, the current implementation of NamedArray only requires numpy 1.13, which is what we'll have once we bump our python support to 3.6 anyway. So that's not a big issue. However, if we want to include more features in the NamedArray as @thomasjpfan has also suggested, then we'll need 1.17.

@amueller yes, I'll write up a new SLEP.
This SLEP is about how to propagate feature names, whereas SLEP007 is about how to generate them. @thomasjpfan might have something which would supersede this.
I think SLEP007 covers both generation and propagation and supersedes this SLEP. In SLEP007's abstract it states:

Also, the feature name propagation described in SLEP007 is consistent with the implementation on
We still don't really have a way to propagate feature names during `fit`.
This SLEP does not propose propagating names during `fit`.
It's been discussed in the conversations, though maybe not addressed, e.g.: #18 (comment)
I thought I'd do some spring cleaning 🧹. From a quick read over the text, I did not notice a difference to the now (v1.1) implemented `get_feature_names_out`. Maybe we can discuss our SLEP plans during the next dev meeting?
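For reference, the v1.1 behavior referred to here (assuming scikit-learn >= 1.1):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"age": [20, 30, 40], "height": [1.6, 1.7, 1.8]})
pipe = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)

print(pipe[0].get_feature_names_out())  # ['age' 'height']
print(pipe.get_feature_names_out())     # ['pca0' 'pca1']
```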
SLEP 18 with pandas out will be a partial solution to the problem of propagating feature names in pipelines at fit time when all transformers output dense values that naturally fit in a pandas dataframe. How to propagate feature names for transformers that typically output sparse matrices (e.g.
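For comparison, a sketch of the SLEP018 `set_output` API (released in scikit-learn 1.2), which covers the dense case:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"age": [20, 30, 40], "height": [1.6, 1.7, 1.8]})

scaler = StandardScaler().set_output(transform="pandas")
X_out = scaler.fit_transform(X)
print(type(X_out).__name__)  # DataFrame
print(list(X_out.columns))   # ['age', 'height']
```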
Happy to have SLEP18 instead of this one then.
Apologies, I've been out of the loop for a while regarding SLEP enhancements; I thought this was implemented in the current release? So you cannot get feature names out of the end of a
Right now you can only access the feature names outside the
With much delay, I did a quick clean-up of the draft I wrote at the beginning of March, at the end of the sprint. So here is an initial version of the SLEP on propagating feature names through pipelines.

The PR implementing it is scikit-learn/scikit-learn#13307