Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
SLEP 8: Propagating feature names #18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SLEP 8: Propagating feature names #18
Changes from 2 commits
d7f804a
79c51dd
c1f642a
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talked about having it propagated through the pipeline as the pipeline goes, so that in each step of the pipeline the model could potentially use those names. That's slightly different than recursively calling it to get the names once the pipeline has been
fit
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we should mention that but maybe you can provide a suggestion for motivation and implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's maybe partly related to what I mentioned below in one of the questions about standalone estimators (not in a pipeline). If we want those to behave similarly, the
fit
method of the estimator needs to do something (at least, with the current proposal, calling the "update feature names" method). But if we actually letfit
handle the actual feature name logic (needed for the above suggestion), that directly solves the issue of standalone vs within-pipeline consistency.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe say that the name is up for discussion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe explain why that is necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @adrinjalali says, there is also the possibility to set them during fit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure at first whether this was referring to Pipeline.fit, or to step.fit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would this only be done by pipelines? Why not for a single transformer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I think we should indicate that pipelines will also set the attribute for the last step (which might not be a transformer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I like
feature_names_in
andfeature_names_out
, similar to_n_features_in
(and _n_features_out), see #13603 for the names themselves.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, me too. Will move this suggestion a bit up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for having them [almost] everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's also the option of having a
fit
which does some common tasks such as setting the feature names, and letting the child classes only implement_fit
. It kinda goes along the lines of what's being done in scikit-learn/scikit-learn#13603There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one-to-one is not entirely obvious if we want to actually have a computation graph. But I think we said this is out of scope for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this method is not properly motivated in the SLEP.
The use case is that X has no column names, right? We fitted a pipeline on a numpy array and we also have feature names and now we want to get the output features.
It might make sense to distinguish the cases where X contains the feature names in the column and where it doesn't because in the first case everything can be automatic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? you mean the output feature names, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regressors don't have output features?
But in general, I am not fully sure anymore what I was thinking here for this section. It all depends on where what we decide where the responsibility lies to set the attributes (does parent pipeline set the attributes and then does the "update feature names" method look first at the attribute, or does the parent pipeline pass the names to the "update feature names" method which then sets the attribute, or ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would vote for this option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add common checks, there might be some "backward compatibility issues" much like for #22. Maybe that's just something to note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well if we reuse
get_feature_names
then we add a new parameter in some cases but the old behavior still works.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is the correct distinction but the main point is to always just operate on output feature names, never on input feature names, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand this comment. What do you man with "operating on output feature names" ?
(this alternative of course depends on the idea of having such a "update feature names" method that does the work, but if we decide that actually it should happen in fit that would change things)