
VOTE SLEP018 - Pandas Output for Transformers #72


Merged
merged 10 commits into from
Aug 19, 2022

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jul 17, 2022

This PR is for us to discuss and collect votes for SLEP018 - Pandas Output for Transformers. The current implementation is available at scikit-learn/scikit-learn#23734. Note that this vote is for the API and the implementation can be adjusted.

According to our governance model, the vote will be open for a month (till 17th August), and the motion is accepted if 2/3 of the cast votes are in favor.

@scikit-learn/core-devs

@adrinjalali
Member

I'd say most users don't have a separate pipeline for their transform steps and another pipeline for adding the final predictor. What would a usual pipeline look like? Should users do something like pipeline[:-1].set_output(...), or can they call set_output on the whole pipeline and expect it to apply only to steps with an available transform method?

Otherwise I'm happy with the SLEP.

@thomasjpfan
Member Author

can they call set_output on the pipeline and expect that to apply only to steps with an available transform method?

This is the behavior I implemented and was going for. pipeline.set_output(transform="pandas") will only configure steps that can transform.
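The dispatch described here can be sketched in plain Python. The class and dummy estimators below (`MiniPipeline`, `DummyScaler`, `DummyClassifier`) are hypothetical stand-ins, not scikit-learn code; they only illustrate the rule that configuration is forwarded to transform-capable steps and skips the final predictor:

```python
class MiniPipeline:
    """Hypothetical Pipeline-like container illustrating the dispatch."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def set_output(self, *, transform=None):
        # Forward the configuration only to steps that can transform;
        # a final predictor without `transform` is left untouched.
        for _, est in self.steps:
            if hasattr(est, "transform") and hasattr(est, "set_output"):
                est.set_output(transform=transform)
        return self


class DummyScaler:
    output = "default"

    def transform(self, X):
        return X

    def set_output(self, *, transform=None):
        self.output = transform
        return self


class DummyClassifier:
    output = "default"  # no transform method, so never reconfigured

    def predict(self, X):
        return [0 for _ in X]


scaler, clf = DummyScaler(), DummyClassifier()
pipe = MiniPipeline([("scale", scaler), ("clf", clf)])
pipe.set_output(transform="pandas")
print(scaler.output, clf.output)  # → pandas default
```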

@adrinjalali
Member

Then the example in the SLEP could also mirror that to make it clear. But it's a +1 for me anyway :)

@thomasjpfan
Member Author

In this SLEP, I updated the pipeline example to include a classifier, showcasing how set_output can be called on the whole pipeline while only the transformers are configured.

@amueller
Member

+1

@jnothman
Member

So can I clarify that Pipeline is exceptional in the sense that it is the only non-transformer that has a set_output method (and that it only affects the output of the Pipeline if either it is also a transformer, or some pipeline components behave differently with different input)?

(Do we have other non-transformers that have a transformer for a parameter, aside from TransformedTargetRegressor?)

@thomasjpfan
Member Author

Do we have other non-transformers that have a transformer for a parameter, aside from TransformedTargetRegressor?

GridSearchCV can define a transform method if the underlying estimator is a transformer.
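This delegation pattern can be illustrated with a small sketch. `MiniSearch` below is a hypothetical wrapper, not the real GridSearchCV implementation; it shows how a meta-estimator can expose `transform` exactly when its inner estimator is a transformer:

```python
class MiniSearch:
    """Hypothetical search wrapper that exposes ``transform`` only when
    the wrapped estimator is itself a transformer."""

    def __init__(self, estimator):
        self.best_estimator_ = estimator

    def __getattr__(self, name):
        # Delegate `transform` (and nothing else) to the inner estimator,
        # so the wrapper "is" a transformer exactly when the inner one is.
        if name == "transform" and hasattr(self.best_estimator_, "transform"):
            return self.best_estimator_.transform
        raise AttributeError(name)


class Scaler:
    def transform(self, X):
        return [[v * 2 for v in row] for row in X]


search = MiniSearch(Scaler())
print(search.transform([[1, 2]]))  # → [[2, 4]]
print(hasattr(MiniSearch(object()), "transform"))  # → False
```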

Thinking about it more, the special case for Pipeline can influence other meta-estimators. For example, a VotingClassifier with many pipelines:

voting = VotingClassifier([
    ("pipe1", pipe1), ("pipe2", pipe2), ("pipe3", pipe3)
])

# If `VotingClassifier` defines a `set_output`, the whole ensemble can be configured with:
voting.set_output(transform="pandas")

# If not, then every pipeline needs to be set individually:
voting2 = VotingClassifier([
    ("pipe1", pipe1.set_output(transform="pandas")), ...
])

For a better UX, I think all first-party meta-estimators should define set_output and configure all their inner estimators. Specifically, for a meta-estimator:

  1. If the inner estimator has a set_output, call it (the case of Pipeline).
  2. If the inner estimator is a transformer, then call set_output on it.
  3. Otherwise do nothing.
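The three rules above can be sketched as a small helper. `configure_inner` and the `_output_config` attribute are hypothetical names for illustration, not scikit-learn API:

```python
def configure_inner(est, *, transform):
    """Hypothetical helper applying the three dispatch rules to one
    inner estimator of a meta-estimator."""
    if hasattr(est, "set_output"):
        # Rule 1: the inner estimator knows how to configure itself
        # (e.g. a Pipeline).
        est.set_output(transform=transform)
    elif hasattr(est, "transform"):
        # Rule 2: a transformer without set_output -- record the request
        # on a (hypothetical) attribute so transform could honor it.
        est._output_config = transform
    # Rule 3: plain predictors are left untouched.
    return est


class WithSetOutput:
    def set_output(self, *, transform=None):
        self.cfg = transform
        return self


class PlainTransformer:
    def transform(self, X):
        return X


class Predictor:
    def predict(self, X):
        return X


a = configure_inner(WithSetOutput(), transform="pandas")
b = configure_inner(PlainTransformer(), transform="pandas")
c = configure_inner(Predictor(), transform="pandas")
print(a.cfg, b._output_config, hasattr(c, "_output_config"))  # → pandas pandas False
```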

@ogrisel
Member

ogrisel commented Jul 20, 2022

+1.

@glemaitre glemaitre left a comment
Member

+1

@amueller
Copy link
Member

amueller commented Aug 5, 2022

ping @GaelVaroquaux ;)

@GaelVaroquaux GaelVaroquaux left a comment
Member

Thanks for the thoughtful SLEP, the discussions and the prototype, @thomasjpfan.

Thanks for the ping, @amueller :)

@jjerphan jjerphan left a comment
Member

+1

Thank you for the heads-up, @thomasjpfan.

Here are some minor nitpicks.

Comment on lines 37 to 38
``ValueError`` if ``set_output(transform="pandas")``. Dealing with sparse output
might be the scope of another future SLEP.

Dealing with sparse output might be the scope of another future SLEP.

Nit: Might it be worth mentioning it in the Discussion section?

@lesteve lesteve left a comment
Member

+1

I pushed a commit with a typo fix: StandardScalar -> StandardScaler

@thomasjpfan
Member Author

I am also +1 on this SLEP. Including my vote, we have 12 in favor and 0 against, which means this enhancement proposal is accepted. Thank you everyone for making this possible!

@thomasjpfan thomasjpfan merged commit 23aced5 into scikit-learn:master Aug 19, 2022