
VOTE SLEP018 - Pandas Output for Transformers #72


Merged
merged 10 commits into from
Aug 19, 2022

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jul 17, 2022

This PR is for us to discuss and collect votes for SLEP018 - Pandas Output for Transformers. The current implementation is available at scikit-learn/scikit-learn#23734. Note that this vote is for the API and the implementation can be adjusted.

According to our governance model, the vote will be open for a month (till 17th August), and the motion is accepted if 2/3 of the cast votes are in favor.

@scikit-learn/core-devs

@adrinjalali
Member

I'd say most users don't have a separate pipeline for their transform steps and another pipeline for adding the final predictor. What would a usual pipeline look like? Should users do something like pipeline[:-1].set_output(...), or can they call set_output on the whole pipeline and expect it to apply only to steps with an available transform method?

Otherwise I'm happy with the SLEP.

@thomasjpfan
Member Author

can they call set_output on the pipeline and expect that to apply only to steps with an available transform method?

This is the behavior I implemented and was going for. pipeline.set_output(transform="pandas") will only configure steps that can transform.
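The dispatch described here can be sketched in plain Python. The class and dummy estimators below (`MiniPipeline`, `DummyScaler`, `DummyClassifier`) are hypothetical stand-ins, not scikit-learn code; they only illustrate the rule that configuration is forwarded to transform-capable steps and skips the final predictor:

```python
class MiniPipeline:
    """Hypothetical Pipeline-like container illustrating the dispatch."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def set_output(self, *, transform=None):
        # Forward the configuration only to steps that can transform;
        # a final predictor without `transform` is left untouched.
        for _, est in self.steps:
            if hasattr(est, "transform") and hasattr(est, "set_output"):
                est.set_output(transform=transform)
        return self


class DummyScaler:
    output = "default"

    def transform(self, X):
        return X

    def set_output(self, *, transform=None):
        self.output = transform
        return self


class DummyClassifier:
    output = "default"  # no transform method, so never reconfigured

    def predict(self, X):
        return [0 for _ in X]


scaler, clf = DummyScaler(), DummyClassifier()
pipe = MiniPipeline([("scale", scaler), ("clf", clf)])
pipe.set_output(transform="pandas")
print(scaler.output, clf.output)  # → pandas default
```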

@adrinjalali
Member

Then the example in the SLEP could also mirror that to make it clear. But it's a +1 for me anyway :)

@thomasjpfan
Member Author

In this SLEP, I updated the pipeline example to include a classifier, showcasing how set_output can be called on the whole pipeline while only the transformers are configured.

@amueller
Member

+1

@jnothman
Member

So can I clarify that Pipeline is exceptional in the sense that it is the only non-transformer that has a set_output method (and that it only affects the output of the Pipeline if either it is also a transformer, or some pipeline components behave differently with different input)?

(Do we have other non-transformers that have a transformer for a parameter, aside from TransformedTargetRegressor?)

@thomasjpfan
Member Author

Do we have other non-transformers that have a transformer for a parameter, aside from TransformedTargetRegressor?

GridSearchCV can define a transform method if the underlying estimator is a transformer.
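This delegation pattern can be illustrated with a small sketch. `MiniSearch` below is a hypothetical wrapper, not the real GridSearchCV implementation; it shows how a meta-estimator can expose `transform` exactly when its inner estimator is a transformer:

```python
class MiniSearch:
    """Hypothetical search wrapper that exposes ``transform`` only when
    the wrapped estimator is itself a transformer."""

    def __init__(self, estimator):
        self.best_estimator_ = estimator

    def __getattr__(self, name):
        # Delegate `transform` (and nothing else) to the inner estimator,
        # so the wrapper "is" a transformer exactly when the inner one is.
        if name == "transform" and hasattr(self.best_estimator_, "transform"):
            return self.best_estimator_.transform
        raise AttributeError(name)


class Scaler:
    def transform(self, X):
        return [[v * 2 for v in row] for row in X]


search = MiniSearch(Scaler())
print(search.transform([[1, 2]]))  # → [[2, 4]]
print(hasattr(MiniSearch(object()), "transform"))  # → False
```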

Thinking about it more, the special case for Pipeline can influence other meta-estimators. For example, a VotingClassifier with many pipelines:

voting = VotingClassifier([
    ("pipe1", pipe1), ("pipe2", pipe2), ("pipe3", pipe3)
])

# If `VotingClassifier` defines a `set_output`, the whole ensemble can be configured with:
voting.set_output(transform="pandas")

# If not, then every pipeline needs to be set individually:
voting2 = VotingClassifier([
    ("pipe1", pipe1.set_output(transform="pandas")), ...
])

For a better UX, I think all first-party meta-estimators should define set_output and configure all their inner estimators. Specifically, for a meta-estimator:

  1. If the inner estimator has a set_output, call it (the case of Pipeline).
  2. If the inner estimator is a transformer, then call set_output on it.
  3. Otherwise do nothing.
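The three rules above can be sketched as a small helper. `configure_inner` and the `_output_config` attribute are hypothetical names for illustration, not scikit-learn API:

```python
def configure_inner(est, *, transform):
    """Hypothetical helper applying the three dispatch rules to one
    inner estimator of a meta-estimator."""
    if hasattr(est, "set_output"):
        # Rule 1: the inner estimator knows how to configure itself
        # (e.g. a Pipeline).
        est.set_output(transform=transform)
    elif hasattr(est, "transform"):
        # Rule 2: a transformer without set_output -- record the request
        # on a (hypothetical) attribute so transform could honor it.
        est._output_config = transform
    # Rule 3: plain predictors are left untouched.
    return est


class WithSetOutput:
    def set_output(self, *, transform=None):
        self.cfg = transform
        return self


class PlainTransformer:
    def transform(self, X):
        return X


class Predictor:
    def predict(self, X):
        return X


a = configure_inner(WithSetOutput(), transform="pandas")
b = configure_inner(PlainTransformer(), transform="pandas")
c = configure_inner(Predictor(), transform="pandas")
print(a.cfg, b._output_config, hasattr(c, "_output_config"))  # → pandas pandas False
```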

@ogrisel
Member

ogrisel commented Jul 20, 2022

+1.

@glemaitre glemaitre left a comment
Member

+1

@amueller
Copy link
Member

amueller commented Aug 5, 2022

ping @GaelVaroquaux ;)

@GaelVaroquaux GaelVaroquaux left a comment
Member

Thanks for the thoughtful SLEP, the discussions and the prototype, @thomasjpfan.

Thanks for the ping, @amueller :)

@jjerphan jjerphan left a comment
Member

+1

Thank you for the heads-up, @thomasjpfan.

Here are some minor nitpicks.

Comment on lines 37 to 38
``ValueError`` if ``set_output(transform="pandas")``. Dealing with sparse output
might be the scope of another future SLEP.

Dealing with sparse output might be the scope of another future SLEP.

Nit: Might it be worth mentioning it in the Discussion section?

@lesteve lesteve left a comment
Member

+1

I pushed a commit with a typo fix: StandardScalar -> StandardScaler

@thomasjpfan
Member Author

I am also +1 on this SLEP. Including my vote, we have 12 in favor and 0 against, which means this enhancement proposal is accepted. Thank you everyone for making this possible!

@thomasjpfan thomasjpfan merged commit 23aced5 into scikit-learn:master Aug 19, 2022