Skip to content

DEPR: deprecate element-wise operations in (Series|DataFrame).transform #54906

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
topper-123 opened this issue Aug 31, 2023 · 12 comments
Open
Labels
Apply Apply, Aggregate, Transform, Map Deprecate Functionality to remove in pandas Transformations e.g. cumsum, diff, rank

Comments

@topper-123
Copy link
Contributor

topper-123 commented Aug 31, 2023

A discussion has been going on in #54747 (PDEP 13) about making Series.transform and DataFrame.transform always operate on Series. See #54747 (comment) and related comments. Opening a separate issue to separate that discussion from PDEP 13/#54747.

Currently, Series.transform tries to operates on series element and if that fails it tries operating on the series. So it uses a fallback mechanism, which makes it difficult to use + the first choice (element-wise operations) is very slow. DataFrame.transform operates on series (i.e. columns/rows) when given callables, but operates on elements, when given lists or dicts of callables, which is inconsistent. Examples:

>>> df = pd.DataFrame({"x":range(100_000)})
>>> %timeit df["x"].transform(lambda x: x + 1) # operates on elements, slow
15.5 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df['x'].transform(np.sin) # ufunc, fast
784 µs ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit df['x'].transform(lambda x: np.sin(x)). # non-ufunc, operates on elements, slow
86.6 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.transform(lambda x: x + 1)  # operates on the columns/series, fast
142 µs ± 589 ns per loop
>>> %timeit df.transform([lambda x: x + 1])  # lists/dicts operate on the elements, slow
16.7 ms ± 165 µs per loop

All in all, the above is very inconsistent and difficult to reason about for users, similarly to the discussion regarding apply in #54747/PDEP 13.

I propose to deprecate element-wise operations in (Series|DataFrame).transform, so in Pandas v3.0 giving callables (and lists/dicts of callables) to (Series|DataFrame).transform always operates on series. The benefit of this is that the (Series|DataFrame).transform method will become much more predictable and faster. When users want to do element-wise operations, they should be directed to use (Series|DataFrame).map. So no functionality is lost, but we get clearer separation between series-wise and element-wise operations.

The deprecation is proposed implemented in pandas v2.2, where we add a new keyword parameter series_ops_only to (Series|DataFrame).transform. When set to true, (Series|DataFrame).transform will always operate on the whole series. When False, the old behavior will be kept, and a deprecation warning will be emitted. In pandas v3.0, the old behavior will be removed and (Series|DataFrame).transform will only operate on series.

Related issues:

@rhshadrach
Copy link
Member

rhshadrach commented Aug 31, 2023

+1. Just noting that DataFrameGroupBy.transform is not a concern here when it comes to API consistency. Users shouldn't be using groupby at all if the transform acts element-wise, and so only vectored transforms should be used in groupby as well.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 4, 2023

I think that what you are proposing is the following:

  • When s is a Series, s.transform(func) will be equivalent to func(s)
  • When df is a DataFrame, df.transform(func) is equivalent to pd.DataFrame({c: func(df[c]) for c in df.columns})

But what happens with df.transform(func, axis=1)? This has never been clear to me, but I think what you're proposing is

  • When df is a DataFrame, df.transform(func, axis=1) is equivalent to pd.DataFrame.from_rows([func(r) for _, r in df.iterrows()], index=df.index) (Note - I may have the use if iterrows() wrong here)

@topper-123
Copy link
Contributor Author

It will be like that, though if you go through the code there are some shape checks etc.

For axis= 1, it does a double transform, see Apply.transform. This proposal does not change that behavior.

@topper-123
Copy link
Contributor Author

and to add, The df.transform(func) case (single callable) already does ops on column series, and that will not change from this. Likewise, df.transform(func, axis=1) will not be affected from this.

list/dicts of callables however currently operate on the elements and the proposal is that they will operate on the column series, like the single callable case.

@topper-123
Copy link
Contributor Author

topper-123 commented Sep 20, 2023

@MarcoGorelli any comments? I think you were positive in #54747 (comment).

[Fixed link - rhshadrach]

@MarcoGorelli
Copy link
Member

the idea looks good

regarding series_ops_only - what will the default be?
say I'm currently using transform to operate on the whole series - will I get any warning, or will I need to pass series_ops_only=True?

@topper-123
Copy link
Contributor Author

I think it should have signature series_ops_only : bool | NoDefault = no_default, and work like normal for a deprecation? I.e. in the 2.x cycle everything works unchanged but gets a FutureWarning by default and if the user sets series_ops_only = True we guarantee series-wise operations. If the user sets series_ops_only = False they get the current behavior without DeprecationWarning.

@MarcoGorelli
Copy link
Member

this could be a bit noisy - is the code implemented using a fallback, such as

try:
    transform_series_wise(...)
except:
    transform_element_wise(...)

?

If so, could the deprecation warning only go in the except block, and there'd be no need for a series_ops_only arg?

@topper-123
Copy link
Contributor Author

topper-123 commented Oct 1, 2023

The is that those give the same results in some cases and not in others, e.g.:

  • elementwise: ser.map(lambda x: x + 1) (like ser.transform now)
  • series-wise ser.pipe(lambda x: x + 1). (like ser.transform proposed for v.30)

give the same results, while

  • elementwise: ser.map(lambda x: type(x)) (like ser.transform now)
  • series-wise ser.pipe(lambda x: type(x)). (like ser.transform proposed for v.30)

would not give the same results.

We could, if the element-wise operations works, try the series-wise operation also and if the results compare equal, not emit the error message. That would cut down a lot of the error messages, but have some cost performance-wise. Note that the series-wise operation and the comparisons will be much faster than the element-wise operation, so perf. cost may not be too bad (and will be temporary for the v2.x cycle).

@MarcoGorelli
Copy link
Member

ah so the element-wise one is tried first? (sorry, really not familiar with this part of the code base)

in that case, +1 to the series_ops_only suggestion

@topper-123
Copy link
Contributor Author

Yeah, the element-wise one is tried first, unfortunately.

@rhshadrach
Copy link
Member

For consistency, does it make sense to use by_rows here too? We've already added that to apply.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Deprecate Functionality to remove in pandas Transformations e.g. cumsum, diff, rank
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants