Skip to content

POC: New UDF methods option #43678

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Sep 21, 2021

Start of #41112

  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

The goal is to provide an implementation for apply on lists. See #41112 (comment) for why this necessitates a change to agg behavior for lists. It is possible to break this up into several PRs, but wanted to give a view as to what it would appear in totality.

Attached is a notebook demonstrating the differences in behavior (github doesn't allow attaching the notebook directly).

udfs.pdf

{"A": [12, 12, 12], "B": [27, 27, 27]}, index=["sum", "<lambda>", "<lambda>"]
)

result = pdf.apply([np.square, lambda x: x, lambda x: x])
Copy link
Member Author

@rhshadrach rhshadrach Sep 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed from sum to square because lambda x: x.sum() would fail for apply; this is now consistent with

pdf.apply(lambda x: x.sum())

which currently fails master.

}
)
if get_option("new_udf_methods"):
result = string_series.agg(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed from apply to agg as now apply with a dictionary of ops will act the same way as apply with a single op - acting on each row of the Series individually.

Comment on lines +823 to +828
if how == "apply":
expected = DataFrame({name: string_series for name, op in zip(names, ops)})
else:
expected = Series(
{name: op(string_series) for name, op in zip(names, ops)}, name="series"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Series.apply with a list of ops now acts in the same way as Series.apply with a single op, applying the UDF to each row of the Series. This is a no-op for sum/mean.

Comment on lines +1911 to +1917
if get_option("new_udf_methods"):
expected = Series([1, 1, 1, 3], index=index)
expected.index.name = None
tm.assert_series_equal(table, expected)
else:
expected = DataFrame(index=index)
tm.assert_frame_equal(table, expected)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the bottom of the PDF attached in the OP for some more details on this.

@rhshadrach rhshadrach added API Design Apply Apply, Aggregate, Transform, Map labels Sep 21, 2021
@rhshadrach rhshadrach closed this Sep 21, 2021
@rhshadrach rhshadrach reopened this Sep 21, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to split this :-> i see your comment

@@ -98,6 +102,7 @@ def frame_apply(
result_type=result_type,
args=args,
kwargs=kwargs,
renamer=renamer,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would move before args

for a in arg:
name = None
try:
if isinstance(a, (tuple, list)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob worth having a function to compute a single arg (and then just call in the list)

@rhshadrach rhshadrach marked this pull request as draft September 23, 2021 02:39
@rhshadrach rhshadrach mentioned this pull request Sep 24, 2021
2 tasks
@@ -408,6 +421,70 @@ def agg_list_like(self) -> DataFrame | Series:
)
return concatenated.reindex(full_ordered_index, copy=False)

def new_list_like(self, method: str) -> DataFrame | Series:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be something other than new? i.e. a name that will still be helpful/accurate in 12 months

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - splitting this up, some discussion in the first part: #43736. "future" has been suggested as a replacement for "new".

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 26, 2021
@rhshadrach rhshadrach closed this Oct 26, 2021
@rhshadrach rhshadrach deleted the udfs_poc branch September 10, 2022 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Apply Apply, Aggregate, Transform, Map Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants