Skip to content

ASV: benchmark for DataFrame.Update #58228

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 15, 2024

Conversation

aureliobarbosa
Copy link
Contributor

@aureliobarbosa aureliobarbosa commented Apr 12, 2024

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@rhshadrach and @datapythonista

Benchmarks for the DataFrame.Update method are implemented, as suggested on PR #57637 by @datapythonista and based on the manual tests done by @rhshadrach on #55634. By running ASV locally we measure the change in the performance due to the fix on #57637, as indicated below:

Change Main (2.2.1) [8da8b54] PR (3.0) [93ea131] Ratio Benchmark (Parameter)
- 1.13±0.02ms 887±10μs 0.78 frame_methods.Update.time_to_update_random_indices
- 1.19±0.02ms 840±9μs 0.71 frame_methods.Update.time_to_update_small_frame_big_arg
- 1.27±0.01ms 859±9μs 0.67 frame_methods.Update.time_to_update_big_frame_small_arg

Regards

@aureliobarbosa aureliobarbosa marked this pull request as draft April 12, 2024 00:28
@aureliobarbosa aureliobarbosa changed the title A frame method update ASV: benchmark for DataFrame.Update Apr 12, 2024
@aureliobarbosa aureliobarbosa marked this pull request as ready for review April 12, 2024 00:49
@mroeschke mroeschke added the Benchmark Performance (ASV) benchmarks label Apr 12, 2024
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! This looks good, except the runtime for the benchmark is a bit low (can make it hard to differentiate regressions vs noise). Can you increase the parameters to get runtime up to ~10ms.

@aureliobarbosa
Copy link
Contributor Author

aureliobarbosa commented Apr 13, 2024

Can you increase the parameters to get runtime up to ~10ms.

@rhshadrach Think I am done here.

Data was increased twofold, from $10^4$ to $10^6$ points. Each test is now taking ~4ms. Didn't check the scaling, but it seems to be enough because the data size is similar to other benchmarks in the frame_methods.py. If needed, I can increase the data again.

Change Before [4fe49b1] After [93ea131] <fix_gh55509_v2~14> Ratio Benchmark (Parameter)
- 4.47±0.01ms 3.79±0.1ms 0.85 frame_methods.Update.time_to_update_big_frame_small_arg
- 4.31±0.03ms 3.06±0.03ms 0.71 frame_methods.Update.time_to_update_random_indices
- 2.94±0.02ms 1.10±0.01ms 0.37 frame_methods.Update.time_to_update_small_frame_big_arg

It would be interesting to have some guidance on the best data size for a useful test, but that would be a whole work on itself.

EDIT: Table

Comment on lines 867 to 877
rng = np.random.default_rng()
self.df = DataFrame(rng.uniform(size=(100_000, 10)))

idx = rng.choice(range(100_000), size=100_000, replace=False)
self.df_random = DataFrame(self.df, index=idx)

idx = rng.choice(range(100_000), size=10_000, replace=False)
cols = rng.choice(range(10), size=2, replace=False)
self.df_sample = DataFrame(
rng.uniform(size=(10_000, 2)), index=idx, columns=cols
)
Copy link
Member

@rhshadrach rhshadrach Apr 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rng = np.random.default_rng()
self.df = DataFrame(rng.uniform(size=(100_000, 10)))
idx = rng.choice(range(100_000), size=100_000, replace=False)
self.df_random = DataFrame(self.df, index=idx)
idx = rng.choice(range(100_000), size=10_000, replace=False)
cols = rng.choice(range(10), size=2, replace=False)
self.df_sample = DataFrame(
rng.uniform(size=(10_000, 2)), index=idx, columns=cols
)
rng = np.random.default_rng()
self.df = DataFrame(rng.uniform(size=(1_000_000, 10)))
idx = rng.choice(range(1_000_000), size=1_000_000, replace=False)
self.df_random = DataFrame(self.df, index=idx)
idx = rng.choice(range(1_000_000), size=100_000, replace=False)
cols = rng.choice(range(10), size=2, replace=False)
self.df_sample = DataFrame(
rng.uniform(size=(100_000, 2)), index=idx, columns=cols
)

Can you go one more order of magnitude larger. You increased by two orders of magnitude and saw very little change in runtime - meaning that you were mostly measuring overhead. With the above changes, I'm seeing:

%timeit df.update(df_sample)
%timeit df_random.update(df_sample)
%timeit df_sample.update(df)
33.9 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.3 ms ± 744 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
19.4 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

which is more in the realm of what we want for ASVs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Just for consistency, the benchmarks comparing the last version of DataFrame.Update() and the current one. At some point there is some performance crossover, for a smaller dataframe, it seems to be a win-win. But for larger, there is performance decrease in some cases.

Change Before [4fe49b1] After [93ea131] <fix_gh55509_v2~14> Ratio Benchmark (Parameter)
+ 40.4±1ms 50.7±0.9ms 1.25 frame_methods.Update.time_to_update_big_frame_small_arg
- 68.1±0.8ms 35.9±0.9ms 0.53 frame_methods.Update.time_to_update_random_indices
- 26.4±0.2ms 5.98±0.07ms 0.23 frame_methods.Update.time_to_update_small_frame_big_arg

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach merged commit 888b6bc into pandas-dev:main Apr 15, 2024
45 of 46 checks passed
@rhshadrach
Copy link
Member

Thanks @aureliobarbosa!

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Benchmark Performance (ASV) benchmarks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants