-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ASV: benchmark for DataFrame.Update #58228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ASV: benchmark for DataFrame.Update #58228
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! This looks good, except the runtime for the benchmark is a bit low (can make it hard to differentiate regressions vs noise). Can you increase the parameters to get runtime up to ~10ms.
@rhshadrach Think I am done here. Data was increased twofold, from
It would be interesting to have some guidance on the best data size for a useful test, but that would be a whole work on itself. EDIT: Table |
rng = np.random.default_rng() | ||
self.df = DataFrame(rng.uniform(size=(100_000, 10))) | ||
|
||
idx = rng.choice(range(100_000), size=100_000, replace=False) | ||
self.df_random = DataFrame(self.df, index=idx) | ||
|
||
idx = rng.choice(range(100_000), size=10_000, replace=False) | ||
cols = rng.choice(range(10), size=2, replace=False) | ||
self.df_sample = DataFrame( | ||
rng.uniform(size=(10_000, 2)), index=idx, columns=cols | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rng = np.random.default_rng() | |
self.df = DataFrame(rng.uniform(size=(100_000, 10))) | |
idx = rng.choice(range(100_000), size=100_000, replace=False) | |
self.df_random = DataFrame(self.df, index=idx) | |
idx = rng.choice(range(100_000), size=10_000, replace=False) | |
cols = rng.choice(range(10), size=2, replace=False) | |
self.df_sample = DataFrame( | |
rng.uniform(size=(10_000, 2)), index=idx, columns=cols | |
) | |
rng = np.random.default_rng() | |
self.df = DataFrame(rng.uniform(size=(1_000_000, 10))) | |
idx = rng.choice(range(1_000_000), size=1_000_000, replace=False) | |
self.df_random = DataFrame(self.df, index=idx) | |
idx = rng.choice(range(1_000_000), size=100_000, replace=False) | |
cols = rng.choice(range(10), size=2, replace=False) | |
self.df_sample = DataFrame( | |
rng.uniform(size=(100_000, 2)), index=idx, columns=cols | |
) |
Can you go one more order of magnitude larger. You increased by two orders of magnitude and saw very little change in runtime - meaning that you were mostly measuring overhead. With the above changes, I'm seeing:
%timeit df.update(df_sample)
%timeit df_random.update(df_sample)
%timeit df_sample.update(df)
33.9 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.3 ms ± 744 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
19.4 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
which is more in the realm of what we want for ASVs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Just for consistency, the benchmarks comparing the last version of DataFrame.Update() and the current one. At some point there is some performance crossover, for a smaller dataframe, it seems to be a win-win. But for larger, there is performance decrease in some cases.
Change | Before [4fe49b1] | After [93ea131] <fix_gh55509_v2~14> | Ratio | Benchmark (Parameter) |
---|---|---|---|---|
+ | 40.4±1ms | 50.7±0.9ms | 1.25 | frame_methods.Update.time_to_update_big_frame_small_arg |
- | 68.1±0.8ms | 35.9±0.9ms | 0.53 | frame_methods.Update.time_to_update_random_indices |
- | 26.4±0.2ms | 5.98±0.07ms | 0.23 | frame_methods.Update.time_to_update_small_frame_big_arg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @aureliobarbosa! |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.@rhshadrach and @datapythonista
Benchmarks for the DataFrame.Update method are implemented, as suggested on PR #57637 by @datapythonista and based on the manual tests done by @rhshadrach on #55634. By running ASV locally we measure the change in the performance due to the fix on #57637, as indicated below:
Regards