ASV: benchmark for DataFrame.Update #58228


Merged 5 commits on Apr 15, 2024
Changes from 4 commits
24 changes: 24 additions & 0 deletions asv_bench/benchmarks/frame_methods.py
@@ -862,4 +862,28 @@ def time_last_valid_index(self, dtype):
self.df.last_valid_index()


class Update:
    def setup(self):
        rng = np.random.default_rng()
        self.df = DataFrame(rng.uniform(size=(100_000, 10)))

        idx = rng.choice(range(100_000), size=100_000, replace=False)
        self.df_random = DataFrame(self.df, index=idx)

        idx = rng.choice(range(100_000), size=10_000, replace=False)
        cols = rng.choice(range(10), size=2, replace=False)
        self.df_sample = DataFrame(
            rng.uniform(size=(10_000, 2)), index=idx, columns=cols
        )
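The df_random construction above relies on the DataFrame constructor reindexing when it is handed an existing frame plus an explicit index: since idx is a random permutation of the original labels, df_random holds the same values as df but in shuffled row order. A minimal sketch of that behavior (the tiny frame here is illustrative, not from the benchmark):

```python
import numpy as np
from pandas import DataFrame

df = DataFrame({"a": [10.0, 20.0, 30.0]})  # row labels 0, 1, 2

# Constructing from an existing frame with an explicit index reindexes by
# label, so each value follows its label into the new row order.
shuffled = DataFrame(df, index=[2, 0, 1])
print(shuffled["a"].tolist())  # [30.0, 10.0, 20.0]
```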
@rhshadrach (Member) commented on Apr 14, 2024:
Suggested change:
-        rng = np.random.default_rng()
-        self.df = DataFrame(rng.uniform(size=(100_000, 10)))
-        idx = rng.choice(range(100_000), size=100_000, replace=False)
-        self.df_random = DataFrame(self.df, index=idx)
-        idx = rng.choice(range(100_000), size=10_000, replace=False)
-        cols = rng.choice(range(10), size=2, replace=False)
-        self.df_sample = DataFrame(
-            rng.uniform(size=(10_000, 2)), index=idx, columns=cols
-        )
+        rng = np.random.default_rng()
+        self.df = DataFrame(rng.uniform(size=(1_000_000, 10)))
+        idx = rng.choice(range(1_000_000), size=1_000_000, replace=False)
+        self.df_random = DataFrame(self.df, index=idx)
+        idx = rng.choice(range(1_000_000), size=100_000, replace=False)
+        cols = rng.choice(range(10), size=2, replace=False)
+        self.df_sample = DataFrame(
+            rng.uniform(size=(100_000, 2)), index=idx, columns=cols
+        )

Can you go one more order of magnitude larger? You increased by two orders of magnitude and saw very little change in runtime, which means you were mostly measuring overhead. With the above changes, I'm seeing:

%timeit df.update(df_sample)
33.9 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df_random.update(df_sample)
46.3 ms ± 744 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df_sample.update(df)
19.4 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

which is more in the realm of what we want for ASVs.
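For readers who want to reproduce these numbers outside IPython, a self-contained sketch of the same setup using timeit (sizes follow the suggested change; absolute timings will vary by machine):

```python
import timeit

import numpy as np
from pandas import DataFrame

rng = np.random.default_rng()
df = DataFrame(rng.uniform(size=(1_000_000, 10)))

idx = rng.choice(range(1_000_000), size=1_000_000, replace=False)
df_random = DataFrame(df, index=idx)

idx = rng.choice(range(1_000_000), size=100_000, replace=False)
cols = rng.choice(range(10), size=2, replace=False)
df_sample = DataFrame(rng.uniform(size=(100_000, 2)), index=idx, columns=cols)

for label, stmt in [
    ("big frame, small arg", lambda: df.update(df_sample)),
    ("random indices", lambda: df_random.update(df_sample)),
    ("small frame, big arg", lambda: df_sample.update(df)),
]:
    # update() mutates in place, but repeated calls are fine for rough timing
    ms = timeit.timeit(stmt, number=5) / 5 * 1e3
    print(f"{label}: {ms:.1f} ms per call")
```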

Contributor Author:
Done.

For reference, here are the benchmarks comparing the previous version of DataFrame.update() and the current one. There is a performance crossover at some point: for smaller DataFrames the change is a win across the board, but for larger ones there is a performance decrease in some cases.

Change   Before [4fe49b1]   After [93ea131] <fix_gh55509_v2~14>   Ratio   Benchmark (Parameter)
+        40.4±1ms           50.7±0.9ms                            1.25    frame_methods.Update.time_to_update_big_frame_small_arg
-        68.1±0.8ms         35.9±0.9ms                            0.53    frame_methods.Update.time_to_update_random_indices
-        26.4±0.2ms         5.98±0.07ms                           0.23    frame_methods.Update.time_to_update_small_frame_big_arg
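For context on what these benchmarks exercise: DataFrame.update aligns its argument on both index and columns and overwrites only the matching, non-NA cells. A minimal illustration of those semantics (toy frames, not the benchmark data):

```python
import numpy as np
from pandas import DataFrame

df = DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
other = DataFrame({"a": [10.0, np.nan]}, index=[1, 2])

# Only row 1 of column "a" changes: row 2's value in `other` is NaN, so it
# is left alone, and column "b" is absent from `other` entirely.
df.update(other)
print(df["a"].tolist())  # [1.0, 10.0, 3.0]
print(df["b"].tolist())  # [4.0, 5.0, 6.0]
```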


    def time_to_update_big_frame_small_arg(self):
        self.df.update(self.df_sample)

    def time_to_update_random_indices(self):
        self.df_random.update(self.df_sample)

    def time_to_update_small_frame_big_arg(self):
        self.df_sample.update(self.df)


from .pandas_vb_common import setup # noqa: F401 isort:skip