BUG: Fix dataframe.update not updating dtype #55509 v2 #57637

Merged
merged 15 commits into pandas-dev:main from aureliobarbosa:fix_gh55509_v2 on Mar 2, 2024

Conversation

aureliobarbosa
Contributor

@aureliobarbosa aureliobarbosa commented Feb 27, 2024

A new PR was opened to replace the previous one (#55634) because a more idiomatic solution was found. The previous PR will be converted to draft.

@rhshadrach @mroeschke
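
For context, a minimal reproduction of the dtype change reported in GH#55509 (a sketch; exact behavior depends on the pandas version):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})           # int64 column
other = pd.DataFrame({"a": [10]}, index=[0])  # update a single row
df.update(other)
print(df["a"].dtype)  # float64 on affected versions; int64 is expected
```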

@aureliobarbosa aureliobarbosa changed the title Fix gh55509 v2 Fix #GH55509 v2 Feb 27, 2024
@aureliobarbosa aureliobarbosa changed the title Fix #GH55509 v2 BUG: Fix dataframe.update not updating dtype#GH55509 v2 Feb 27, 2024
@aureliobarbosa aureliobarbosa changed the title BUG: Fix dataframe.update not updating dtype#GH55509 v2 BUG: Fix dataframe.update not updating dtype #55509 v2 Feb 27, 2024
@datapythonista datapythonista added the Bug and Dtype Conversions (Unexpected or buggy dtype conversions) labels Feb 27, 2024
@datapythonista
Member

Thanks for the contribution @aureliobarbosa. Can you add a note to the release notes please? You can check other PRs if you are not familiar with it.

@aureliobarbosa
Contributor Author

aureliobarbosa commented Feb 27, 2024

Sure. For which version? 3.0? @datapythonista

@datapythonista
Member

> Sure. For which version? 3.0? @datapythonista

Yes please

@aureliobarbosa
Contributor Author

Done @datapythonista

@rhshadrach
Member

Benchmarks from #55634 (comment)

size = 100_000
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}, index=range(1, size+1))
df2 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 12, 11])
%timeit df1.update(df2)
# 1.11 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- PR
# 2.32 ms ± 36.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    <-- main

size = 100_000
index = list(range(1, size+1))
index[1], index[0] = index[0], index[1]
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}, index=index)
df2 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 12, 11])
%timeit df1.update(df2)
# 1.05 ms ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- PR
# 1.76 ms ± 3.73 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- main

size = 100_000
df1 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 11, 12])
df2 = pd.DataFrame({'a': np.random.randint(0, 100, size)})
%timeit df1.update(df2)
# 517 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- PR
# 318 µs ± 4.85 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- main

size = 10
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)})
df2 = pd.DataFrame({'a': [-1, -2, -3]}, index=[4, 4, 6])
df1.update(df2)
# ValueError: Update not allowed with duplicate indexes on other.  <-- PR
# ValueError: cannot reindex on an axis with duplicate labels      <-- main

Member

@rhshadrach rhshadrach left a comment

lgtm - one request

@@ -8764,11 +8764,22 @@ def update(
        if not isinstance(other, DataFrame):
            other = DataFrame(other)

        other = other.reindex(self.index)
        if other.index.has_duplicates:
            raise ValueError("Update not allowed with duplicate indexes on other.")
Member

Can you add a note to the docstring that this is not supported? There isn't currently a Notes section; something like this:

pandas/pandas/core/frame.py

Lines 1450 to 1454 in 737d390

        Notes
        -----
        1. Because ``iterrows`` returns a Series for each row,
           it does **not** preserve dtypes across the rows (dtypes are
           preserved across columns for DataFrames).

Contributor Author

@rhshadrach Done.

DataFrame.reindex raises on the same condition and does not mention it in its docs. What about adding a similar note there?

Member

Slight preference for keeping this PR to just modifying update, but no strong objection to modifying the reindex docs.

Contributor Author

Ok. It can be done in another PR.

@rhshadrach
Member

/preview

Contributor

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/57637/

Member

@rhshadrach rhshadrach left a comment

lgtm

Member

@datapythonista datapythonista left a comment

Looks great, thanks for working on this. I added a couple of comments.

Also, I noticed that we don't seem to have an automated benchmark to check whether DataFrame.update takes longer after this change. I don't think there should be a big impact, but maybe you can check locally how a sample case performs before and after this change (e.g. %timeit df.update(other_df)). Also, you are very welcome to add a benchmark (if you do, maybe you can open another PR and we can merge it first, so we can see if there is a regression here).

Thanks!

@@ -8692,6 +8692,10 @@ def update(
        dict.update : Similar method for dictionaries.
        DataFrame.merge : For column(s)-on-column(s) operations.

        Notes
        --------
Member

I think having the -------- underline longer than the title used to make the CI crash. Not sure why the CI is happy now, but it may be worth making Notes and its underline the same length in case it fails again in the future.

Contributor Author

Now they have the same length.
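
i.e., the underline matching the heading length, as NumPy-style docstring sections expect:

```
Notes
-----
```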

        rows = other.index.intersection(self.index)
        if rows.empty:
            raise ValueError(
                "Can't update dataframe when other has no index in common with "
Member

No big deal, but in the previous error message you say "update not allowed" and here "can't update dataframe" to mean the same thing. Not sure how we usually write error messages, but I think it'd be better to be consistent.

Contributor Author

Agree with keeping "update not allowed", since it conveys that the operation is possible but not allowed. Done.

    )
    def test_update_preserve_dtype(self, value_df, value_other, dtype):
        # GH#55509
        df = DataFrame({"a": [value_df] * 2}, index=[1, 2])
Member

You don't seem to use the dtype parameter in any fixture. Did you forget to specify it in the constructors, or am I missing something?

Contributor Author

@aureliobarbosa aureliobarbosa Feb 29, 2024

Thanks for the review and suggestions @datapythonista .

I will address all points as soon as possible (hope to finish by tomorrow).

Regarding the tests, it's interesting that the dtype is not there! I will take a more detailed look at them.

Member

My question is: why do you have a parameter for the dtype in the pytest parametrization if it's not being used? I had the impression that you were planning to set the dtype explicitly when creating the dataframe, but maybe you forgot. If it's not needed, maybe it's better to remove it from the pytest parameters?

Contributor Author

@aureliobarbosa aureliobarbosa Mar 1, 2024

@datapythonista
You are absolutely right!

My tests were originally based on the tests proposed by @MichaelTiemannOSC. I tried to make them simpler and similar to other tests in the same test set, but since they were failing on main and passing on this PR, I forgot about them.

This is addressed in the last two commits, which make two of the tests I proposed stricter. Nothing changes, but those tests are better now! Thanks for the guidance on this point.

EDIT: I worked on this first because it looked potentially more problematic. The other points will be addressed as soon as possible.

Contributor Author

To complement: I think it is better to state the dtypes explicitly in those cases. That way it is possible to check dtype consistency for native Python ints, NumPy dtypes, and extension arrays, as intended by the parametrization.
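
A sketch of what the parametrized test might look like with the dtypes stated explicitly (parameter values are illustrative; the merged test may differ):

```python
import numpy as np
import pytest

import pandas._testing as tm
from pandas import DataFrame


@pytest.mark.parametrize(
    "value_df, value_other, dtype",
    [
        (1, 2, "int64"),                      # native Python int
        (1.5, 2.5, "float64"),                # native Python float
        (np.uint8(1), np.uint8(2), "uint8"),  # NumPy dtype
        (1, 2, "Int64"),                      # nullable extension dtype
    ],
)
def test_update_preserve_dtype(value_df, value_other, dtype):
    # GH#55509 - updating a value must not change the column dtype
    df = DataFrame({"a": [value_df] * 2}, index=[1, 2], dtype=dtype)
    other = DataFrame({"a": [value_other]}, index=[1], dtype=dtype)
    expected = DataFrame({"a": [value_other, value_df]}, index=[1, 2], dtype=dtype)
    df.update(other)
    tm.assert_frame_equal(df, expected)
```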

@aureliobarbosa
Contributor Author

@datapythonista I am looking into ASV right now. Which kind of benchmark would be more useful: time, memory, or both? Can you point me to a well-designed pandas benchmark that I can use as a reference to get started?

Independently of this, I will run manual benchmarks locally and post them here for reference.

Member

@datapythonista datapythonista left a comment

Do you think it's worth adding a test case for when the indices have no intersection?

Also, no big deal, but not sure if we can find a better name for the variable rows. Maybe mask or index_intersection is better.

In any case, this looks good to me.

@datapythonista
Member

I think all our benchmarks should be reasonable. Understanding and running asv may be a bit tricky, but writing the benchmark should be trivial: just create some sample data and call DataFrame.update with it. I think checking time is probably good enough. The main complexity, in my opinion, is using enough data so the timing is not super noisy, but not so much that the benchmark is slow.

Also, we use benchmarks more for the algorithms that we implement ourselves (like in C) than to check the performance of everything. I think in this case it can be useful, but use your own judgement to decide whether it's useful or not.
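
Following up on this, a minimal sketch of what such an asv time benchmark could look like (class name and file placement are illustrative, not the benchmark that was eventually added):

```python
# illustrative asv benchmark (e.g. in asv_bench/benchmarks/frame_methods.py)
import numpy as np

import pandas as pd


class FrameUpdate:
    def setup(self):
        size = 100_000
        self.df = pd.DataFrame(
            {"a": np.random.randint(0, 100, size)}, index=range(size)
        )
        self.other = pd.DataFrame(
            {"a": np.random.randint(0, 100, 3)}, index=[10, 12, 11]
        )

    def time_frame_update(self):
        # asv times this method; update() mutates self.df in place,
        # which is fine here because the replaced values stay int
        self.df.update(self.other)
```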

@aureliobarbosa
Contributor Author

@datapythonista

> Do you think it's worth adding a test case for when the indices have no intersection?

I agree, because this behavior was introduced in this PR. I added a test for this case and changed the raised message slightly to emphasize 'no intersection'.

> not sure if we can find a better name for the variable rows. Maybe mask or index_intersection is better.

The first version of this PR used indexes_intersection, but I then changed it to rows to simplify the notation. Now that you have raised this point, I noticed that a simple search for rows in frame.py returns more than one hundred matches, so I agree with avoiding that name.

@aureliobarbosa
Contributor Author

aureliobarbosa commented Mar 2, 2024

I re-ran the first three manual benchmarks proposed previously by @rhshadrach. Results are summarized below in the same order.

| Data                  | main    | PR      |
|-----------------------|---------|---------|
| monotonic             | 1.8 ms  | 1.29 ms |
| non-monotonic         | 1.83 ms | 1.21 ms |
| small-frame-large-arg | 384 µs  | 595 µs  |

The performance gains are not as large as originally expected, but they are still relevant (28%–34% on large frames with small arguments!). @datapythonista Note that when the frame is small and the argument is large there is a performance loss, but as @rhshadrach pointed out previously, this looks like an edge case.

@aureliobarbosa
Contributor Author

I hope to be done here.

@datapythonista Next thing I will do is to try to implement an ASV benchmark for this function.

@rhshadrach I think the take-home message of this PR is that the function was manipulating data it wasn't expected to touch, and in this case the dtype change was a symptom of that problem.

If you both know or find any bug that could be related or similar to this one, I would be glad to contribute.

Regards
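
For reference, the core of the approach that emerged from this review, in simplified form (a sketch based on the excerpts above; it ignores update()'s overwrite/filter_func options and NaN handling, and is not the exact merged code):

```python
import pandas as pd


def update_sketch(df: pd.DataFrame, other: pd.DataFrame) -> None:
    if other.index.has_duplicates:
        raise ValueError("Update not allowed with duplicate indexes on other.")

    index_intersection = other.index.intersection(df.index)
    if index_intersection.empty:
        raise ValueError(
            "Update not allowed when the index on `other` has no intersection "
            "with this dataframe."
        )

    # Only rows both frames share are touched, so values (and dtypes)
    # outside the intersection are never modified.
    other = other.reindex(index_intersection)
    for col in df.columns.intersection(other.columns):
        df.loc[index_intersection, col] = other[col]
```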

Member

@datapythonista datapythonista left a comment

Excellent work @aureliobarbosa.

Do you want to have a last look @rhshadrach?

Member

@rhshadrach rhshadrach left a comment

lgtm

@rhshadrach rhshadrach added this to the 3.0 milestone Mar 2, 2024
@rhshadrach rhshadrach merged commit 8fde168 into pandas-dev:main Mar 2, 2024
45 of 47 checks passed
@rhshadrach
Member

Thanks @aureliobarbosa!

@aureliobarbosa aureliobarbosa deleted the fix_gh55509_v2 branch March 5, 2024 11:57
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Labels
Bug, Dtype Conversions (Unexpected or buggy dtype conversions)
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.update doesn't preserve dtypes
3 participants