-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix dataframe.update not updating dtype #55509 v2 #57637
Conversation
Thanks for the contribution @aureliobarbosa. Can you add a note to the release notes please? You can check other PRs if you are not familiar with it. |
Sure. For which version? 3.0? @datapythonista |
Yes please |
Done @datapythonista |
Benchmarks from #55634 (comment)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm - one request
@@ -8764,11 +8764,22 @@ def update( | |||
if not isinstance(other, DataFrame): | |||
other = DataFrame(other) | |||
|
|||
other = other.reindex(self.index) | |||
if other.index.has_duplicates: | |||
raise ValueError("Update not allowed with duplicate indexes on other.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a note that this is not supported to the docstring. There isn't currently a notes section, something like this:
Lines 1450 to 1454 in 737d390
Notes | |
----- | |
1. Because ``iterrows`` returns a Series for each row, | |
it does **not** preserve dtypes across the rows (dtypes are | |
preserved across columns for DataFrames). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach Done.
DataFrame.reindex
raises on the same condition and does not mention it on the docs. What about adding a similar note there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slight preference on keeping this PR just modifying update, but no strong objection to modifying the reindex docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. It can be done in another PR.
/preview |
Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/57637/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks for working on this. I added couple of comments.
Also, I was checking that we don't seem to have an automated benchmark to see if DataFrame.update
is taking longer after this change. I don't think there should be a big impact, but maybe you can check locally how a sample case performs before and after this change (e.g. %timeit df.update(other_df)
). Also, you are very welcome to add a benchmark (if you do it, maybe you can open another PR, and we can merge it first, so we can see if there is a regression here).
Thanks!
pandas/core/frame.py
Outdated
@@ -8692,6 +8692,10 @@ def update( | |||
dict.update : Similar method for dictionaries. | |||
DataFrame.merge : For column(s)-on-column(s) operations. | |||
|
|||
Notes | |||
-------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having the -------
longer than the title used to make the CI crash. Not sure why the CI is happy now, but maybe worth making Notes
and it the same length in case it fails again in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it has the same size.
pandas/core/frame.py
Outdated
rows = other.index.intersection(self.index) | ||
if rows.empty: | ||
raise ValueError( | ||
"Can't update dataframe when other has no index in common with " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No big deal, but in the previous error message you say "update not allowed" and here "can't update dataframe" to mean the same thing. Not sure how we usually write error messages, but I think it'd be better to be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with keeping "update not allowed" because it states that it is possible but not allowed. Done,
) | ||
def test_update_preserve_dtype(self, value_df, value_other, dtype): | ||
# GH#55509 | ||
df = DataFrame({"a": [value_df] * 2}, index=[1, 2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't seem to use the dtype
parameter in any fixture. Did you forget to specify it in the constructors, or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review and suggestions @datapythonista .
I will address all points as soon as possible (hope to finish by tomorrow).
Regarding the tests, it's an interesting that it is not there! I will take a more detailed look on them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question is, why do you have a parameter for the dtype
in the pytest parametrization, if it's not being used? I had the impression that you are planning to set the dtype explicitly when creating the dataframe, but maybe you forgot. If it's not needed, maybe better to remove from the pytest parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@datapythonista
You are absolutely right!
My tests were originally base on the tests proposed by @MichaelTiemannOSC. I tried to turn them simpler and similar to other tests in the same test set, but since they were failing on main and passing this PR I forgot about them.
This is addressed in the last two commits, which make two of the tests that I propose strictier. Nothing changes but those tests are better now! Thanks for the guidance on this point.
EDIT: I worked on this first because it looked more potentially problematic. As soon as possible the other points will be addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Complementing, I think it is better to let the dtypes
to be explicitly stated on those cases. The reason is that in this way it is possible to check dtype
consistency for native python int
, numpy, and Extended Arrays, as expected through the 'parametrization'.
@datapythonista I am looking into ASV right now. Which kind of benchmark would be more useful: time, memory or both? Can you point me to a properly designed pandas benchmark so that I can have a reference for an start on this? Independent of this, manual tests are going to be done locally and posted here for reference, in the following. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's worth adding a use case to test when the indices have no intersection?
Also, no big deal, but not sure if we can find a better name for the variable rows
. Maybe mask
or index_intersection
is better.
In any case, this looks good to me.
I think all our benchmarks should be reasonable. Understanding and running asv may be a bit tricky, but writing the benchmark should be trivial, just create some sample data and call Also, we use benchmarks more for the algorithms that we implement ourselves (like in C) that to check the performance of everything. I think in this case it can be useful, but use your own judgement to decide whether it's useful or not. |
I agree because this is a behavior that was introduced on this PR. Added the test for this case and changed the message raise slightly to emphasize 'no intersection'.
The first of this PR used |
I did benchmark again the first three manual code tests proposed previously by @rhshadrach. Results are summarized below in the same order.
The performance gain is not as good as originally expected, but they are still relevant (28%~34% on large frames with small arguments!). @datapythonista Note that when the frame is small and the argument is large there is a performance loss, but as pointed out previously by @rhshadrach this looks like an edge use case. |
Hope to be done here. @datapythonista Next thing I will do is to try to implement an ASV benchmark for this function. @rhshadrach I think that the take home message on this PR was that the function was manipulating data it wasn't expecting to touch on, and this case, the change in the dtype was the symptom for this problem. If you both know or find any bug that could be related or similar to this one, I would be glad to contribute. Regards |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work @aureliobarbosa.
Do you want to have a last look @rhshadrach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @aureliobarbosa! |
doc/source/whatsnew/v3.0.rst
. (?)A new PR was opened to replace the previous one (#55634) because a more idiomatic solution was found. The previous PR will be converted to draft.
@rhshadrach @mroeschke