-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: DataFrame.update bool dtype being converted to object #55509 #55634
Conversation
…ndas-dev#55589) * Update generic.py * Update generic.py * Update generic.py Fix for issue pandas-dev#55509 (updating intersection only) fix typo - black Added test
This is my first PR. For some reason that I still can't understand the commit that was immediately before the one I submitted got resubmitted. I imagine it can be something usual and easy to fix, but let me know if it is necessary that I fix something (either on GitHub or on my local repository). |
In short. To capture this bug I analyzed the "Unit Tests / Copy-on-Write 3.9 (pull_request)" and captured the single test giving the warning. After studying it it was possible to reproduce this test on my machine using the same pytest called by you on the GitHub CI with a proper environment setup for testing "copy-on-write". For the latter I tested using both the environment variable and In the pull request above the code is failing on Below I put the actual test that I think is wrong, and the new proposed test (for which my new code on update passes): File: def test_update_with_different_dtype(self, using_copy_on_write):
# GH#3217
df = DataFrame({"a": [1, 3], "b": [np.nan, 2]})
df["c"] = np.nan
if using_copy_on_write:
df.update({"c": Series(["foo"], index=[0])})
else:
with tm.assert_produces_warning(FutureWarning, match="incompatible dtype"):
df["c"].update(Series(["foo"], index=[0]))
expected = DataFrame({"a": [1, 3], "b": [np.nan, 2], "c": ["foo", np.nan]})
tm.assert_frame_equal(df, expected) Curren proposed test: def test_update_with_different_dtype(self, using_copy_on_write):
# GH#3217 #55509
df = DataFrame({"a": [1, 3], "b": [np.nan, 2]})
df["c"] = np.nan
with tm.assert_produces_warning(FutureWarning, match="incompatible dtype"):
if using_copy_on_write:
df.update({"c": Series(["foo"], index=[0])})
else:
df["c"].update(Series(["foo"], index=[0]))
expected = DataFrame({"a": [1, 3], "b": [np.nan, 2], "c": ["foo", np.nan]})
tm.assert_frame_equal(df, expected) Given this information I am not sure how to proceed:
Regards |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! I think we should add the test from the issue (#55509 (comment)).
pandas/core/frame.py
Outdated
@@ -8857,11 +8857,11 @@ def update( | |||
if not isinstance(other, DataFrame): | |||
other = DataFrame(other) | |||
|
|||
other = other.reindex(self.index) | |||
indexes_intersection = self.index.intersection(other.index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a monotonic index, I'm seeing a perf increase
size = 100_000
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}, index=range(1, size+1))
df2 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 12, 11])
%timeit df1.update(df2)
# 1.33 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- main
# 372 µs ± 663 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- PR
But a perf regression for nonmonotonic:
size = 100_000
index = list(range(1, size+1))
index[1], index[0] = index[0], index[1]
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}, index=index)
df2 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 12, 11])
%timeit df1.update(df2)
# 1.1 ms ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- main
# 2.52 ms ± 7.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <-- PR
And a perf regression when the caller's index is small and the argument's index is large:
size = 100_000
df1 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 11, 12])
df2 = pd.DataFrame({'a': np.random.randint(0, 100, size)})
%timeit df1.update(df2)
# 186 µs ± 997 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) <-- main
# 329 µs ± 595 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- PR
When there are duplicates and the index of the caller and argument are the same, both main and this PR succeed. When they are not the same, both fail but with different error messages
size = 10
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)})
df2 = pd.DataFrame({'a': [-1, -2, -3]}, index=[4, 4, 6])
df1.update(df2)
# ValueError: cannot reindex on an axis with duplicate labels <-- main
# ValueError: operands could not be broadcast together with shapes (3,) (2,) (3,) <-- PR
Also, when there is no overlap (the intersection is empty), this PR succeeds whereas main fails with the same error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit mixed when it comes to performance, but I'd guess sorted indices and having a large caller with a small updater is the common cause, and this is where it's an improvement.
My recommendation would be to detect duplicate indices (in either the caller or the argument) and raise an informative error message, even when we would otherwise succeed (when there is no overlap).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @mroeschke - do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think that makes sense to me. The error message for duplicates in this PR is not great compared to the error message on main.
@@ -177,3 +177,12 @@ def test_update_dt_column_with_NaT_create_column(self): | |||
{"A": [1.0, 3.0], "B": [pd.NaT, pd.to_datetime("2016-01-01")]} | |||
) | |||
tm.assert_frame_equal(df, expected) | |||
|
|||
def test_update_preserve_column_dtype_bool(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for when the index has duplicate values
@@ -27,6 +27,7 @@ Bug fixes | |||
- Fixed bug in :meth:`DataFrame.__setitem__` not inferring string dtype for zero-dimensional array with ``infer_string=True`` (:issue:`55366`) | |||
- Fixed bug in :meth:`DataFrame.idxmin` and :meth:`DataFrame.idxmax` raising for arrow dtypes (:issue:`55368`) | |||
- Fixed bug in :meth:`DataFrame.interpolate` raising incorrect error message (:issue:`55347`) | |||
- Fixed bug in :meth:`DataFrame.update` bool dtype being converted to object (:issue:`55509`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a regression, so would go into 2.2.0.
Thanks for the return. I will add a test similar to the one presented in the discussion. |
Organizing ideas: Tests:
Investigate:
Error messages:
Docs:
Questions:
|
pandas/core/frame.py
Outdated
|
||
for col in self.columns.intersection(other.columns): | ||
this = self[col]._values | ||
that = other[col]._values | ||
this = self[col].loc[indexes_intersection]._values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we avoid chained indexing here?
The source of merging conflict is #56228. The merging conflict can be easily solved since it doesn't affect this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the delay @aureliobarbosa - this fell of my radar.
other = other.reindex(self.index) | ||
indexes_intersection = other.index.intersection( | ||
self.index | ||
) # order is important |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a message as to why order matters
if not len(indexes_intersection): | ||
raise ValueError( | ||
"Can't update dataframe when other has no index in common with " | ||
"this dataframe." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this not backwards compatible? I think this case does not raise on main (in general), is that right?
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
@aureliobarbosa - okay if I finish this up for 3.0? |
@rhshadrach Thanks for bringing it back. Would you mind if I try to solve it until Feb 20th? |
Sure! |
@rhshadrach Tests are passing on my setup but the commit history of this PR got very noised. I would like to move it to a new PR, making reference to this. Is it OK? |
@aureliobarbosa - I think in this case that's fine. For longer PRs (e.g. something like 300+ line diffs) it can create a lot of rework for reviewers, but that's not the case here. |
@rhshadrach Done here. The intention of the commit below was just a merge but I incorporate the main code fix on it (single line), by mistake. Also, I don't understand what is wrong with the doctest. I am available for any question or fix. |
@aureliobarbosa - thanks for the update. There are 2 questions/requests above that should be addressed.
The doctest is failing on |
@rhshadrach Thanks for the feedback. I will try to address this issue this weekend. |
@rhshadrach I decide to move proposed fix to PR #57637 because there were a few problems on the commit history of this PR that I couldn't fix myself, and also because it was possible to take a new approach on the code there, which I think is simpler and more maintainable. |
Closing in favor of #57637 |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.