BUG: DataFrame.update bool dtype being converted to object #55509 #55634

aureliobarbosa · 2023-10-22T19:35:55Z

Fixes BUG: DataFrame.update doesn't preserve dtypes #55509
Tests added and passed.
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

…ndas-dev#55589) * Update generic.py * Update generic.py * Update generic.py Fix for issue pandas-dev#55509 (updating intersection only) fix typo - black Added test

aureliobarbosa · 2023-10-22T19:40:40Z

This is my first PR. For some reason that I still can't understand the commit that was immediately before the one I submitted got resubmitted. I imagine it can be something usual and easy to fix, but let me know if it is necessary that I fix something (either on GitHub or on my local repository).

aureliobarbosa · 2023-10-24T15:19:27Z

@rhshadrach

In short. To capture this bug I analyzed the "Unit Tests / Copy-on-Write 3.9 (pull_request)" and captured the single test giving the warning. After studying it it was possible to reproduce this test on my machine using the same pytest called by you on the GitHub CI with a proper environment setup for testing "copy-on-write". For the latter I tested using both the environment variable and pandas.set_option().

In the pull request above the code is failing on FutureWarning, and I believe that my code is giving the proper results (which were check through local builds), but that the test is outdated (it refer to #3217, which is from before 2013).

Below I put the actual test that I think is wrong, and the new proposed test (for which my new code on update passes):

File: pandas/tests/frame/methods/test_update.py
(Possibly) wrong test:

    def test_update_with_different_dtype(self, using_copy_on_write):
        # GH#3217
        df = DataFrame({"a": [1, 3], "b": [np.nan, 2]})
        df["c"] = np.nan
        if using_copy_on_write:
            df.update({"c": Series(["foo"], index=[0])})
        else:
            with tm.assert_produces_warning(FutureWarning, match="incompatible dtype"):
                df["c"].update(Series(["foo"], index=[0]))

        expected = DataFrame({"a": [1, 3], "b": [np.nan, 2], "c": ["foo", np.nan]})
        tm.assert_frame_equal(df, expected)

Curren proposed test:

    def test_update_with_different_dtype(self, using_copy_on_write):
        # GH#3217 #55509
        df = DataFrame({"a": [1, 3], "b": [np.nan, 2]})
        df["c"] = np.nan
        with tm.assert_produces_warning(FutureWarning, match="incompatible dtype"):
            if using_copy_on_write:
                df.update({"c": Series(["foo"], index=[0])})
            else:
                df["c"].update(Series(["foo"], index=[0]))

        expected = DataFrame({"a": [1, 3], "b": [np.nan, 2], "c": ["foo", np.nan]})
        tm.assert_frame_equal(df, expected)

Given this information I am not sure how to proceed:

Open an issue to correct the test. Wait for review.
Make a new pull request fixing test and the new code (separating the commits for history checks).
Close this pull request and wait for further discussion among core developers.

Regards

rhshadrach

Looking good! I think we should add the test from the issue (#55509 (comment)).

rhshadrach · 2023-10-28T14:41:24Z

pandas/core/frame.py

@@ -8857,11 +8857,11 @@ def update(
        if not isinstance(other, DataFrame):
            other = DataFrame(other)

-        other = other.reindex(self.index)
+        indexes_intersection = self.index.intersection(other.index)


For a monotonic index, I'm seeing a perf increase

size = 100_000 df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}, index=range(1, size+1)) df2 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 12, 11]) %timeit df1.update(df2) # 1.33 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- main # 372 µs ± 663 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- PR

But a perf regression for nonmonotonic:

size = 100_000 index = list(range(1, size+1)) index[1], index[0] = index[0], index[1] df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}, index=index) df2 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 12, 11]) %timeit df1.update(df2) # 1.1 ms ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- main # 2.52 ms ± 7.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <-- PR

And a perf regression when the caller's index is small and the argument's index is large:

size = 100_000 df1 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 11, 12]) df2 = pd.DataFrame({'a': np.random.randint(0, 100, size)}) %timeit df1.update(df2) # 186 µs ± 997 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) <-- main # 329 µs ± 595 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- PR

When there are duplicates and the index of the caller and argument are the same, both main and this PR succeed. When they are not the same, both fail but with different error messages

size = 10 df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}) df2 = pd.DataFrame({'a': [-1, -2, -3]}, index=[4, 4, 6]) df1.update(df2) # ValueError: cannot reindex on an axis with duplicate labels <-- main # ValueError: operands could not be broadcast together with shapes (3,) (2,) (3,) <-- PR

Also, when there is no overlap (the intersection is empty), this PR succeeds whereas main fails with the same error message.

A bit mixed when it comes to performance, but I'd guess sorted indices and having a large caller with a small updater is the common cause, and this is where it's an improvement.

My recommendation would be to detect duplicate indices (in either the caller or the argument) and raise an informative error message, even when we would otherwise succeed (when there is no overlap).

cc @mroeschke - do you agree?

Yeah I think that makes sense to me. The error message for duplicates in this PR is not great compared to the error message on main.

pandas/core/generic.py

rhshadrach · 2023-10-28T14:46:27Z

pandas/tests/frame/methods/test_update.py

@@ -177,3 +177,12 @@ def test_update_dt_column_with_NaT_create_column(self):
            {"A": [1.0, 3.0], "B": [pd.NaT, pd.to_datetime("2016-01-01")]}
        )
        tm.assert_frame_equal(df, expected)
+
+    def test_update_preserve_column_dtype_bool(self):


Can you add a test for when the index has duplicate values

rhshadrach · 2023-10-28T14:59:23Z

doc/source/whatsnew/v2.1.2.rst

@@ -27,6 +27,7 @@ Bug fixes
 - Fixed bug in :meth:`DataFrame.__setitem__` not inferring string dtype for zero-dimensional array with ``infer_string=True`` (:issue:`55366`)
 - Fixed bug in :meth:`DataFrame.idxmin` and :meth:`DataFrame.idxmax` raising for arrow dtypes (:issue:`55368`)
 - Fixed bug in :meth:`DataFrame.interpolate` raising incorrect error message (:issue:`55347`)
+- Fixed bug in :meth:`DataFrame.update` bool dtype being converted to object (:issue:`55509`)


This isn't a regression, so would go into 2.2.0.

aureliobarbosa · 2023-10-31T12:39:52Z

Looking good! I think we should add the test from the issue (#55509 (comment)).

Thanks for the return. I will add a test similar to the one presented in the discussion.

aureliobarbosa · 2023-10-31T14:06:08Z

Organizing ideas:

Tests:

Add tests for when the index has duplicate values.
Add the longer test checking various types, from issue BUG: DataFrame.update doesn't preserve dtypes #55509.

Investigate:

implementation of DataFrame.clip(), for ideas.
performance regressions nonmonotonic case.
performance regressions small dataframe with larger dataframe on argument.

Error messages:

duplicate indexes.
no overlap.

Docs:

Change whatsnew.rst, version 2.2

Questions:

Was copy-on-write used on the tests above?
Should strings be included in the longer test?
What should I do about test_update_with_different_dtype(), discussed above?

mroeschke · 2023-10-31T16:23:38Z

pandas/core/frame.py


        for col in self.columns.intersection(other.columns):
-            this = self[col]._values
-            that = other[col]._values
+            this = self[col].loc[indexes_intersection]._values


Could we avoid chained indexing here?

aureliobarbosa · 2023-12-01T20:10:05Z

The source of merging conflict is #56228. The merging conflict can be easily solved since it doesn't affect this PR.

rhshadrach

Apologies for the delay @aureliobarbosa - this fell of my radar.

rhshadrach · 2023-12-05T02:03:07Z

pandas/core/frame.py

-        other = other.reindex(self.index)
+        indexes_intersection = other.index.intersection(
+            self.index
+        )  # order is important


Can you add a message as to why order matters

rhshadrach · 2023-12-05T02:03:51Z

pandas/core/frame.py

+        if not len(indexes_intersection):
+            raise ValueError(
+                "Can't update dataframe when other has no index in common with "
+                "this dataframe."
+            )


Isn't this not backwards compatible? I think this case does not raise on main (in general), is that right?

pandas/core/frame.py

github-actions · 2024-01-06T00:05:30Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2024-01-08T21:42:38Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

rhshadrach · 2024-02-18T13:12:00Z

@aureliobarbosa - okay if I finish this up for 3.0?

aureliobarbosa · 2024-02-18T14:03:07Z

@rhshadrach Thanks for bringing it back. Would you mind if I try to solve it until Feb 20th?

rhshadrach · 2024-02-18T14:52:47Z

Sure!

aureliobarbosa · 2024-02-18T21:41:07Z

@rhshadrach Tests are passing on my setup but the commit history of this PR got very noised. I would like to move it to a new PR, making reference to this. Is it OK?

rhshadrach · 2024-02-19T02:59:46Z

@aureliobarbosa - I think in this case that's fine. For longer PRs (e.g. something like 300+ line diffs) it can create a lot of rework for reviewers, but that's not the case here.

aureliobarbosa · 2024-02-20T01:02:46Z

@rhshadrach Done here.

The intention of the commit below was just a merge but I incorporate the main code fix on it (single line), by mistake.
Merge remote-tracking branch 'upstream/main' into fix_gh55509

Also, I don't understand what is wrong with the doctest.

I am available for any question or fix.

rhshadrach · 2024-02-23T20:04:56Z

@aureliobarbosa - thanks for the update. There are 2 questions/requests above that should be addressed.

Also, I don't understand what is wrong with the doctest.

The doctest is failing on DataFrame.to_xarray and this method does use DataFrame.update. It looks like this needs to be addressed.

aureliobarbosa · 2024-02-23T23:21:20Z

@rhshadrach Thanks for the feedback. I will try to address this issue this weekend.

aureliobarbosa · 2024-02-27T12:35:54Z

@rhshadrach I decide to move proposed fix to PR #57637 because there were a few problems on the commit history of this PR that I couldn't fix myself, and also because it was possible to take a new approach on the code there, which I think is simpler and more maintainable.

mroeschke · 2024-02-28T17:56:22Z

Closing in favor of #57637

abhi-glitchhg and others added 2 commits October 22, 2023 15:52

[Documentation] Added another example in df.clip documentation. (pa…

9183b31

…ndas-dev#55589) * Update generic.py * Update generic.py * Update generic.py Fix for issue pandas-dev#55509 (updating intersection only) fix typo - black Added test

BUG: ensure dataFrame.update reads only changing rows (pandas-dev#55509)

af983c8

rhshadrach requested changes Oct 28, 2023

View reviewed changes

rhshadrach reviewed Oct 28, 2023

View reviewed changes

mroeschke reviewed Oct 31, 2023

View reviewed changes

aureliobarbosa added 8 commits November 1, 2023 13:12

fix test not guarding FutureWarning properly

82ecad2

PERF: Avoid chained index - without _values

07046c0

PERF: include _values

e8ee233

Merge remote-tracking branch 'upstream/main' into fix_gh55509

4fd4102

TST: test that update preserve dtype (pandas-dev#55509)

85a107e

TST: update raises on duplicate arg index (pandas-dev#55509)

b2acbc4

TST: msg on update with duplicate indexes (pandas-dev#55509)

c7958e0

BUG: Raise dupl. + no common idx on update pandas-dev#55509

2635c05

aureliobarbosa requested review from rhshadrach and mroeschke November 6, 2023 16:41

aureliobarbosa added 3 commits November 8, 2023 21:45

TST: update with duplicate frame index pandas-dev#55509

bb41261

BUG: update with duplicate frame index pandas-dev#55509

a1786ba

pre-commit

79af494

MichaelTiemannOSC mentioned this pull request Nov 20, 2023

BUG: DataFrame.update doesn't preserve dtypes #55558

Closed

5 tasks

Merge remote-tracking branch 'upstream/main' into fix_gh55509

ee6c3cf

rhshadrach requested changes Dec 5, 2023

View reviewed changes

aureliobarbosa requested a review from rhshadrach December 6, 2023 15:19

github-actions bot added the Stale label Jan 6, 2024

mroeschke closed this Jan 8, 2024

rhshadrach mentioned this pull request Feb 1, 2024

BUG: DataFrame.update() raises FutureWarning due to the use of 'where' method #57124

Open

3 tasks

rhshadrach mentioned this pull request Feb 18, 2024

REGR: DataFrame.update emits spurious warning about downcasting #57485

Merged

5 tasks

rhshadrach reopened this Feb 18, 2024

rhshadrach self-assigned this Feb 18, 2024

rhshadrach added the Blocker Blocking issue or pull request for an upcoming release label Feb 18, 2024

rhshadrach added this to the 3.0 milestone Feb 18, 2024

Merge remote-tracking branch 'upstream/main' into fix_gh55509

ce1b7a7

Merge remote-tracking branch 'upstream/main' into fix_gh55509

fef821f

aureliobarbosa mentioned this pull request Feb 27, 2024

BUG: Fix dataframe.update not updating dtype #55509 v2 #57637

Merged

4 tasks

aureliobarbosa marked this pull request as draft February 27, 2024 02:58

mroeschke closed this Feb 28, 2024

aureliobarbosa deleted the fix_gh55509 branch March 4, 2024 17:22

aureliobarbosa mentioned this pull request Apr 12, 2024

ASV: benchmark for DataFrame.Update #58228

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: DataFrame.update bool dtype being converted to object #55509 #55634

BUG: DataFrame.update bool dtype being converted to object #55509 #55634

aureliobarbosa commented Oct 22, 2023 •

edited

Loading

aureliobarbosa commented Oct 22, 2023

aureliobarbosa commented Oct 24, 2023

rhshadrach left a comment

rhshadrach Oct 28, 2023 •

edited

Loading

rhshadrach Oct 28, 2023

rhshadrach Oct 28, 2023

mroeschke Oct 31, 2023

rhshadrach Oct 28, 2023

rhshadrach Oct 28, 2023

aureliobarbosa commented Oct 31, 2023

aureliobarbosa commented Oct 31, 2023 •

edited

Loading

mroeschke Oct 31, 2023

aureliobarbosa commented Dec 1, 2023

rhshadrach left a comment

rhshadrach Dec 5, 2023

rhshadrach Dec 5, 2023

github-actions bot commented Jan 6, 2024

mroeschke commented Jan 8, 2024

rhshadrach commented Feb 18, 2024

aureliobarbosa commented Feb 18, 2024

rhshadrach commented Feb 18, 2024

aureliobarbosa commented Feb 18, 2024

rhshadrach commented Feb 19, 2024 •

edited

Loading

aureliobarbosa commented Feb 20, 2024

rhshadrach commented Feb 23, 2024

aureliobarbosa commented Feb 23, 2024

aureliobarbosa commented Feb 27, 2024

mroeschke commented Feb 28, 2024

BUG: DataFrame.update bool dtype being converted to object #55509 #55634

BUG: DataFrame.update bool dtype being converted to object #55509 #55634

Conversation

aureliobarbosa commented Oct 22, 2023 • edited Loading

aureliobarbosa commented Oct 22, 2023

aureliobarbosa commented Oct 24, 2023

rhshadrach left a comment

Choose a reason for hiding this comment

rhshadrach Oct 28, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aureliobarbosa commented Oct 31, 2023

aureliobarbosa commented Oct 31, 2023 • edited Loading

Choose a reason for hiding this comment

aureliobarbosa commented Dec 1, 2023

rhshadrach left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Jan 6, 2024

mroeschke commented Jan 8, 2024

rhshadrach commented Feb 18, 2024

aureliobarbosa commented Feb 18, 2024

rhshadrach commented Feb 18, 2024

aureliobarbosa commented Feb 18, 2024

rhshadrach commented Feb 19, 2024 • edited Loading

aureliobarbosa commented Feb 20, 2024

rhshadrach commented Feb 23, 2024

aureliobarbosa commented Feb 23, 2024

aureliobarbosa commented Feb 27, 2024

mroeschke commented Feb 28, 2024

aureliobarbosa commented Oct 22, 2023 •

edited

Loading

rhshadrach Oct 28, 2023 •

edited

Loading

aureliobarbosa commented Oct 31, 2023 •

edited

Loading

rhshadrach commented Feb 19, 2024 •

edited

Loading