Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.update bool dtype being converted to object #55509 #55634

Closed
wants to merge 16 commits into from

Conversation

aureliobarbosa
Copy link
Contributor

@aureliobarbosa aureliobarbosa commented Oct 22, 2023

abhi-glitchhg and others added 2 commits October 22, 2023 15:52
…ndas-dev#55589)

* Update generic.py

* Update generic.py

* Update generic.py

Fix for issue pandas-dev#55509 (updating intersection only)

fix typo - black

Added test
@aureliobarbosa
Copy link
Contributor Author

This is my first PR. For some reason that I still can't understand the commit that was immediately before the one I submitted got resubmitted. I imagine it can be something usual and easy to fix, but let me know if it is necessary that I fix something (either on GitHub or on my local repository).

@aureliobarbosa
Copy link
Contributor Author

@rhshadrach

In short. To capture this bug I analyzed the "Unit Tests / Copy-on-Write 3.9 (pull_request)" and captured the single test giving the warning. After studying it it was possible to reproduce this test on my machine using the same pytest called by you on the GitHub CI with a proper environment setup for testing "copy-on-write". For the latter I tested using both the environment variable and pandas.set_option().

In the pull request above the code is failing on FutureWarning, and I believe that my code is giving the proper results (which were check through local builds), but that the test is outdated (it refer to #3217, which is from before 2013).

Below I put the actual test that I think is wrong, and the new proposed test (for which my new code on update passes):

File: pandas/tests/frame/methods/test_update.py
(Possibly) wrong test:

    def test_update_with_different_dtype(self, using_copy_on_write):
        # GH#3217
        df = DataFrame({"a": [1, 3], "b": [np.nan, 2]})
        df["c"] = np.nan
        if using_copy_on_write:
            df.update({"c": Series(["foo"], index=[0])})
        else:
            with tm.assert_produces_warning(FutureWarning, match="incompatible dtype"):
                df["c"].update(Series(["foo"], index=[0]))

        expected = DataFrame({"a": [1, 3], "b": [np.nan, 2], "c": ["foo", np.nan]})
        tm.assert_frame_equal(df, expected)

Curren proposed test:

    def test_update_with_different_dtype(self, using_copy_on_write):
        # GH#3217 #55509
        df = DataFrame({"a": [1, 3], "b": [np.nan, 2]})
        df["c"] = np.nan
        with tm.assert_produces_warning(FutureWarning, match="incompatible dtype"):
            if using_copy_on_write:
                df.update({"c": Series(["foo"], index=[0])})
            else:
                df["c"].update(Series(["foo"], index=[0]))

        expected = DataFrame({"a": [1, 3], "b": [np.nan, 2], "c": ["foo", np.nan]})
        tm.assert_frame_equal(df, expected)

Given this information I am not sure how to proceed:

  1. Open an issue to correct the test. Wait for review.
  2. Make a new pull request fixing test and the new code (separating the commits for history checks).
  3. Close this pull request and wait for further discussion among core developers.

Regards

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! I think we should add the test from the issue (#55509 (comment)).

@@ -8857,11 +8857,11 @@ def update(
if not isinstance(other, DataFrame):
other = DataFrame(other)

other = other.reindex(self.index)
indexes_intersection = self.index.intersection(other.index)
Copy link
Member

@rhshadrach rhshadrach Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a monotonic index, I'm seeing a perf increase

size = 100_000
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}, index=range(1, size+1))
df2 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 12, 11])

%timeit df1.update(df2)
# 1.33 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- main
# 372 µs ± 663 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)    <-- PR

But a perf regression for nonmonotonic:

size = 100_000
index = list(range(1, size+1))
index[1], index[0] = index[0], index[1]
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}, index=index)
df2 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 12, 11])
%timeit df1.update(df2)
# 1.1 ms ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- main
# 2.52 ms ± 7.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    <-- PR

And a perf regression when the caller's index is small and the argument's index is large:

size = 100_000
df1 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 11, 12])
df2 = pd.DataFrame({'a': np.random.randint(0, 100, size)})
%timeit df1.update(df2)
# 186 µs ± 997 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  <-- main
# 329 µs ± 595 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)   <-- PR

When there are duplicates and the index of the caller and argument are the same, both main and this PR succeed. When they are not the same, both fail but with different error messages

size = 10
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)})
df2 = pd.DataFrame({'a': [-1, -2, -3]}, index=[4, 4, 6])
df1.update(df2)
# ValueError: cannot reindex on an axis with duplicate labels  <-- main
# ValueError: operands could not be broadcast together with shapes (3,) (2,) (3,)  <-- PR

Also, when there is no overlap (the intersection is empty), this PR succeeds whereas main fails with the same error message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit mixed when it comes to performance, but I'd guess sorted indices and having a large caller with a small updater is the common cause, and this is where it's an improvement.

My recommendation would be to detect duplicate indices (in either the caller or the argument) and raise an informative error message, even when we would otherwise succeed (when there is no overlap).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @mroeschke - do you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think that makes sense to me. The error message for duplicates in this PR is not great compared to the error message on main.

@@ -177,3 +177,12 @@ def test_update_dt_column_with_NaT_create_column(self):
{"A": [1.0, 3.0], "B": [pd.NaT, pd.to_datetime("2016-01-01")]}
)
tm.assert_frame_equal(df, expected)

def test_update_preserve_column_dtype_bool(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for when the index has duplicate values

@@ -27,6 +27,7 @@ Bug fixes
- Fixed bug in :meth:`DataFrame.__setitem__` not inferring string dtype for zero-dimensional array with ``infer_string=True`` (:issue:`55366`)
- Fixed bug in :meth:`DataFrame.idxmin` and :meth:`DataFrame.idxmax` raising for arrow dtypes (:issue:`55368`)
- Fixed bug in :meth:`DataFrame.interpolate` raising incorrect error message (:issue:`55347`)
- Fixed bug in :meth:`DataFrame.update` bool dtype being converted to object (:issue:`55509`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a regression, so would go into 2.2.0.

@aureliobarbosa
Copy link
Contributor Author

Looking good! I think we should add the test from the issue (#55509 (comment)).

Thanks for the return. I will add a test similar to the one presented in the discussion.

@aureliobarbosa
Copy link
Contributor Author

aureliobarbosa commented Oct 31, 2023

Organizing ideas:

Tests:

Investigate:

  • implementation of DataFrame.clip(), for ideas.
  • performance regressions nonmonotonic case.
  • performance regressions small dataframe with larger dataframe on argument.

Error messages:

  • duplicate indexes.
  • no overlap.

Docs:

  • Change whatsnew.rst, version 2.2

Questions:

  • Was copy-on-write used on the tests above?
  • Should strings be included in the longer test?
  • What should I do about test_update_with_different_dtype(), discussed above?


for col in self.columns.intersection(other.columns):
this = self[col]._values
that = other[col]._values
this = self[col].loc[indexes_intersection]._values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid chained indexing here?

@aureliobarbosa
Copy link
Contributor Author

The source of merging conflict is #56228. The merging conflict can be easily solved since it doesn't affect this PR.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay @aureliobarbosa - this fell of my radar.

other = other.reindex(self.index)
indexes_intersection = other.index.intersection(
self.index
) # order is important
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a message as to why order matters

Comment on lines +8875 to +8879
if not len(indexes_intersection):
raise ValueError(
"Can't update dataframe when other has no index in common with "
"this dataframe."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this not backwards compatible? I think this case does not raise on main (in general), is that right?

Copy link
Contributor

github-actions bot commented Jan 6, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 6, 2024
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@rhshadrach
Copy link
Member

@aureliobarbosa - okay if I finish this up for 3.0?

@rhshadrach rhshadrach reopened this Feb 18, 2024
@rhshadrach rhshadrach self-assigned this Feb 18, 2024
@rhshadrach rhshadrach added the Blocker Blocking issue or pull request for an upcoming release label Feb 18, 2024
@rhshadrach rhshadrach added this to the 3.0 milestone Feb 18, 2024
@aureliobarbosa
Copy link
Contributor Author

@rhshadrach Thanks for bringing it back. Would you mind if I try to solve it until Feb 20th?

@rhshadrach
Copy link
Member

Sure!

@aureliobarbosa
Copy link
Contributor Author

@rhshadrach Tests are passing on my setup but the commit history of this PR got very noised. I would like to move it to a new PR, making reference to this. Is it OK?

@rhshadrach
Copy link
Member

rhshadrach commented Feb 19, 2024

@aureliobarbosa - I think in this case that's fine. For longer PRs (e.g. something like 300+ line diffs) it can create a lot of rework for reviewers, but that's not the case here.

@aureliobarbosa
Copy link
Contributor Author

@rhshadrach Done here.

The intention of the commit below was just a merge but I incorporate the main code fix on it (single line), by mistake.
Merge remote-tracking branch 'upstream/main' into fix_gh55509

Also, I don't understand what is wrong with the doctest.

I am available for any question or fix.

@rhshadrach
Copy link
Member

@aureliobarbosa - thanks for the update. There are 2 questions/requests above that should be addressed.

Also, I don't understand what is wrong with the doctest.

The doctest is failing on DataFrame.to_xarray and this method does use DataFrame.update. It looks like this needs to be addressed.

@aureliobarbosa
Copy link
Contributor Author

@rhshadrach Thanks for the feedback. I will try to address this issue this weekend.

@aureliobarbosa
Copy link
Contributor Author

@rhshadrach I decide to move proposed fix to PR #57637 because there were a few problems on the commit history of this PR that I couldn't fix myself, and also because it was possible to take a new approach on the code there, which I think is simpler and more maintainable.

@mroeschke
Copy link
Member

Closing in favor of #57637

@mroeschke mroeschke closed this Feb 28, 2024
@aureliobarbosa aureliobarbosa deleted the fix_gh55509 branch March 4, 2024 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.update doesn't preserve dtypes
4 participants