BUG: Fix dataframe.update not updating dtype #55509 v2 #57637

Merged
merged 15 commits into pandas-dev:main from aureliobarbosa:fix_gh55509_v2 on Mar 2, 2024

Conversation

aureliobarbosa
Contributor

@aureliobarbosa aureliobarbosa commented Feb 27, 2024

A new PR was opened to replace the previous one (#55634) because a more idiomatic solution was found. The previous PR will be converted to draft.

@rhshadrach @mroeschke
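
For context, a minimal reproduction of the dtype change reported in GH#55509 (a sketch; exact behavior depends on the pandas version):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})           # int64 column
other = pd.DataFrame({"a": [10]}, index=[0])  # update a single row
df.update(other)
print(df["a"].dtype)  # float64 on affected versions; int64 is expected
```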

@aureliobarbosa aureliobarbosa changed the title Fix gh55509 v2 Fix #GH55509 v2 Feb 27, 2024
@aureliobarbosa aureliobarbosa changed the title Fix #GH55509 v2 BUG: Fix dataframe.update not updating dtype#GH55509 v2 Feb 27, 2024
@aureliobarbosa aureliobarbosa changed the title BUG: Fix dataframe.update not updating dtype#GH55509 v2 BUG: Fix dataframe.update not updating dtype #55509 v2 Feb 27, 2024
@datapythonista datapythonista added the Bug and Dtype Conversions (Unexpected or buggy dtype conversions) labels Feb 27, 2024
@datapythonista
Member

Thanks for the contribution @aureliobarbosa. Can you add a note to the release notes please? You can check other PRs if you are not familiar with it.

@aureliobarbosa
Contributor Author

aureliobarbosa commented Feb 27, 2024

Sure. For which version? 3.0? @datapythonista

@datapythonista
Member

> Sure. For which version? 3.0? @datapythonista

Yes please

@aureliobarbosa
Contributor Author

Done @datapythonista

@rhshadrach
Member

Benchmarks from #55634 (comment)

size = 100_000
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}, index=range(1, size+1))
df2 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 12, 11])
%timeit df1.update(df2)
# 1.11 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- PR
# 2.32 ms ± 36.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    <-- main

size = 100_000
index = list(range(1, size+1))
index[1], index[0] = index[0], index[1]
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)}, index=index)
df2 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 12, 11])
%timeit df1.update(df2)
# 1.05 ms ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- PR
# 1.76 ms ± 3.73 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- main

size = 100_000
df1 = pd.DataFrame({'a': np.random.randint(0, 100, 3)}, index=[10, 11, 12])
df2 = pd.DataFrame({'a': np.random.randint(0, 100, size)})
%timeit df1.update(df2)
# 517 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- PR
# 318 µs ± 4.85 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- main

size = 10
df1 = pd.DataFrame({'a': np.random.randint(0, 100, size)})
df2 = pd.DataFrame({'a': [-1, -2, -3]}, index=[4, 4, 6])
df1.update(df2)
# ValueError: Update not allowed with duplicate indexes on other.  <-- PR
# ValueError: cannot reindex on an axis with duplicate labels      <-- main

Member

@rhshadrach rhshadrach left a comment

lgtm - one request

@@ -8764,11 +8764,22 @@ def update(
        if not isinstance(other, DataFrame):
            other = DataFrame(other)

        other = other.reindex(self.index)
        if other.index.has_duplicates:
            raise ValueError("Update not allowed with duplicate indexes on other.")
Member

Can you add a note to the docstring that this is not supported? There isn't currently a Notes section; something like this:

pandas/pandas/core/frame.py

Lines 1450 to 1454 in 737d390

        Notes
        -----
        1. Because ``iterrows`` returns a Series for each row,
           it does **not** preserve dtypes across the rows (dtypes are
           preserved across columns for DataFrames).

Contributor Author

@rhshadrach Done.

DataFrame.reindex raises on the same condition and does not mention it in its docs. What about adding a similar note there?

Member

Slight preference for keeping this PR to just modifying update, but no strong objection to modifying the reindex docs.

Contributor Author

Ok. It can be done in another PR.

@rhshadrach
Member

/preview

Contributor

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/57637/

Member

@rhshadrach rhshadrach left a comment

lgtm

Member

@datapythonista datapythonista left a comment

Looks great, thanks for working on this. I added a couple of comments.

Also, I noticed that we don't seem to have an automated benchmark to check whether DataFrame.update takes longer after this change. I don't think there should be a big impact, but maybe you can check locally how a sample case performs before and after this change (e.g. %timeit df.update(other_df)). Also, you are very welcome to add a benchmark (if you do, maybe you can open another PR and we can merge it first, so we can see if there is a regression here).

Thanks!

@@ -8692,6 +8692,10 @@ def update(
        dict.update : Similar method for dictionaries.
        DataFrame.merge : For column(s)-on-column(s) operations.

        Notes
        --------
Member

I think having the -------- underline longer than the title used to make the CI crash. Not sure why the CI is happy now, but it may be worth making Notes and its underline the same length in case it fails again in the future.

Contributor Author

Now they have the same length.
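
i.e., the underline matching the heading length, as NumPy-style docstring sections expect:

```
Notes
-----
```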

        rows = other.index.intersection(self.index)
        if rows.empty:
            raise ValueError(
                "Can't update dataframe when other has no index in common with "
Member

No big deal, but in the previous error message you say "update not allowed" and here "can't update dataframe" to mean the same thing. Not sure how we usually write error messages, but I think it'd be better to be consistent.

Contributor Author

Agree with keeping "update not allowed", since it conveys that the operation is possible but not allowed. Done.

    )
    def test_update_preserve_dtype(self, value_df, value_other, dtype):
        # GH#55509
        df = DataFrame({"a": [value_df] * 2}, index=[1, 2])
Member

You don't seem to use the dtype parameter in any fixture. Did you forget to specify it in the constructors, or am I missing something?

Contributor Author

@aureliobarbosa aureliobarbosa Feb 29, 2024

Thanks for the review and suggestions @datapythonista .

I will address all points as soon as possible (hope to finish by tomorrow).

Regarding the tests, it's interesting that the dtype is not there! I will take a more detailed look at them.

Member

My question is: why do you have a parameter for the dtype in the pytest parametrization if it's not being used? I had the impression that you were planning to set the dtype explicitly when creating the dataframe, but maybe you forgot. If it's not needed, maybe it's better to remove it from the pytest parameters?

Contributor Author

@aureliobarbosa aureliobarbosa Mar 1, 2024

@datapythonista
You are absolutely right!

My tests were originally based on the tests proposed by @MichaelTiemannOSC. I tried to make them simpler and similar to other tests in the same test set, but since they were failing on main and passing on this PR, I forgot about them.

This is addressed in the last two commits, which make two of the tests I proposed stricter. Nothing changes, but those tests are better now! Thanks for the guidance on this point.

EDIT: I worked on this first because it looked potentially more problematic. The other points will be addressed as soon as possible.

Contributor Author

To complement: I think it is better to state the dtypes explicitly in those cases. That way it is possible to check dtype consistency for native Python ints, NumPy dtypes, and extension arrays, as intended by the parametrization.
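
A sketch of what the parametrized test might look like with the dtypes stated explicitly (parameter values are illustrative; the merged test may differ):

```python
import numpy as np
import pytest

import pandas._testing as tm
from pandas import DataFrame


@pytest.mark.parametrize(
    "value_df, value_other, dtype",
    [
        (1, 2, "int64"),                      # native Python int
        (1.5, 2.5, "float64"),                # native Python float
        (np.uint8(1), np.uint8(2), "uint8"),  # NumPy dtype
        (1, 2, "Int64"),                      # nullable extension dtype
    ],
)
def test_update_preserve_dtype(value_df, value_other, dtype):
    # GH#55509 - updating a value must not change the column dtype
    df = DataFrame({"a": [value_df] * 2}, index=[1, 2], dtype=dtype)
    other = DataFrame({"a": [value_other]}, index=[1], dtype=dtype)
    expected = DataFrame({"a": [value_other, value_df]}, index=[1, 2], dtype=dtype)
    df.update(other)
    tm.assert_frame_equal(df, expected)
```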

@aureliobarbosa
Contributor Author

@datapythonista I am looking into ASV right now. Which kind of benchmark would be more useful: time, memory, or both? Can you point me to a well-designed pandas benchmark that I can use as a reference to get started?

Independently of this, I will run manual benchmarks locally and post them here for reference.

Member

@datapythonista datapythonista left a comment

Do you think it's worth adding a test case for when the indices have no intersection?

Also, no big deal, but not sure if we can find a better name for the variable rows. Maybe mask or index_intersection is better.

In any case, this looks good to me.

@datapythonista
Member

I think all our benchmarks should be reasonable. Understanding and running asv may be a bit tricky, but writing the benchmark should be trivial: just create some sample data and call DataFrame.update with it. I think checking time is probably good enough. The main complexity, in my opinion, is using enough data so the timing is not super noisy, but not so much that the benchmark is slow.

Also, we use benchmarks more for the algorithms that we implement ourselves (like in C) than to check the performance of everything. I think in this case it can be useful, but use your own judgement to decide whether it's useful or not.
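
Following up on this, a minimal sketch of what such an asv time benchmark could look like (class name and file placement are illustrative, not the benchmark that was eventually added):

```python
# illustrative asv benchmark (e.g. in asv_bench/benchmarks/frame_methods.py)
import numpy as np

import pandas as pd


class FrameUpdate:
    def setup(self):
        size = 100_000
        self.df = pd.DataFrame(
            {"a": np.random.randint(0, 100, size)}, index=range(size)
        )
        self.other = pd.DataFrame(
            {"a": np.random.randint(0, 100, 3)}, index=[10, 12, 11]
        )

    def time_frame_update(self):
        # asv times this method; update() mutates self.df in place,
        # which is fine here because the replaced values stay int
        self.df.update(self.other)
```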

@aureliobarbosa
Contributor Author

@datapythonista

> Do you think it's worth adding a test case for when the indices have no intersection?

I agree, because this behavior was introduced in this PR. I added a test for this case and changed the raised message slightly to emphasize 'no intersection'.

> not sure if we can find a better name for the variable rows. Maybe mask or index_intersection is better.

The first version of this PR used indexes_intersection, but I then changed it to rows to simplify the notation. Now that you have raised this point, I noticed that a simple search for rows in frame.py returns more than one hundred matches, so I agree with avoiding that name.

@aureliobarbosa
Contributor Author

aureliobarbosa commented Mar 2, 2024

I re-ran the first three manual benchmarks proposed previously by @rhshadrach. Results are summarized below in the same order.

| Data                  | main    | PR      |
|-----------------------|---------|---------|
| monotonic             | 1.8 ms  | 1.29 ms |
| non-monotonic         | 1.83 ms | 1.21 ms |
| small-frame-large-arg | 384 µs  | 595 µs  |

The performance gains are not as large as originally expected, but they are still relevant (28%–34% on large frames with small arguments!). @datapythonista Note that when the frame is small and the argument is large there is a performance loss, but as @rhshadrach pointed out previously, this looks like an edge case.

@aureliobarbosa
Contributor Author

I hope to be done here.

@datapythonista Next thing I will do is to try to implement an ASV benchmark for this function.

@rhshadrach I think the take-home message of this PR is that the function was manipulating data it wasn't expected to touch, and in this case the dtype change was a symptom of that problem.

If you both know or find any bug that could be related or similar to this one, I would be glad to contribute.

Regards
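
For reference, the core of the approach that emerged from this review, in simplified form (a sketch based on the excerpts above; it ignores update()'s overwrite/filter_func options and NaN handling, and is not the exact merged code):

```python
import pandas as pd


def update_sketch(df: pd.DataFrame, other: pd.DataFrame) -> None:
    if other.index.has_duplicates:
        raise ValueError("Update not allowed with duplicate indexes on other.")

    index_intersection = other.index.intersection(df.index)
    if index_intersection.empty:
        raise ValueError(
            "Update not allowed when the index on `other` has no intersection "
            "with this dataframe."
        )

    # Only rows both frames share are touched, so values (and dtypes)
    # outside the intersection are never modified.
    other = other.reindex(index_intersection)
    for col in df.columns.intersection(other.columns):
        df.loc[index_intersection, col] = other[col]
```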

Member

@datapythonista datapythonista left a comment

Excellent work @aureliobarbosa.

Do you want to have a last look @rhshadrach?

Member

@rhshadrach rhshadrach left a comment

lgtm

@rhshadrach rhshadrach added this to the 3.0 milestone Mar 2, 2024
@rhshadrach rhshadrach merged commit 8fde168 into pandas-dev:main Mar 2, 2024
45 of 47 checks passed
@rhshadrach
Member

Thanks @aureliobarbosa!

@aureliobarbosa aureliobarbosa deleted the fix_gh55509_v2 branch March 5, 2024 11:57
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Labels
Bug, Dtype Conversions (Unexpected or buggy dtype conversions)
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.update doesn't preserve dtypes
3 participants