-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
API / CoW: Copy NumPy arrays by default in DataFrame constructor #51731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 15 commits
563257e
f3161a3
3a95311
17cf5ae
07aa26d
8e84d85
f3ccf0f
5cdc6ad
3e384ea
49ee53f
fcc7be2
9223836
a474bf5
d5a0268
0be7fc6
293f8a5
3376d06
265d9e3
9ac1bae
db92ce4
be9cb04
4bf3ee8
e2eceec
65965c6
ecb756c
8e837d9
177fbbc
2bbff3b
5bef4ba
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,6 +15,8 @@ | |
import numpy as np | ||
from numpy import ma | ||
|
||
from pandas._config import using_copy_on_write | ||
|
||
from pandas._libs import lib | ||
|
||
from pandas.core.dtypes.cast import ( | ||
|
@@ -291,6 +293,16 @@ def ndarray_to_mgr( | |
if values.ndim == 1: | ||
values = values.reshape(-1, 1) | ||
|
||
elif ( | ||
using_copy_on_write() | ||
and isinstance(values, np.ndarray) | ||
and (dtype is None or is_dtype_equal(values.dtype, dtype)) | ||
and copy_on_sanitize | ||
): | ||
# switch to same memory layout as with blk.copy() | ||
values = np.array(values, order="F", copy=copy_on_sanitize) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is the order="F" thing something we should be doing in general in cases with copy=True? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We are operating on the transposed array when copying a DataFrame object, so not needed in that case There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. xref #44871 took a look at preserving order when doing copy. there are some tradeoffs here |
||
values = _ensure_2d(values) | ||
|
||
elif isinstance(values, (np.ndarray, ExtensionArray, ABCSeries, Index)): | ||
# drop subclass info | ||
values = np.array(values, copy=copy_on_sanitize) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -47,8 +47,9 @@ def test_fillna_dict_inplace_nonunique_columns(self, using_copy_on_write): | |
def test_fillna_on_column_view(self, using_copy_on_write): | ||
# GH#46149 avoid unnecessary copies | ||
arr = np.full((40, 50), np.nan) | ||
df = DataFrame(arr) | ||
df = DataFrame(arr, copy=False) | ||
|
||
# TODO(CoW): This should raise a chained assignment error | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added this to the list in the overview issue #48998 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thx |
||
df[0].fillna(-1, inplace=True) | ||
if using_copy_on_write: | ||
assert np.isnan(arr[:, 0]).all() | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,11 +23,15 @@ def test_to_numpy_dtype(self): | |
tm.assert_numpy_array_equal(result, expected) | ||
|
||
@td.skip_array_manager_invalid_test | ||
def test_to_numpy_copy(self): | ||
def test_to_numpy_copy(self, using_copy_on_write): | ||
arr = np.random.randn(4, 3) | ||
df = DataFrame(arr) | ||
assert df.values.base is arr | ||
assert df.to_numpy(copy=False).base is arr | ||
if using_copy_on_write: | ||
assert df.values.base is not arr | ||
assert df.to_numpy(copy=False).base is not arr | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||
else: | ||
assert df.values.base is arr | ||
assert df.to_numpy(copy=False).base is arr | ||
assert df.to_numpy(copy=True).base is not arr | ||
|
||
def test_to_numpy_mixed_dtype_to_str(self): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -119,7 +119,7 @@ def test_transpose_get_view(self, float_frame): | |
assert (float_frame.values[5:10] == 5).all() | ||
|
||
@td.skip_array_manager_invalid_test | ||
def test_transpose_get_view_dt64tzget_view(self): | ||
def test_transpose_get_view_dt64tzget_view(self, using_copy_on_write): | ||
dti = date_range("2016-01-01", periods=6, tz="US/Pacific") | ||
arr = dti._data.reshape(3, 2) | ||
df = DataFrame(arr) | ||
|
@@ -129,7 +129,10 @@ def test_transpose_get_view_dt64tzget_view(self): | |
assert result._mgr.nblocks == 1 | ||
|
||
rtrip = result._mgr.blocks[0].values | ||
assert np.shares_memory(arr._ndarray, rtrip._ndarray) | ||
if using_copy_on_write: | ||
assert not np.shares_memory(arr._ndarray, rtrip._ndarray) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Similarly here, for the intent of the test, I think we should still try to verify that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||
else: | ||
assert np.shares_memory(arr._ndarray, rtrip._ndarray) | ||
|
||
def test_transpose_not_inferring_dt(self): | ||
# GH#51546 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -306,18 +306,24 @@ def test_constructor_dtype_nocast_view_2d_array( | |
assert df2._mgr.arrays[0].flags.c_contiguous | ||
|
||
@td.skip_array_manager_invalid_test | ||
def test_1d_object_array_does_not_copy(self): | ||
def test_1d_object_array_does_not_copy(self, using_copy_on_write): | ||
# https://github.com/pandas-dev/pandas/issues/39272 | ||
arr = np.array(["a", "b"], dtype="object") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or add There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sure, done |
||
df = DataFrame(arr) | ||
assert np.shares_memory(df.values, arr) | ||
if using_copy_on_write: | ||
assert not np.shares_memory(df.values, arr) | ||
else: | ||
assert np.shares_memory(df.values, arr) | ||
|
||
@td.skip_array_manager_invalid_test | ||
def test_2d_object_array_does_not_copy(self): | ||
def test_2d_object_array_does_not_copy(self, using_copy_on_write): | ||
# https://github.com/pandas-dev/pandas/issues/39272 | ||
arr = np.array([["a", "b"], ["c", "d"]], dtype="object") | ||
df = DataFrame(arr) | ||
assert np.shares_memory(df.values, arr) | ||
if using_copy_on_write: | ||
assert not np.shares_memory(df.values, arr) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||
else: | ||
assert np.shares_memory(df.values, arr) | ||
|
||
def test_constructor_dtype_list_data(self): | ||
df = DataFrame([[1, "2"], [None, "a"]], dtype=object) | ||
|
@@ -2107,13 +2113,18 @@ def test_constructor_frame_shallow_copy(self, float_frame): | |
cop.index = np.arange(len(cop)) | ||
tm.assert_frame_equal(float_frame, orig) | ||
|
||
def test_constructor_ndarray_copy(self, float_frame, using_array_manager): | ||
def test_constructor_ndarray_copy( | ||
self, float_frame, using_array_manager, using_copy_on_write | ||
): | ||
if not using_array_manager: | ||
arr = float_frame.values.copy() | ||
df = DataFrame(arr) | ||
|
||
arr[5] = 5 | ||
assert (df.values[5] == 5).all() | ||
if using_copy_on_write: | ||
assert not (df.values[5] == 5).all() | ||
else: | ||
assert (df.values[5] == 5).all() | ||
|
||
df = DataFrame(arr, copy=True) | ||
arr[6] = 6 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the order="F" specifically needed for CoW? (and can you add a comment about it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about #50756
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understand that it's related to that, but I don't understand why we would only do it for CoW? The default is not to copy arrays (without CoW) at the moment, which creates this inefficient layout; but so if the user manually specifies copy=True in the constructor, why not directly copy it to the desired layout in all cases without the if/else check for CoW?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah did not think about this, will add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on @jbrockmendel s comment above I think we should leave it out for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leave it out in general (from this PR), or you mean not do it for the default mode for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely default mode and maybe also split the copy and the layout change into two PRs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reiterating preference for this to be separate. The two of you are much more familiar with the CoW logic than most of the rest of us; i get antsy when i see using_copy_on_write checks appearing in new places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed all order relevant changes, is more straightforward now