-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: ExtensionArray.setitem #19907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: ExtensionArray.setitem #19907
Conversation
@@ -874,22 +875,7 @@ def setitem(self, indexer, value, mgr=None): | |||
values = transf(values) | |||
|
|||
# length checking | |||
# boolean with truth values == len of the value is ok too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was refactored out to pandas/util/_validators
since I needed it in ExtensionArray.setitem
.
@@ -3489,7 +3484,8 @@ def apply(self, f, axes=None, filter=None, do_integrity_check=False, | |||
# with a .values attribute. | |||
aligned_args = dict((k, kwargs[k]) | |||
for k in align_keys | |||
if hasattr(kwargs[k], 'values')) | |||
if hasattr(kwargs[k], 'values') and | |||
not isinstance(kwargs[k], ABCExtensionArray)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an ExtensionArray chooses to store it's data as .values
, setitem
would be broken without this extra check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make a test for this? (eg call the underlying data .values
in of the example test arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DecimalArray calls it's underlying data .values
. I'm going to alias a few other attributes to that (.data
, ._data
, .items
). Any others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty special casey here. shouldn't this check for ._values
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. I don't really know what could be in kwargs. You think it's only ever Index or Series? Or could it be a dataframe or block?
Codecov Report
@@ Coverage Diff @@
## master #19907 +/- ##
=========================================
Coverage ? 91.84%
=========================================
Files ? 153
Lines ? 49289
Branches ? 0
=========================================
Hits ? 45269
Misses ? 4020
Partials ? 0
Continue to review full report at Codecov.
|
pandas/core/internals.py
Outdated
@@ -1898,6 +1884,15 @@ def is_view(self): | |||
"""Extension arrays are never treated as views.""" | |||
return False | |||
|
|||
def setitem(self, indexer, value, mgr=None): | |||
if isinstance(indexer, tuple): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a doc-string
pandas/util/_validators.py
Outdated
""" | ||
from pandas.core.indexing import length_of_indexer | ||
|
||
# boolean with truth values == len of the value is ok too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is applicable in .where (in internals, but maybe also in frame.py). you may want to rename / clean that up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see anywhere in Block.where
where similar logic was used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm maybe I cleaned this up a while back. IN any event this shouldn't be here at all. _validators is not the correct place. this is purely indexing validation. goes in core/indexing.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't really say something about where it belongs (would have to look into more detail), but note that it was copied straight from internals.py
, and it is still called there.
So in that sense this PR is just keeping the situation as it was.
@@ -3489,7 +3484,8 @@ def apply(self, f, axes=None, filter=None, do_integrity_check=False, | |||
# with a .values attribute. | |||
aligned_args = dict((k, kwargs[k]) | |||
for k in align_keys | |||
if hasattr(kwargs[k], 'values')) | |||
if hasattr(kwargs[k], 'values') and | |||
not isinstance(kwargs[k], ABCExtensionArray)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make a test for this? (eg call the underlying data .values
in of the example test arrays?
def test_setitem_expand_columns(self, data): | ||
df = pd.DataFrame({"A": data}) | ||
df['B'] = 1 | ||
assert len(df.columns) == 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test here for expanding with data
?
df = pd.DataFrame({"A": [1]*len(data)})
df['B'] = data
and also one that overwrites an existing column with data
And for both, we should test both with df['col']
as with df.loc[:, 'col']
pandas/core/internals.py
Outdated
The subset of self.values to set | ||
value : object | ||
The value being set | ||
mgr : BlockPlacement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional
pandas/core/internals.py
Outdated
The subset of self.values to set | ||
value : object | ||
The value being set | ||
mgr : BlockPlacement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional
pandas/core/internals.py
Outdated
'indexer' is a direct slice/positional indexer. 'value' must | ||
be a compatible shape. | ||
""" | ||
if isinstance(indexer, tuple): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, you are doing to for compatibility? this is not very robust and should be handled by the caller. indexing is quite tricky as the contracts for who converts what are almost all handled in core/indexing.py So the internals are so of you-know-what-you-are doing. Since you are exposing this almost directly, I think you need some handling routines (and am talking about the indexers here) IN EA, which actually call things in core/indexing.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, you are doing to for compatibility?
I guess, in the sense that things didn't work without it :) I think at some point, probably in indexing.py
, the indexer is translated to a 2-D indexer, even though we know that EAs are always 1-D.
Since you are exposing this almost directly
I don't think the call stack is any different for setting on an ExtensionBlock vs. anything else.
I think you need some handling routines (and am talking about the indexers here) IN EA
IIUC, you're saying that the translation from a 2-D indexer to a 1-D indexer should happen elsewhere, yes? And the validation too? That sounds reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback what place in core/indexing.py
did you have in mind? _setitem_with_indexer
?
I don't see anywhere there that deals with specific blocks, just whether or not the block managner is holding mixed types (split path or not).
pandas/util/_validators.py
Outdated
""" | ||
from pandas.core.indexing import length_of_indexer | ||
|
||
# boolean with truth values == len of the value is ok too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm maybe I cleaned this up a while back. IN any event this shouldn't be here at all. _validators is not the correct place. this is purely indexing validation. goes in core/indexing.py
@@ -71,3 +71,6 @@ def test_value_counts(self, all_data, dropna): | |||
|
|||
class TestCasting(base.BaseCastingTests): | |||
pass | |||
|
|||
# We intentionally don't run base.BaseSetitemTests because pandas' | |||
# internals has trouble setting sequences of values into scalar positions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also add the test class as above, but skip the full class (I think the pytest skip decorators also work for classes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been scared off mixing inheritance and skipping classes since seeing pytest-568, where skipping in a child marks all the other children as skips.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a good reason :-)
pandas/util/_validators.py
Outdated
""" | ||
from pandas.core.indexing import length_of_indexer | ||
|
||
# boolean with truth values == len of the value is ok too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't really say something about where it belongs (would have to look into more detail), but note that it was copied straight from internals.py
, and it is still called there.
So in that sense this PR is just keeping the situation as it was.
This currently fails when the repr is triggered, not in the setitem as it should df = pd.DataFrame({"B": [1, 2, 3]})
df['A'] = IPArray([1, 2])
df._repr_html_() |
What would we expect the type of this to be df = pd.DataFrame({"A": IPArray([1, 2, 3])})
df.loc[:, 'A'] = 1 This results in an integer dtype column. I think that's correct, right? The previous block is being entirely replaced? Then we have symmetry with |
One unhandled case In [3]: s = pd.Series(IPArray([1, 2]), index=[(0, 1), (1, 2)])
In [4]: s[(0, 1)] = 2 ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/sandbox/pandas-ip/pandas/pandas/core/series.py in setitem(key, value)
871 try:
--> 872 self._set_with_engine(key, value)
873 return
~/sandbox/pandas-ip/pandas/pandas/core/series.py in _set_with_engine(self, key, value)
930 try:
--> 931 self.index._engine.set_value(values, key, value)
932 return
TypeError: Argument 'arr' has incorrect type (expected numpy.ndarray, got IPArray)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-4-6c3a0d7e7408> in <module>()
----> 1 s[(0, 1)] = 2
~/sandbox/pandas-ip/pandas/pandas/core/series.py in __setitem__(self, key, value)
922 # do the setitem
923 cacher_needs_updating = self._check_is_chained_assignment_possible()
--> 924 setitem(key, value)
925 if cacher_needs_updating:
926 self._maybe_update_cacher()
~/sandbox/pandas-ip/pandas/pandas/core/series.py in setitem(key, value)
904 if (isinstance(key, tuple) and
905 not isinstance(self.index, MultiIndex)):
--> 906 raise ValueError("Can only tuple-index with a MultiIndex")
907
908 # python 3 type errors should be raised
ValueError: Can only tuple-index with a MultiIndex The first exception is the problem. That cython method is expecting an ndarray, but we're giving it an ndarray or extensionarray. |
All our extension array setitem tests were hitting this. We only caught it because only tuple reraised.
Running asv with 43dfd7d now. |
The last commit allowing EAs in the |
Hmm, perhaps it's worth just punting on In [1]: import pandas as pd
In [2]: arr = pd.date_range('2017', periods=4, tz='US/Eastern')
In [3]: s = pd.Series(arr, index=[(0, 1), (0, 2), (0, 3), (0, 4)])
In [4]: s[(0, 1)] = float('NaN')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/sandbox/pandas-ip/pandas/pandas/core/series.py in setitem(key, value)
871 try:
--> 872 self._set_with_engine(key, value)
873 return
~/sandbox/pandas-ip/pandas/pandas/core/series.py in _set_with_engine(self, key, value)
930 try:
--> 931 self.index._engine.set_value(values, key, value)
932 return
TypeError: Argument 'arr' has incorrect type (expected numpy.ndarray, got DatetimeIndex)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-4-0a0fdcf8dcd6> in <module>()
----> 1 s[(0, 1)] = float('NaN')
~/sandbox/pandas-ip/pandas/pandas/core/series.py in __setitem__(self, key, value)
922 # do the setitem
923 cacher_needs_updating = self._check_is_chained_assignment_possible()
--> 924 setitem(key, value)
925 if cacher_needs_updating:
926 self._maybe_update_cacher()
~/sandbox/pandas-ip/pandas/pandas/core/series.py in setitem(key, value)
904 if (isinstance(key, tuple) and
905 not isinstance(self.index, MultiIndex)):
--> 906 raise ValueError("Can only tuple-index with a MultiIndex")
907
908 # python 3 type errors should be raised
ValueError: Can only tuple-index with a MultiIndex With this hack, I can get it working diff --git a/pandas/core/series.py b/pandas/core/series.py
index e48012420..cd412844e 100644
--- a/pandas/core/series.py
+++ b/pandas/core/series.py
@@ -869,6 +869,18 @@ class Series(base.IndexOpsMixin, generic.NDFrame):
def setitem(key, value):
try:
+ if self._data.blocks[0].is_extension:
+ raise TypeError
self._set_with_engine(key, value)
return
except com.SettingWithCopyError: I don't think that's worthwhile including to support tuples in the index. I'd prefer to see a proper fix for whatever is going wrong there. |
No problem with that for now.
I thought we somehow handled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Added some minor comments on the tests.
assert df.loc[10, 'B'] == data[1] | ||
|
||
@pytest.mark.parametrize('as_callable', [True, False]) | ||
def test_set_mask_aligned(self, data, as_callable): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe call this test_setitem_..
instead of test_set_...
to be consistent with above
assert df.loc[10, 'B'] == data[1] | ||
|
||
@pytest.mark.parametrize('as_callable', [True, False]) | ||
def test_set_mask_aligned(self, data, as_callable): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or parametrize this for setitem and loc?
assert ser[0] == data[5] | ||
assert ser[1] == data[6] | ||
|
||
def test_set_mask_broadcast(self, data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here (set -> setitem)
|
||
result = df.copy() | ||
result.loc[:, 'B'] = 1 | ||
self.assert_frame_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add overwriting the existing int B
column with data
?
pandas/core/frame.py
Outdated
value = value.copy() | ||
# Copy donesn't have any effect at the moment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you adding this comment? this is on-purpose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment isn't clear, sorry. It's to answer "why call value.copy()
" and then pass it to a function that takes a copy
parameter. That's because __sanitize_index
doesn't copy an EA, even with copy=True
.
pandas/core/indexing.py
Outdated
When the indexer is an ndarray or list and the lengths don't | ||
match. | ||
""" | ||
from pandas.core.indexing import length_of_indexer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unecessary import
pandas/core/indexing.py
Outdated
match. | ||
""" | ||
from pandas.core.indexing import length_of_indexer | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to assert com.is_bool_indexer
here (and you can actually remove some of this logic as it duplicates).
further should actually move is_bool_indexer
to pandas.core.indexing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think com.is_bool_indexer
serves a different purpose, and should already have been called typically when ending up in this place.
This helper function is only for generating informative error messages when the length does not match, com.is_bool_indexer
does actual conversion and inference of the indexer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course, but its serving a very similar purpose to this and needs to be moved here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_bool_indexer
seems to accept list-like, while this explicitly needs an ndarray since it checks indexer[indexer]
. Given that I need to do isinstance(indexer, ndarray)
anyway, to ensure that that boolean masking works, to tradeoff is
if not (isinstance(indexer, np.ndarray) and
indexer.dtype == np.bool_ and
len(indexer[indexer]) == len(value)):
vs.
if not (isinstance(indexer, np.ndarray) and
com.is_bool_indexer(indexer) and
len(indexer[indexer]) == len(value)):
Since the first is likely to be faster, I'd prefer to just go with that.
@@ -3489,7 +3484,8 @@ def apply(self, f, axes=None, filter=None, do_integrity_check=False, | |||
# with a .values attribute. | |||
aligned_args = dict((k, kwargs[k]) | |||
for k in align_keys | |||
if hasattr(kwargs[k], 'values')) | |||
if hasattr(kwargs[k], 'values') and | |||
not isinstance(kwargs[k], ABCExtensionArray)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty special casey here. shouldn't this check for ._values
?
that was causing issues with factorize.
I had to remove the |
can you rebase |
I think all the tests have passed an there's no merge conflict.
…On Sun, Apr 1, 2018 at 9:26 AM, Jeff Reback ***@***.***> wrote:
can you rebase
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#19907 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIrvrkqmLuCjxyufLE-LJRyKtIlhVks5tkOOMgaJpZM4STfE9>
.
|
merge when ready- |
thanks @TomAugspurger |
xref #19696
Adds
ExtensionBlock.__setitem__
.This is only tested for with
DecimalArray
andCategorical
. Supporting something likeJSONArray
where the "scalar" elements are actual sequences runs into issues elsewhere in internals. Maybe someday we can support that.