Skip to content

ENH: ExtensionArray.setitem #19907

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Apr 16, 2018
Merged

ENH: ExtensionArray.setitem #19907

merged 21 commits into from
Apr 16, 2018

Conversation

TomAugspurger
Copy link
Contributor

xref #19696

Adds ExtensionBlock.__setitem__.

This is only tested for with DecimalArray and Categorical. Supporting something like JSONArray where the "scalar" elements are actual sequences runs into issues elsewhere in internals. Maybe someday we can support that.

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Feb 26, 2018
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Feb 26, 2018
@@ -874,22 +875,7 @@ def setitem(self, indexer, value, mgr=None):
values = transf(values)

# length checking
# boolean with truth values == len of the value is ok too
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was refactored out to pandas/util/_validators since I needed it in ExtensionArray.setitem.

@@ -3489,7 +3484,8 @@ def apply(self, f, axes=None, filter=None, do_integrity_check=False,
# with a .values attribute.
aligned_args = dict((k, kwargs[k])
for k in align_keys
if hasattr(kwargs[k], 'values'))
if hasattr(kwargs[k], 'values') and
not isinstance(kwargs[k], ABCExtensionArray))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an ExtensionArray chooses to store it's data as .values, setitem would be broken without this extra check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make a test for this? (eg call the underlying data .values in of the example test arrays?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DecimalArray calls it's underlying data .values. I'm going to alias a few other attributes to that (.data, ._data, .items). Any others?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty special casey here. shouldn't this check for ._values?

Copy link
Contributor Author

@TomAugspurger TomAugspurger Mar 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I don't really know what could be in kwargs. You think it's only ever Index or Series? Or could it be a dataframe or block?

@TomAugspurger TomAugspurger mentioned this pull request Feb 26, 2018
15 tasks
@codecov
Copy link

codecov bot commented Feb 27, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@0bd8a5a). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #19907   +/-   ##
=========================================
  Coverage          ?   91.84%           
=========================================
  Files             ?      153           
  Lines             ?    49289           
  Branches          ?        0           
=========================================
  Hits              ?    45269           
  Misses            ?     4020           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.23% <100%> (?)
#single 41.9% <42.1%> (?)
Impacted Files Coverage Δ
pandas/core/internals.py 95.53% <100%> (ø)
pandas/core/frame.py 97.16% <100%> (ø)
pandas/core/indexing.py 93.08% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0bd8a5a...3cbe078. Read the comment docs.

@@ -1898,6 +1884,15 @@ def is_view(self):
"""Extension arrays are never treated as views."""
return False

def setitem(self, indexer, value, mgr=None):
if isinstance(indexer, tuple):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string

"""
from pandas.core.indexing import length_of_indexer

# boolean with truth values == len of the value is ok too
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is applicable in .where (in internals, but maybe also in frame.py). you may want to rename / clean that up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see anywhere in Block.where where similar logic was used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm maybe I cleaned this up a while back. IN any event this shouldn't be here at all. _validators is not the correct place. this is purely indexing validation. goes in core/indexing.py

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really say something about where it belongs (would have to look into more detail), but note that it was copied straight from internals.py, and it is still called there.
So in that sense this PR is just keeping the situation as it was.

@@ -3489,7 +3484,8 @@ def apply(self, f, axes=None, filter=None, do_integrity_check=False,
# with a .values attribute.
aligned_args = dict((k, kwargs[k])
for k in align_keys
if hasattr(kwargs[k], 'values'))
if hasattr(kwargs[k], 'values') and
not isinstance(kwargs[k], ABCExtensionArray))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make a test for this? (eg call the underlying data .values in of the example test arrays?

def test_setitem_expand_columns(self, data):
df = pd.DataFrame({"A": data})
df['B'] = 1
assert len(df.columns) == 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test here for expanding with data ?

df = pd.DataFrame({"A": [1]*len(data)})
df['B'] = data

and also one that overwrites an existing column with data

And for both, we should test both with df['col'] as with df.loc[:, 'col']

The subset of self.values to set
value : object
The value being set
mgr : BlockPlacement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional

The subset of self.values to set
value : object
The value being set
mgr : BlockPlacement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional

'indexer' is a direct slice/positional indexer. 'value' must
be a compatible shape.
"""
if isinstance(indexer, tuple):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, you are doing to for compatibility? this is not very robust and should be handled by the caller. indexing is quite tricky as the contracts for who converts what are almost all handled in core/indexing.py So the internals are so of you-know-what-you-are doing. Since you are exposing this almost directly, I think you need some handling routines (and am talking about the indexers here) IN EA, which actually call things in core/indexing.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, you are doing to for compatibility?

I guess, in the sense that things didn't work without it :) I think at some point, probably in indexing.py, the indexer is translated to a 2-D indexer, even though we know that EAs are always 1-D.

Since you are exposing this almost directly

I don't think the call stack is any different for setting on an ExtensionBlock vs. anything else.

I think you need some handling routines (and am talking about the indexers here) IN EA

IIUC, you're saying that the translation from a 2-D indexer to a 1-D indexer should happen elsewhere, yes? And the validation too? That sounds reasonable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback what place in core/indexing.py did you have in mind? _setitem_with_indexer?

I don't see anywhere there that deals with specific blocks, just whether or not the block managner is holding mixed types (split path or not).

"""
from pandas.core.indexing import length_of_indexer

# boolean with truth values == len of the value is ok too
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm maybe I cleaned this up a while back. IN any event this shouldn't be here at all. _validators is not the correct place. this is purely indexing validation. goes in core/indexing.py

@@ -71,3 +71,6 @@ def test_value_counts(self, all_data, dropna):

class TestCasting(base.BaseCastingTests):
pass

# We intentionally don't run base.BaseSetitemTests because pandas'
# internals has trouble setting sequences of values into scalar positions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also add the test class as above, but skip the full class (I think the pytest skip decorators also work for classes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been scared off mixing inheritance and skipping classes since seeing pytest-568, where skipping in a child marks all the other children as skips.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good reason :-)

"""
from pandas.core.indexing import length_of_indexer

# boolean with truth values == len of the value is ok too
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really say something about where it belongs (would have to look into more detail), but note that it was copied straight from internals.py, and it is still called there.
So in that sense this PR is just keeping the situation as it was.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Mar 21, 2018

This currently fails when the repr is triggered, not in the setitem as it should

df = pd.DataFrame({"B": [1, 2, 3]})
df['A'] = IPArray([1, 2])
df._repr_html_()

@TomAugspurger
Copy link
Contributor Author

What would we expect the type of this to be

df = pd.DataFrame({"A": IPArray([1, 2, 3])})
df.loc[:, 'A'] = 1

This results in an integer dtype column. I think that's correct, right? The previous block is being entirely replaced? Then we have symmetry with df['A'] = 1.

@TomAugspurger
Copy link
Contributor Author

One unhandled case

In [3]: s = pd.Series(IPArray([1, 2]), index=[(0, 1), (1, 2)])

In [4]: s[(0, 1)] = 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/sandbox/pandas-ip/pandas/pandas/core/series.py in setitem(key, value)
    871             try:
--> 872                 self._set_with_engine(key, value)
    873                 return

~/sandbox/pandas-ip/pandas/pandas/core/series.py in _set_with_engine(self, key, value)
    930         try:
--> 931             self.index._engine.set_value(values, key, value)
    932             return

TypeError: Argument 'arr' has incorrect type (expected numpy.ndarray, got IPArray)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-4-6c3a0d7e7408> in <module>()
----> 1 s[(0, 1)] = 2

~/sandbox/pandas-ip/pandas/pandas/core/series.py in __setitem__(self, key, value)
    922         # do the setitem
    923         cacher_needs_updating = self._check_is_chained_assignment_possible()
--> 924         setitem(key, value)
    925         if cacher_needs_updating:
    926             self._maybe_update_cacher()

~/sandbox/pandas-ip/pandas/pandas/core/series.py in setitem(key, value)
    904                 if (isinstance(key, tuple) and
    905                         not isinstance(self.index, MultiIndex)):
--> 906                     raise ValueError("Can only tuple-index with a MultiIndex")
    907
    908                 # python 3 type errors should be raised

ValueError: Can only tuple-index with a MultiIndex

The first exception is the problem. That cython method is expecting an ndarray, but we're giving it an ndarray or extensionarray.

All our extension array setitem tests were hitting this. We only caught it
because only tuple reraised.
@TomAugspurger
Copy link
Contributor Author

Running asv with 43dfd7d now.

@TomAugspurger
Copy link
Contributor Author

The last commit allowing EAs in the index.engine.set* had trouble with us using DatetimeIndex as a storage container for Datetime with TZ. We tried to directly modify the DatetimeIndex with values[key] = value. Previously this wasn't hit because we raised a TypeError earlier on, sending us down a different code path. I think that someday we'll want something like 43dfd7d, but not right now.

@TomAugspurger
Copy link
Contributor Author

Hmm, perhaps it's worth just punting on __setitem__ on EAs with tuples for now. It's already failing on master for non-ndarrays

In [1]: import pandas as pd

In [2]: arr = pd.date_range('2017', periods=4, tz='US/Eastern')

In [3]: s = pd.Series(arr, index=[(0, 1), (0, 2), (0, 3), (0, 4)])

In [4]: s[(0, 1)] = float('NaN')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/sandbox/pandas-ip/pandas/pandas/core/series.py in setitem(key, value)
    871             try:
--> 872                 self._set_with_engine(key, value)
    873                 return

~/sandbox/pandas-ip/pandas/pandas/core/series.py in _set_with_engine(self, key, value)
    930         try:
--> 931             self.index._engine.set_value(values, key, value)
    932             return

TypeError: Argument 'arr' has incorrect type (expected numpy.ndarray, got DatetimeIndex)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-4-0a0fdcf8dcd6> in <module>()
----> 1 s[(0, 1)] = float('NaN')

~/sandbox/pandas-ip/pandas/pandas/core/series.py in __setitem__(self, key, value)
    922         # do the setitem
    923         cacher_needs_updating = self._check_is_chained_assignment_possible()
--> 924         setitem(key, value)
    925         if cacher_needs_updating:
    926             self._maybe_update_cacher()

~/sandbox/pandas-ip/pandas/pandas/core/series.py in setitem(key, value)
    904                 if (isinstance(key, tuple) and
    905                         not isinstance(self.index, MultiIndex)):
--> 906                     raise ValueError("Can only tuple-index with a MultiIndex")
    907
    908                 # python 3 type errors should be raised

ValueError: Can only tuple-index with a MultiIndex

With this hack, I can get it working

diff --git a/pandas/core/series.py b/pandas/core/series.py
index e48012420..cd412844e 100644
--- a/pandas/core/series.py
+++ b/pandas/core/series.py
@@ -869,6 +869,18 @@ class Series(base.IndexOpsMixin, generic.NDFrame):
 
         def setitem(key, value):
             try:
+                if self._data.blocks[0].is_extension:
+                    raise TypeError
                 self._set_with_engine(key, value)
                 return
             except com.SettingWithCopyError:

I don't think that's worthwhile including to support tuples in the index. I'd prefer to see a proper fix for whatever is going wrong there.

@jorisvandenbossche
Copy link
Member

perhaps it's worth just punting on setitem on EAs with tuples for now. It's already failing on master for non-ndarrays

No problem with that for now.

[on df.loc[:, 'A'] = 1] This results in an integer dtype column. I think that's correct, right? The previous block is being entirely replaced? Then we have symmetry with df['A'] = 1.

I thought we somehow handled df.loc[:, 'A'] = 1 and df['A'] = 1 differently right now, but it seems that also in case of df.loc[:, 'A'] = 1 with 'normal' dtypes like int and float, the full block gets replaced. In that case, yes, I think we want to do the same here.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!
Added some minor comments on the tests.

assert df.loc[10, 'B'] == data[1]

@pytest.mark.parametrize('as_callable', [True, False])
def test_set_mask_aligned(self, data, as_callable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this test_setitem_.. instead of test_set_... to be consistent with above

assert df.loc[10, 'B'] == data[1]

@pytest.mark.parametrize('as_callable', [True, False])
def test_set_mask_aligned(self, data, as_callable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or parametrize this for setitem and loc?

assert ser[0] == data[5]
assert ser[1] == data[6]

def test_set_mask_broadcast(self, data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here (set -> setitem)


result = df.copy()
result.loc[:, 'B'] = 1
self.assert_frame_equal(result, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add overwriting the existing int B column with data ?

value = value.copy()
# Copy donesn't have any effect at the moment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you adding this comment? this is on-purpose

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment isn't clear, sorry. It's to answer "why call value.copy()" and then pass it to a function that takes a copy parameter. That's because __sanitize_index doesn't copy an EA, even with copy=True.

When the indexer is an ndarray or list and the lengths don't
match.
"""
from pandas.core.indexing import length_of_indexer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unecessary import

match.
"""
from pandas.core.indexing import length_of_indexer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to assert com.is_bool_indexer here (and you can actually remove some of this logic as it duplicates).

further should actually move is_bool_indexer to pandas.core.indexing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think com.is_bool_indexer serves a different purpose, and should already have been called typically when ending up in this place.
This helper function is only for generating informative error messages when the length does not match, com.is_bool_indexer does actual conversion and inference of the indexer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course, but its serving a very similar purpose to this and needs to be moved here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_bool_indexer seems to accept list-like, while this explicitly needs an ndarray since it checks indexer[indexer]. Given that I need to do isinstance(indexer, ndarray) anyway, to ensure that that boolean masking works, to tradeoff is

            if not (isinstance(indexer, np.ndarray) and
                    indexer.dtype == np.bool_ and
                    len(indexer[indexer]) == len(value)):
 

vs.

            if not (isinstance(indexer, np.ndarray) and
                    com.is_bool_indexer(indexer) and
                    len(indexer[indexer]) == len(value)):

Since the first is likely to be faster, I'd prefer to just go with that.

@@ -3489,7 +3484,8 @@ def apply(self, f, axes=None, filter=None, do_integrity_check=False,
# with a .values attribute.
aligned_args = dict((k, kwargs[k])
for k in align_keys
if hasattr(kwargs[k], 'values'))
if hasattr(kwargs[k], 'values') and
not isinstance(kwargs[k], ABCExtensionArray))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty special casey here. shouldn't this check for ._values?

@jreback jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label Mar 25, 2018
@TomAugspurger
Copy link
Contributor Author

I had to remove the _values alias for now, as factorize uses the getattr(values, '_values', values) pattern to unbox a Series / Index.

@jreback
Copy link
Contributor

jreback commented Apr 1, 2018

can you rebase

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 2, 2018 via email

@jreback
Copy link
Contributor

jreback commented Apr 14, 2018

@TomAugspurger

  • let's rebase one more time.
  • doc-strings looks good, but pls give a once over
  • I would collect all the issues that are EA and add onto the sub-section in whatsnew (just list them)

merge when ready-

@jreback jreback merged commit 1e4e04b into pandas-dev:master Apr 16, 2018
@jreback
Copy link
Contributor

jreback commented Apr 16, 2018

thanks @TomAugspurger

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants