ENH: ExtensionArray.setitem #19907

TomAugspurger · 2018-02-26T16:05:47Z

xref #19696

Adds ExtensionBlock.__setitem__.

This is only tested with DecimalArray and Categorical. Supporting something like JSONArray, where the "scalar" elements are actual sequences, runs into issues elsewhere in internals. Maybe someday we can support that.

TomAugspurger · 2018-02-26T16:06:17Z

pandas/core/internals.py

@@ -874,22 +875,7 @@ def setitem(self, indexer, value, mgr=None):
        values = transf(values)

        # length checking
-        # boolean with truth values == len of the value is ok too


This was refactored out to pandas/util/_validators since I needed it in ExtensionArray.setitem.

TomAugspurger · 2018-02-26T16:07:02Z

pandas/core/internals.py

@@ -3489,7 +3484,8 @@ def apply(self, f, axes=None, filter=None, do_integrity_check=False,
        # with a .values attribute.
        aligned_args = dict((k, kwargs[k])
                            for k in align_keys
-                            if hasattr(kwargs[k], 'values'))
+                            if hasattr(kwargs[k], 'values') and
+                            not isinstance(kwargs[k], ABCExtensionArray))


If an ExtensionArray chooses to store its data as .values, setitem would be broken without this extra check.

Should we make a test for this? (e.g. call the underlying data .values in one of the example test arrays?)

DecimalArray calls its underlying data .values. I'm going to alias a few other attributes to that (.data, ._data, .items). Any others?

this is pretty special casey here. shouldn't this check for ._values?

I'm not sure. I don't really know what could be in kwargs. You think it's only ever Index or Series? Or could it be a dataframe or block?
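To make the concern in this thread concrete, here is a minimal pure-Python sketch of the duck-typing problem. The classes `FakeSeries` and `FakeExtensionArray` and the helper `select_alignable` are hypothetical stand-ins, not pandas code; the real check lives in the `apply` method shown in the diff above.

```python
class FakeSeries:
    """Stand-in for a Series/Index: a container whose payload is .values."""
    def __init__(self, values):
        self.values = values


class FakeExtensionArray:
    """Hypothetical array that happens to store its data as .values."""
    def __init__(self, values):
        self.values = values


def select_alignable(kwargs, align_keys, skip_types=()):
    """Pick the kwargs that look like containers needing alignment."""
    return {
        k: kwargs[k]
        for k in align_keys
        if hasattr(kwargs[k], "values")
        and not isinstance(kwargs[k], skip_types)
    }


kwargs = {"value": FakeExtensionArray([1, 2]), "mask": FakeSeries([True, False])}

# hasattr alone also matches the array, which is not a container to align.
naive = select_alignable(kwargs, ["value", "mask"])

# Excluding the array type keeps only the real container.
guarded = select_alignable(kwargs, ["value", "mask"],
                           skip_types=(FakeExtensionArray,))
```

With only the `hasattr` test, both entries are selected; with the extra `isinstance` exclusion, only `mask` is.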

codecov · 2018-02-27T02:22:42Z

Codecov Report

❗ No coverage uploaded for pull request base (master@0bd8a5a).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #19907   +/-   ##
=========================================
  Coverage          ?   91.84%           
=========================================
  Files             ?      153           
  Lines             ?    49289           
  Branches          ?        0           
=========================================
  Hits              ?    45269           
  Misses            ?     4020           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`90.23% <100%> (?)`
#single	`41.9% <42.1%> (?)`

Impacted Files	Coverage Δ
pandas/core/internals.py	`95.53% <100%> (ø)`
pandas/core/frame.py	`97.16% <100%> (ø)`
pandas/core/indexing.py	`93.08% <100%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0bd8a5a...3cbe078.

jreback · 2018-02-27T11:37:20Z

pandas/core/internals.py

@@ -1898,6 +1884,15 @@ def is_view(self):
        """Extension arrays are never treated as views."""
        return False

+    def setitem(self, indexer, value, mgr=None):
+        if isinstance(indexer, tuple):


can you add a doc-string

jreback · 2018-02-27T11:38:19Z

pandas/util/_validators.py

+    """
+    from pandas.core.indexing import length_of_indexer
+
+    # boolean with truth values == len of the value is ok too


this is applicable in .where (in internals, but maybe also in frame.py). you may want to rename / clean that up.

I didn't see anywhere in Block.where where similar logic was used.

hmm maybe I cleaned this up a while back. In any event this shouldn't be here at all. _validators is not the correct place. this is purely indexing validation. goes in core/indexing.py

I can't really say anything about where it belongs (I would have to look into it in more detail), but note that it was copied straight from internals.py, and it is still called there.
So in that sense this PR is just keeping the situation as it was.

jorisvandenbossche · 2018-02-27T12:18:53Z

pandas/core/internals.py

@@ -3489,7 +3484,8 @@ def apply(self, f, axes=None, filter=None, do_integrity_check=False,
        # with a .values attribute.
        aligned_args = dict((k, kwargs[k])
                            for k in align_keys
-                            if hasattr(kwargs[k], 'values'))
+                            if hasattr(kwargs[k], 'values') and
+                            not isinstance(kwargs[k], ABCExtensionArray))


Should we make a test for this? (e.g. call the underlying data .values in one of the example test arrays?)

jorisvandenbossche · 2018-02-27T12:33:41Z

pandas/tests/extension/base/setitem.py

+    def test_setitem_expand_columns(self, data):
+        df = pd.DataFrame({"A": data})
+        df['B'] = 1
+        assert len(df.columns) == 2


Can you add a test here for expanding with data ?

df = pd.DataFrame({"A": [1]*len(data)})
df['B'] = data

and also one that overwrites an existing column with data

And for both, we should test with both df['col'] and df.loc[:, 'col']

jreback · 2018-03-01T00:51:08Z

pandas/core/internals.py

+            The subset of self.values to set
+        value : object
+            The value being set
+        mgr : BlockPlacement


jreback · 2018-03-01T00:51:51Z

pandas/core/internals.py

+            The subset of self.values to set
+        value : object
+            The value being set
+        mgr : BlockPlacement


jreback · 2018-03-01T00:54:04Z

pandas/core/internals.py

+        'indexer' is a direct slice/positional indexer. 'value' must
+        be a compatible shape.
+        """
+        if isinstance(indexer, tuple):


hmm, are you doing this for compatibility? this is not very robust and should be handled by the caller. indexing is quite tricky, as the contracts for who converts what are almost all handled in core/indexing.py, so the internals are sort of you-know-what-you-are-doing. Since you are exposing this almost directly, I think you need some handling routines (I am talking about the indexers here) in EA, which actually call things in core/indexing.py.

hmm, are you doing this for compatibility?

I guess, in the sense that things didn't work without it :) I think at some point, probably in indexing.py, the indexer is translated to a 2-D indexer, even though we know that EAs are always 1-D.

Since you are exposing this almost directly

I don't think the call stack is any different for setting on an ExtensionBlock vs. anything else.

I think you need some handling routines (I am talking about the indexers here) in EA

IIUC, you're saying that the translation from a 2-D indexer to a 1-D indexer should happen elsewhere, yes? And the validation too? That sounds reasonable.

@jreback what place in core/indexing.py did you have in mind? _setitem_with_indexer?

I don't see anywhere there that deals with specific blocks, just whether or not the block manager is holding mixed types (split path or not).
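For readers following the 2-D versus 1-D indexer discussion, a minimal sketch of the translation being debated might look like the following. The helper name `to_1d_indexer` is hypothetical, and the assumption that the first tuple element addresses the array's single axis is mine for illustration; a plain list stands in for the 1-D extension array.

```python
def to_1d_indexer(indexer):
    """Reduce a block-level tuple indexer to the 1-D part an EA can use."""
    if isinstance(indexer, tuple):
        # Assumption: the first element addresses the array's single axis.
        return indexer[0]
    return indexer


# A plain list stands in for a 1-D extension array.
data = ["a", "b", "c", "d"]

# A 2-D style indexer as the indexing layer might hand it to a block.
rows = to_1d_indexer((slice(0, 2), 0))
data[rows] = ["x", "y"]
```

Whether this unwrapping belongs in the block or in the caller is exactly the open question in the thread; the sketch only shows what the unwrapping does.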

jreback · 2018-03-01T01:04:32Z

pandas/util/_validators.py

+    """
+    from pandas.core.indexing import length_of_indexer
+
+    # boolean with truth values == len of the value is ok too


hmm maybe I cleaned this up a while back. In any event this shouldn't be here at all. _validators is not the correct place. this is purely indexing validation. goes in core/indexing.py

jorisvandenbossche · 2018-03-01T08:39:58Z

pandas/tests/extension/json/test_json.py

@@ -71,3 +71,6 @@ def test_value_counts(self, all_data, dropna):

 class TestCasting(base.BaseCastingTests):
    pass
+
+# We intentionally don't run base.BaseSetitemTests because pandas'
+# internals has trouble setting sequences of values into scalar positions.


you can also add the test class as above, but skip the full class (I think the pytest skip decorators also work for classes)

I've been scared off mixing inheritance and skipping classes since seeing pytest-568, where skipping in a child marks all the other children as skips.

that's a good reason :-)

jorisvandenbossche · 2018-03-01T08:41:23Z

pandas/util/_validators.py

+    """
+    from pandas.core.indexing import length_of_indexer
+
+    # boolean with truth values == len of the value is ok too


I can't really say anything about where it belongs (I would have to look into it in more detail), but note that it was copied straight from internals.py, and it is still called there.
So in that sense this PR is just keeping the situation as it was.

TomAugspurger · 2018-03-21T11:52:40Z

This currently fails when the repr is triggered, not in the setitem as it should:

df = pd.DataFrame({"B": [1, 2, 3]})
df['A'] = IPArray([1, 2])
df._repr_html_()

TomAugspurger · 2018-03-21T12:15:45Z

What would we expect the type of this to be

df = pd.DataFrame({"A": IPArray([1, 2, 3])})
df.loc[:, 'A'] = 1

This results in an integer dtype column. I think that's correct, right? The previous block is being entirely replaced? Then we have symmetry with df['A'] = 1.

TomAugspurger · 2018-03-21T15:42:06Z

One unhandled case

In [3]: s = pd.Series(IPArray([1, 2]), index=[(0, 1), (1, 2)])

In [4]: s[(0, 1)] = 2

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/sandbox/pandas-ip/pandas/pandas/core/series.py in setitem(key, value)
    871             try:
--> 872                 self._set_with_engine(key, value)
    873                 return

~/sandbox/pandas-ip/pandas/pandas/core/series.py in _set_with_engine(self, key, value)
    930         try:
--> 931             self.index._engine.set_value(values, key, value)
    932             return

TypeError: Argument 'arr' has incorrect type (expected numpy.ndarray, got IPArray)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-4-6c3a0d7e7408> in <module>()
----> 1 s[(0, 1)] = 2

~/sandbox/pandas-ip/pandas/pandas/core/series.py in __setitem__(self, key, value)
    922         # do the setitem
    923         cacher_needs_updating = self._check_is_chained_assignment_possible()
--> 924         setitem(key, value)
    925         if cacher_needs_updating:
    926             self._maybe_update_cacher()

~/sandbox/pandas-ip/pandas/pandas/core/series.py in setitem(key, value)
    904                 if (isinstance(key, tuple) and
    905                         not isinstance(self.index, MultiIndex)):
--> 906                     raise ValueError("Can only tuple-index with a MultiIndex")
    907
    908                 # python 3 type errors should be raised

ValueError: Can only tuple-index with a MultiIndex

The first exception is the problem. That Cython method expects an ndarray, but we may be giving it either an ndarray or an ExtensionArray.

All our extension array setitem tests were hitting this. We only caught it because the tuple case is the only one that re-raises.
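The control flow visible in the traceback can be summarised with a small sketch. This is a simplification, not the actual Series.__setitem__ code: `set_with_fallback`, `engine_set_value`, and `generic_set` are hypothetical names, and the real method handles more exception types.

```python
def set_with_fallback(key, value, fast_path, slow_path, index_is_multi=False):
    """Try the fast path first; fall back to the slow path on TypeError."""
    try:
        fast_path(key, value)
        return "fast"
    except TypeError:
        # Tuple keys on a non-MultiIndex are the only case that re-raises,
        # so they are the only case where the failure becomes visible.
        if isinstance(key, tuple) and not index_is_multi:
            raise ValueError("Can only tuple-index with a MultiIndex")
    slow_path(key, value)
    return "slow"


def engine_set_value(key, value):
    # Mimics the Cython routine rejecting a non-ndarray argument.
    raise TypeError("Argument 'arr' has incorrect type")


store = {}


def generic_set(key, value):
    store[key] = value


# A scalar key silently takes the slow path; nothing signals the TypeError.
result = set_with_fallback(0, 2, engine_set_value, generic_set)
```

With a scalar key the TypeError is swallowed and the slow path succeeds, which is why the tests passed; only a tuple key turns the swallowed error into a visible ValueError.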

TomAugspurger · 2018-03-21T16:14:30Z

Running asv with 43dfd7d now.

TomAugspurger · 2018-03-21T17:20:56Z

The last commit allowing EAs in the index.engine.set* had trouble with us using DatetimeIndex as a storage container for Datetime with TZ. We tried to directly modify the DatetimeIndex with values[key] = value. Previously this wasn't hit because we raised a TypeError earlier on, sending us down a different code path. I think that someday we'll want something like 43dfd7d, but not right now.

TomAugspurger · 2018-03-21T18:28:31Z

Hmm, perhaps it's worth just punting on __setitem__ on EAs with tuples for now. It's already failing on master for non-ndarrays

In [1]: import pandas as pd

In [2]: arr = pd.date_range('2017', periods=4, tz='US/Eastern')

In [3]: s = pd.Series(arr, index=[(0, 1), (0, 2), (0, 3), (0, 4)])

In [4]: s[(0, 1)] = float('NaN')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/sandbox/pandas-ip/pandas/pandas/core/series.py in setitem(key, value)
    871             try:
--> 872                 self._set_with_engine(key, value)
    873                 return

~/sandbox/pandas-ip/pandas/pandas/core/series.py in _set_with_engine(self, key, value)
    930         try:
--> 931             self.index._engine.set_value(values, key, value)
    932             return

TypeError: Argument 'arr' has incorrect type (expected numpy.ndarray, got DatetimeIndex)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-4-0a0fdcf8dcd6> in <module>()
----> 1 s[(0, 1)] = float('NaN')

~/sandbox/pandas-ip/pandas/pandas/core/series.py in __setitem__(self, key, value)
    922         # do the setitem
    923         cacher_needs_updating = self._check_is_chained_assignment_possible()
--> 924         setitem(key, value)
    925         if cacher_needs_updating:
    926             self._maybe_update_cacher()

~/sandbox/pandas-ip/pandas/pandas/core/series.py in setitem(key, value)
    904                 if (isinstance(key, tuple) and
    905                         not isinstance(self.index, MultiIndex)):
--> 906                     raise ValueError("Can only tuple-index with a MultiIndex")
    907
    908                 # python 3 type errors should be raised

ValueError: Can only tuple-index with a MultiIndex

With this hack, I can get it working

diff --git a/pandas/core/series.py b/pandas/core/series.py
index e48012420..cd412844e 100644
--- a/pandas/core/series.py
+++ b/pandas/core/series.py
@@ -869,6 +869,18 @@ class Series(base.IndexOpsMixin, generic.NDFrame):
 
         def setitem(key, value):
             try:
+                if self._data.blocks[0].is_extension:
+                    raise TypeError
                 self._set_with_engine(key, value)
                 return
             except com.SettingWithCopyError:

I don't think that's worthwhile including to support tuples in the index. I'd prefer to see a proper fix for whatever is going wrong there.

jorisvandenbossche · 2018-03-23T10:50:38Z

perhaps it's worth just punting on setitem on EAs with tuples for now. It's already failing on master for non-ndarrays

No problem with that for now.

[on df.loc[:, 'A'] = 1] This results in an integer dtype column. I think that's correct, right? The previous block is being entirely replaced? Then we have symmetry with df['A'] = 1.

I thought we somehow handled df.loc[:, 'A'] = 1 and df['A'] = 1 differently right now, but it seems that also in case of df.loc[:, 'A'] = 1 with 'normal' dtypes like int and float, the full block gets replaced. In that case, yes, I think we want to do the same here.

jorisvandenbossche

Looks good to me!
Added some minor comments on the tests.

jorisvandenbossche · 2018-03-23T11:00:21Z

pandas/tests/extension/base/setitem.py

+        assert df.loc[10, 'B'] == data[1]
+
+    @pytest.mark.parametrize('as_callable', [True, False])
+    def test_set_mask_aligned(self, data, as_callable):


maybe call this test_setitem_.. instead of test_set_... to be consistent with above

jorisvandenbossche · 2018-03-23T11:00:41Z

pandas/tests/extension/base/setitem.py

+        assert df.loc[10, 'B'] == data[1]
+
+    @pytest.mark.parametrize('as_callable', [True, False])
+    def test_set_mask_aligned(self, data, as_callable):


or parametrize this for setitem and loc?

jorisvandenbossche · 2018-03-23T11:01:02Z

pandas/tests/extension/base/setitem.py

+        assert ser[0] == data[5]
+        assert ser[1] == data[6]
+
+    def test_set_mask_broadcast(self, data):


same here (set -> setitem)

jorisvandenbossche · 2018-03-23T11:02:31Z

pandas/tests/extension/base/setitem.py

+
+        result = df.copy()
+        result.loc[:, 'B'] = 1
+        self.assert_frame_equal(result, expected)


maybe add overwriting the existing int B column with data ?

jreback · 2018-03-25T14:11:33Z

pandas/core/frame.py

            value = value.copy()
+            # Copy donesn't have any effect at the moment


why are you adding this comment? this is on-purpose

The comment isn't clear, sorry. It's there to answer "why call value.copy() and then pass the result to a function that takes a copy parameter?" That's because __sanitize_index doesn't copy an EA, even with copy=True.

jreback · 2018-03-25T14:12:08Z

pandas/core/indexing.py

+        When the indexer is an ndarray or list and the lengths don't
+        match.
+    """
+    from pandas.core.indexing import length_of_indexer


unnecessary import

jreback · 2018-03-25T14:14:04Z

pandas/core/indexing.py

+        match.
+    """
+    from pandas.core.indexing import length_of_indexer
+


you need to assert com.is_bool_indexer here (and you can actually remove some of this logic as it duplicates).

further should actually move is_bool_indexer to pandas.core.indexing

I think com.is_bool_indexer serves a different purpose, and should already have been called typically when ending up in this place.
This helper function is only for generating informative error messages when the length does not match, whereas com.is_bool_indexer does actual conversion and inference of the indexer.

of course, but it's serving a very similar purpose to this and needs to be moved here.

is_bool_indexer seems to accept list-like, while this explicitly needs an ndarray since it checks indexer[indexer]. Given that I need to do isinstance(indexer, ndarray) anyway, to ensure that boolean masking works, the tradeoff is

if not (isinstance(indexer, np.ndarray) and indexer.dtype == np.bool_ and len(indexer[indexer]) == len(value)):

vs.

if not (isinstance(indexer, np.ndarray) and com.is_bool_indexer(indexer) and len(indexer[indexer]) == len(value)):

Since the first is likely to be faster, I'd prefer to just go with that.
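As an illustration of the length check under discussion, here is a simplified sketch built around the first of the two conditions above. The function name `check_lengths_sketch` and the scalar test are assumptions made for this example; the helper in the PR also handles slice indexers, which are omitted here.

```python
import numpy as np


def check_lengths_sketch(indexer, value):
    """Raise if a list-like `value` cannot fill the positions in `indexer`."""
    if not isinstance(indexer, (np.ndarray, list)):
        return  # other indexer kinds are not covered by this sketch
    if not hasattr(value, "__len__") or isinstance(value, str):
        return  # scalars are broadcast, so there is nothing to compare
    if len(indexer) == len(value):
        return
    # A boolean mask is also fine when its number of True entries
    # equals the length of the value.
    is_bool_mask = isinstance(indexer, np.ndarray) and indexer.dtype == np.bool_
    if is_bool_mask and len(indexer[indexer]) == len(value):
        return
    raise ValueError("cannot set using a list-like indexer "
                     "with a different length than the value")


mask = np.array([True, False, True])
check_lengths_sketch(mask, [10, 20])  # two True entries, two values: accepted
```

The `isinstance` test has to come first because `indexer[indexer]` only performs boolean masking on an ndarray, which is the point made in the comment above.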

jreback · 2018-03-25T14:16:00Z

pandas/core/internals.py

@@ -3489,7 +3484,8 @@ def apply(self, f, axes=None, filter=None, do_integrity_check=False,
        # with a .values attribute.
        aligned_args = dict((k, kwargs[k])
                            for k in align_keys
-                            if hasattr(kwargs[k], 'values'))
+                            if hasattr(kwargs[k], 'values') and
+                            not isinstance(kwargs[k], ABCExtensionArray))


this is pretty special casey here. shouldn't this check for ._values?

that was causing issues with factorize.

TomAugspurger · 2018-03-28T12:11:13Z

I had to remove the _values alias for now, as factorize uses the getattr(values, '_values', values) pattern to unbox a Series / Index.
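A small sketch shows why a `_values` alias on an array collides with this unboxing pattern. The classes `Boxed`, `PlainArray`, and `AliasedArray` are hypothetical stand-ins; only the `getattr(values, '_values', values)` expression comes from the comment above.

```python
class Boxed:
    """Stand-in for a Series/Index that wraps an array in ._values."""
    def __init__(self, array):
        self._values = array


class PlainArray:
    """Hypothetical extension array without a _values alias."""
    def __init__(self, data):
        self.data = data


class AliasedArray:
    """Hypothetical extension array that aliases its storage as _values."""
    def __init__(self, data):
        self.data = data
        self._values = data


def unbox(values):
    # The pattern quoted above: strip a Series/Index wrapper if present.
    return getattr(values, "_values", values)


plain = PlainArray([1, 2])
aliased = AliasedArray([1, 2])

unboxed_series = unbox(Boxed(plain))   # the wrapped array, as intended
unboxed_plain = unbox(plain)           # unchanged, as intended
unboxed_aliased = unbox(aliased)       # the raw storage, not the array
```

In the last case the caller loses the array object and receives its internal storage instead, which is consistent with the alias having to be removed.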

jreback · 2018-04-01T14:26:16Z

can you rebase

TomAugspurger · 2018-04-02T10:34:11Z

I think all the tests have passed and there's no merge conflict.

jreback · 2018-04-14T13:46:31Z

@TomAugspurger

let's rebase one more time.
doc-strings look good, but pls give them a once over
I would collect all the issues that are EA and add onto the sub-section in whatsnew (just list them)

merge when ready-

jreback · 2018-04-16T10:33:26Z

thanks @TomAugspurger

ENH: ExtensionArray.setitem

b985ea8

TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Feb 26, 2018

TomAugspurger added this to the 0.23.0 milestone Feb 26, 2018

TomAugspurger commented Feb 26, 2018

TomAugspurger mentioned this pull request Feb 26, 2018

ExtensionArray meta-issue #19696

Closed

15 tasks

Avoid tm.assert

9489f6c

jreback requested changes Feb 27, 2018

Docstring

79da90e

jorisvandenbossche reviewed Feb 27, 2018

TomAugspurger added 4 commits February 27, 2018 10:00

Alias for common data attributes

7f65c5a

Merge remote-tracking branch 'upstream/master' into fu1+set

8ff6168

Additional tests

274da13

Linting

35ae908

jreback requested changes Mar 1, 2018

jorisvandenbossche reviewed Mar 1, 2018

Merge remote-tracking branch 'upstream/master' into fu1+set

c768709

TomAugspurger added 3 commits March 21, 2018 07:30

Move file

76f6e86

BUG: Fixed length validation

2d5b08c

Removed print

f66c093

Fixed setitem for tuples

43dfd7d

All our extension array setitem tests were hitting this. We only caught it because only tuple reraised.

Revert setitem changes

5c1d934

TomAugspurger mentioned this pull request Mar 21, 2018

Series.__setitem__ on DatetimeTZ values with tuples in the index fails #20441

Closed

TomAugspurger added 2 commits March 21, 2018 13:33

Xfail that test

66bbe9a

Linting

f47ddf2

jorisvandenbossche reviewed Mar 23, 2018

Test updates

1e5a14c

jreback requested changes Mar 25, 2018

jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label Mar 25, 2018

TomAugspurger added 3 commits March 27, 2018 21:18

Import, comment

abe734d

Merge remote-tracking branch 'upstream/master' into fu1+set

10a3f19

Removed the _values alias

9a5b8c9

that was causing issues with factorize.

jreback approved these changes Apr 14, 2018

TomAugspurger added 2 commits April 15, 2018 13:58

Merge remote-tracking branch 'upstream/master' into fu1+set

202fae8

DOC: Fixup docstrings

3cbe078

jreback merged commit 1e4e04b into pandas-dev:master Apr 16, 2018
