BUG: IntervalIndex.get_loc/get_indexer wrong return value / error #25090

samuelsinayoko · 2019-02-02T12:28:36Z

closes Regression in 0.24: TypeError exception when using dropna on dataframe with categorical index #25087
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Revert pandas-dev#24048 change that caused bug.

codecov · 2019-02-02T13:03:39Z

Codecov Report

Merging #25090 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #25090   +/-   ##
=======================================
  Coverage   92.37%   92.37%           
=======================================
  Files         166      166           
  Lines       52420    52420           
=======================================
  Hits        48423    48423           
  Misses       3997     3997

Flag	Coverage Δ
#multiple	`90.79% <100%> (ø)`	⬆️
#single	`42.88% <100%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`96.82% <100%> (ø)`	⬆️

codecov · 2019-02-02T13:03:40Z

Codecov Report

Merging #25090 into master will increase coverage by 0.41%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25090      +/-   ##
==========================================
+ Coverage   91.48%   91.89%   +0.41%     
==========================================
  Files         175      175              
  Lines       52885    52495     -390     
==========================================
- Hits        48380    48241     -139     
+ Misses       4505     4254     -251

Flag	Coverage Δ
#multiple	`90.45% <100%> (+0.4%)`	⬆️
#single	`40.74% <0%> (-1.09%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/interval.py	`95.56% <100%> (+0.3%)`	⬆️
pandas/io/clipboard/__init__.py	`39.21% <0%> (-17.65%)`	⬇️
pandas/io/clipboard/clipboards.py	`18.51% <0%> (-12.07%)`	⬇️
pandas/plotting/_compat.py	`83.33% <0%> (-4.17%)`	⬇️
pandas/core/config_init.py	`96.96% <0%> (-2.24%)`	⬇️
pandas/plotting/_style.py	`77.17% <0%> (-0.49%)`	⬇️
pandas/compat/numpy/__init__.py	`92.85% <0%> (-0.48%)`	⬇️
pandas/core/groupby/grouper.py	`98.18% <0%> (-0.35%)`	⬇️
pandas/core/computation/pytables.py	`90.24% <0%> (-0.31%)`	⬇️
pandas/plotting/_misc.py	`38.46% <0%> (-0.23%)`	⬇️
... and 93 more

Write tests first

When tested with a variable that has the wrong dtype, this raises an exception instead of False

When supplied a key of the wrong type, get_loc should raise a KeyError (not a TypeError). Otherwise things like checking whether a value is in an index will fail.
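To illustrate why the exception type matters for membership checks, here is a minimal sketch. `ToyIntervalIndex` is a hypothetical stand-in written for this example, not pandas' `IntervalIndex`; it only assumes that containment is implemented by calling `get_loc` and catching `KeyError`.

```python
from bisect import bisect_left


class ToyIntervalIndex:
    """Hypothetical stand-in, not pandas' IntervalIndex."""

    def __init__(self, breaks):
        # breaks [0, 1, 2] describes the right-closed intervals (0, 1] and (1, 2].
        self._breaks = list(breaks)

    def get_loc(self, key):
        try:
            # Comparing e.g. a str key with numeric breaks raises TypeError.
            pos = bisect_left(self._breaks, key) - 1
        except TypeError:
            # A key of the wrong type is treated as "not found".
            raise KeyError(key)
        if not 0 <= pos < len(self._breaks) - 1:
            raise KeyError(key)
        return pos

    def __contains__(self, key):
        # Membership relies on get_loc signalling a miss with KeyError only.
        try:
            self.get_loc(key)
        except KeyError:
            return False
        return True


idx = ToyIntervalIndex([0, 1, 2])
print(0.5 in idx)   # True
print("gg" in idx)  # False, rather than an uncaught TypeError
```

If `get_loc` let the `TypeError` escape, the `in` check would raise instead of returning False, which is the failure mode described above.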

pep8speaks · 2019-02-02T16:24:20Z

Hello @samuelsinayoko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-10 07:20:10 UTC

This is enough for making the test pass but it's not the right implementation

pandas/core/frame.py

rs2 · 2019-02-02T19:01:50Z

pandas/core/indexes/interval.py

+            try:
+                start, stop = self._find_non_overlapping_monotonic_bounds(key)
+            except TypeError:
+                # get loc should raise KeyError


Typo: .get_loc().

rs2 · 2019-02-02T19:12:51Z

pandas/core/indexes/interval.py

+                # get loc should raise KeyError
+                # if key is hashable but
+                # of an incorrect type
+                raise KeyError


This needs to be raise from, which is a Python 3 construct. To stay version-agnostic use six.raise_from.

Suggested change

-                raise KeyError
+import six
+...
+            try:
+                start, stop = self._find_non_overlapping_monotonic_bounds(key)
+            except TypeError as exc:
+                six.raise_from(KeyError('Key is hashable, but of an incorrect type'), exc)

Is six.raise_from really necessary? I don't see us doing this anywhere else in the codebase?

we don't use python 3 only constructs yet, the existing is ok.

OK, I've left the code as is but have improved the reported message as suggested by @rs2
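For readers unfamiliar with the construct being discussed, the following is a small sketch of Python 3 exception chaining with `raise ... from`. The helper `find_bound` is hypothetical and exists only for this illustration; `six.raise_from`, mentioned above, is the Python 2/3 compatible spelling of the same idea.

```python
def find_bound(bounds, key):
    """Hypothetical helper: position of the first bound that is >= key."""
    try:
        for i, bound in enumerate(bounds):
            if key <= bound:  # a str key against int bounds raises TypeError
                return i
    except TypeError as exc:
        # Re-raise as KeyError while keeping the original error as __cause__.
        raise KeyError(key) from exc
    raise KeyError(key)


try:
    find_bound([1, 2, 3], "gg")
except KeyError as err:
    cause = err.__cause__
    print(type(cause).__name__)  # TypeError
```

The caller sees a `KeyError`, and the traceback still shows the underlying `TypeError` as the direct cause.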

rs2 · 2019-02-02T19:13:33Z

pandas/core/indexes/interval.py

+                )
+            except TypeError:
+                # This is probably wrong
+                # but not sure what I should do here


¯\_(ツ)_/¯

rs2 · 2019-02-02T19:14:06Z

pandas/core/indexes/interval.py

+            except TypeError:
+                # This is probably wrong
+                # but not sure what I should do here
+                return np.array([-1])


Please comment on the choice of -1.

This needs to be the same length as target and of intp dtype: np.repeat(np.intp(-1), len(target))

rs2 · 2019-02-02T19:17:00Z

xref: #25091, #25087

jschendel · 2019-02-02T19:51:32Z

pandas/core/indexes/interval.py

+            except TypeError:
+                # This is probably wrong
+                # but not sure what I should do here
+                return np.array([-1])


This needs to be the same length as target and of intp dtype: np.repeat(np.intp(-1), len(target))

jschendel · 2019-02-02T19:53:18Z

pandas/core/indexes/interval.py

+                # get loc should raise KeyError
+                # if key is hashable but
+                # of an incorrect type
+                raise KeyError


Is six.raise_from really necessary? I don't see us doing this anywhere else in the codebase?

jschendel · 2019-02-02T19:57:37Z

pandas/tests/indexes/interval/test_interval.py

@@ -886,6 +886,13 @@ def test_symmetric_difference(self, closed, sort):
        result = index.symmetric_difference(other, sort=sort)
        tm.assert_index_equal(result, expected)

+    def test_interval_range_get_indexer_with_different_input_type(self):


can you rename to test_get_indexer_errors and move to around line 618 where the other get_indexer tests are?

jschendel · 2019-02-02T19:59:52Z

pandas/tests/indexes/interval/test_interval.py

@@ -886,6 +886,13 @@ def test_symmetric_difference(self, closed, sort):
        result = index.symmetric_difference(other, sort=sort)
        tm.assert_index_equal(result, expected)

+    def test_interval_range_get_indexer_with_different_input_type(self):
+        # not sure about this one
+        index = pd.interval_range(0, 1)


Can you parametrize over index and include a non-monotonic/overlapping IntervalIndex, e.g. pd.IntervalIndex.from_tuples([(1, 3), (2, 4), (0, 2)])?

jschendel · 2019-02-02T20:00:34Z

pandas/tests/indexes/interval/test_interval.py

+        index = pd.interval_range(0, 1)
+        # behaviour should be the same as Int64Index and return an
+        # array with values of -1
+        assert np.all(index.get_indexer(['gg']) == np.array([-1]))


use tm.assert_numpy_array_equal and make sure your expected is 'intp' dtype.

jschendel · 2019-02-02T20:07:28Z

pandas/tests/indexing/test_loc.py

+        """ GH25087, test get_loc returns key error for interval indexes"""
+        idx = pd.interval_range(0, 1.0)
+        with pytest.raises(KeyError):
+            idx.get_loc('gg')


Instead of testing here, can you add this as a test case to test_get_loc_value in pandas/tests/indexes/interval/test_interval.py:

pandas/pandas/tests/indexes/interval/test_interval.py

Line 416 in a09a07e

def test_get_loc_value(self):

Good spot, I must say I wasn't completely clear about the distinction between indexes and indexing with regards to tests.
I've implemented your suggestion in 93f75ea.

jschendel · 2019-02-02T20:14:26Z

pandas/core/indexes/interval.py

@@ -766,8 +766,13 @@ def get_loc(self, key, method=None):
                key = Interval(left, right, key.closed)
            else:
                key = self._maybe_cast_slice_bound(key, 'left', None)
-
-            start, stop = self._find_non_overlapping_monotonic_bounds(key)
+            try:


You'll also need to do something similar in the else branch to cover the overlapping/non-monotonic case, e.g. I think something like pd.IntervalIndex.from_tuples([(1, 3), (2, 4), (0, 2)]).get_loc('foo') will still fail.

Maybe it should be the engine that should properly raise a KeyError? (eg the int64 engine does that)

I still need to look into @jorisvandenbossche's comment on raising the error in the engine itself (especially if that's the behaviour for int64), but I think I've addressed everything else.

@jorisvandenbossche: There is code in place within the engine that raises a KeyError, but string queries fail before it gets there since the engine is expecting a scalar_t type (fused type consisting of numeric types) for key:

pandas/pandas/_libs/intervaltree.pxi.in

Line 104 in 2e38d55

def get_loc(self, scalar_t key):

I'm not super well versed in Cython. Is there a graceful way to force this to raise a KeyError within the Cython code? Removing the scalar_t type gets a step further but still raises a TypeError as the code expects things to be comparable (probably some perf implications to removing it too).

Yeah, the other engines have the key as object typed, and then afterwards do a check of that.
But for me fine as well to leave that for now, and do the check here in the level above that. But on the long term would still be good to make the behaviour consistent throughout the different engines.
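The "accept an object key, then check it" approach can be sketched in plain Python as follows. `ToyIntervalEngine` is hypothetical and is not the Cython interval tree; the `numbers.Real` check is an assumption standing in for whatever validation a real engine would perform.

```python
from numbers import Real


class ToyIntervalEngine:
    """Hypothetical engine over right-closed intervals (left[i], right[i]]."""

    def __init__(self, left, right):
        self._left = list(left)
        self._right = list(right)

    def get_loc(self, key):
        # Accept any object and validate it here, instead of restricting the
        # signature to numeric types and failing with TypeError at call time.
        if not isinstance(key, Real):
            raise KeyError(key)
        matches = [
            i
            for i, (lo, hi) in enumerate(zip(self._left, self._right))
            if lo < key <= hi
        ]
        if not matches:
            raise KeyError(key)
        return matches


# Same overlapping, non-monotonic intervals as in the review comment above.
engine = ToyIntervalEngine([1, 2, 0], [3, 4, 2])
print(engine.get_loc(2.5))  # [0, 1]
```

With this shape, `engine.get_loc('foo')` raises `KeyError` directly, so the calling layer would not need its own `except TypeError` wrapper.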

jschendel · 2019-02-02T20:24:43Z

Thanks! To provide some additional context here: the indexing methods for IntervalIndex are currently in flux and a bit of a mess. There are specs to change the behavior (xref #16316) that I've been working through, albeit slowly due to not having much free time lately. In the process of doing so I've also been addressing stuff like this, removing the need for separate is_non_overlapping_monotonic branches, and generally cleaning things up.

That being said, this is still something that'd be viable for a 0.24.x change, since the new specs require breaking changes and would need to go in a major release (aiming for 0.25.0).

jreback · 2019-02-02T22:28:59Z

pandas/core/indexes/interval.py

+                # get loc should raise KeyError
+                # if key is hashable but
+                # of an incorrect type
+                raise KeyError


we don't use python 3 only constructs yet, the existing is ok.

Include some non-monotonic/overlapping IntervalIndex. This triggers another bug, due to the fact that self.get_loc(i) is called on an unexpected key.

pandas/tests/indexes/interval/test_interval.py

jorisvandenbossche · 2019-02-03T22:17:47Z

pandas/core/indexes/interval.py

@@ -766,8 +766,13 @@ def get_loc(self, key, method=None):
                key = Interval(left, right, key.closed)
            else:
                key = self._maybe_cast_slice_bound(key, 'left', None)
-
-            start, stop = self._find_non_overlapping_monotonic_bounds(key)
+            try:


Maybe it should be the engine that should properly raise a KeyError? (eg the int64 engine does that)

Move the test from indexing/test_loc to index/interval/test_interval.

target was missing from call to _find_non_overlapping_monotonic_bounds

jorisvandenbossche · 2019-02-04T07:45:03Z

pandas/core/indexes/interval.py

+            try:
+                start, stop = self._find_non_overlapping_monotonic_bounds(key)
+            except TypeError:
+                # get_loc should raise KeyError


Can you add here a comment as Tom proposed (TODO(py3): use raise from.) ?

you can now use raise from (as we are PY3 only)

jorisvandenbossche · 2019-02-04T07:46:56Z

pandas/core/indexes/interval.py

+                start, stop = self._find_non_overlapping_monotonic_bounds(key)
+            except TypeError:
+                # get_loc should raise KeyError
+                raise KeyError('key is hashable but of incorrect type')


I think for key errors we typically just pass the key itself as message, so might be good to be consistent with that. Or at least I would include the key in the message, something like: "Key {0} is of the incorrect type".format(key)

jorisvandenbossche · 2019-02-04T07:47:41Z

pandas/core/indexes/interval.py

+                try:
+                    return self._engine.get_loc(key)
+                except TypeError:
+                    raise KeyError('No engine for key {!r}'.format(key))


No need to mention "engine" here (that is something internal to pandas, while this error message will be visible for users)

jorisvandenbossche · 2019-02-04T07:51:09Z

pandas/core/indexes/interval.py

+                    self._find_non_overlapping_monotonic_bounds(target)
+                )
+            except TypeError:
+                return np.repeat(np.int(-1), len(target))


@jschendel I am not fully sure this will cover all cases (though neither did what was there before).

target can in principle be a mixture of valid and invalid keys. So maybe the easiest would be to fall back to the else branch that iterates through the elements separately in case this raises a TypeError.
That could be done by putting a pass here, and putting the return value of three lines below in the else part of a try/except/else.

Sorry, getting a bit confused here. You're saying that we should pass on line 833 here and let the code run to the else branch starting on line 851 (non IntervalIndex), where it loops over each element and appends -1 to the list if a KeyError is raised (I would probably add TypeError too)?

Yes, that is basically what I meant I think.

So that if you do idx.get_indexer(['a', 1]) (where 1 is a valid key), you get [-1, 1] as result instead of [-1, -1] (assuming that idx.get_loc(1) would return 1)

OK I've just pushed a commit implementing this. I had to make a few tweaks to the "non IntervalIndex" else branch (starting line 849 in core/indexes/interval.py) to make my two new tests pass, but hopefully haven't broken anything. See 2c48272
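The fallback discussed in this thread can be sketched as below. This is a plain-Python illustration under stated assumptions, not the code in the commit: `_position` and `get_indexer` are hypothetical functions, and a list comprehension stands in for the vectorized fast path that fails as a whole when one key is not comparable.

```python
from bisect import bisect_left


def _position(breaks, key):
    """Position of the right-closed interval containing key, or -1 if none."""
    pos = bisect_left(breaks, key) - 1  # TypeError if key is not comparable
    return pos if 0 <= pos < len(breaks) - 1 else -1


def get_indexer(breaks, target):
    try:
        # Stand-in for the fast path: a single bad key fails the whole batch.
        positions = [_position(breaks, key) for key in target]
    except TypeError:
        pass  # fall through to the per-element loop below
    else:
        return positions
    result = []
    for key in target:
        try:
            result.append(_position(breaks, key))
        except TypeError:
            # Only the offending key is marked as missing.
            result.append(-1)
    return result


breaks = [0, 1, 2]  # intervals (0, 1] and (1, 2]
print(get_indexer(breaks, ["a", 1.5]))       # [-1, 1]
print(get_indexer(breaks, [0.5, "a", 1.5]))  # [0, -1, 1]
```

The try/except/else layout keeps the fast path's return value out of the `try` body, so only the lookup itself is guarded.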

pandas/tests/indexes/interval/test_interval.py

jorisvandenbossche · 2019-02-04T07:53:33Z

pandas/tests/indexes/interval/test_interval.py

+    ])
+    def test_get_indexer_errors(self, index):
+        expected = np.array([-1], dtype='intp')
+        assert tm.assert_numpy_array_equal(index.get_indexer(['gg']), expected)


Can you here also test multiple values and a mixture of values? Like get_indexer(['a', 'b']) and get_indexer([1, 'a'])

Can you do this one?

Sure will look into this over the weekend.

see 02127ff

jorisvandenbossche · 2019-02-04T07:55:14Z

pandas/tests/indexes/interval/test_interval.py

@@ -435,6 +435,14 @@ def test_get_loc_value(self):
        idx = IntervalIndex.from_arrays([0, 2], [1, 3])
        pytest.raises(KeyError, idx.get_loc, 1.5)

+        # GH25087, test get_loc returns key error for interval indexes


Can you put this in a new test? (you can leave it in this place, but just put a def test_get_loc_invalid_key(self) above this line)
Reason is that the other test is commented to be replaced, but this new test we want to keep.

Can you do this one?

Have got a few broken tests to fix. Hoping to make the build green in the coming days.

Instead of returning [-1, -1, -1] when the middle value is incorrect type, return [a, -1, b].

Add mix of invalid and valid values

Fixes test_with_overlaps test

interval.get_indexer() should still raise a TypeError in cases where the types are unorderable. This is needed for DataFrame.append for example, which was breaking tests in test_concat.

samuelsinayoko · 2019-02-16T17:49:52Z

Any tips on making the tests pass on macOS, windows and Linux py27?

TomAugspurger · 2019-03-07T17:59:24Z

@samuelsinayoko do those tests fail for you locally? If not, I can take a look later.

samuelsinayoko · 2019-03-08T10:17:03Z

will take another look at this tomorrow

samuelsinayoko · 2019-03-26T08:14:58Z

@TomAugspurger yes the tests pass locally on my linux machine. Not sure what's happening on Windows and OSX, so would be grateful for any help on this.

WillAyd · 2019-04-10T05:04:12Z

@samuelsinayoko can you merge master? Recently dropped Py2 support so should make things easier

samuelsinayoko · 2019-04-10T06:25:47Z

Ok will do

jreback · 2019-06-08T20:20:17Z

@samuelsinayoko can you merge master and update

jreback · 2019-06-27T22:15:36Z

closing as stale, if you'd like to continue, pls ping.

samuelsinayoko added 3 commits February 2, 2019 12:25

Revert earlier change and use to_numpy

2aca389

Revert pandas-dev#24048 change that caused bug.

Remove warning by using to_numpy

2d12d2f

Make warning go away

f357101

samuelsinayoko added 5 commits February 2, 2019 15:56

revert initial bugfix

7cc4c37

Write tests first

Add test for contains in interval index categorical

8ec653a

When tested with a variable that has the wrong dtype, this raises an exception instead of False

Check get_loc on interval index raises KeyError

878802e

When supplied a key of the wrong type, get_loc should raise a KeyError (not a TypeError). Otherwise things like checking whether a value is in an index will fail.

Add test for get_indexer

4dabe0e

Add a test for get_indexer with different type

564d88d

samuelsinayoko added 2 commits February 2, 2019 16:30

Make first two tests pass

f4c43e3

Make third test pass

a09a07e

This is enough for making the test pass but it's not the right implementation

samuelsinayoko mentioned this pull request Feb 2, 2019

Regression in 0.24: TypeError exception when using dropna on dataframe with categorical index #25087

Closed

jorisvandenbossche changed the title ~~23264~~ BUG: IntervalIndex.get_loc/get_indexer wrong return value / error Feb 2, 2019

rs2 suggested changes Feb 2, 2019

jschendel suggested changes Feb 2, 2019

jschendel added the Indexing and Interval labels Feb 2, 2019

jreback requested changes Feb 2, 2019

samuelsinayoko added 4 commits February 3, 2019 18:46

remove commented out code

6c887e6

Improve error message

0a143f2

Rename, move and parametrize indexer test

0730cd6

Include some non-monotonic/overlapping IntervalIndex. This triggers another bug, due to the fact that self.get_loc(i) is called on an unexpected key.

Use numpy_array_equal in indexer test

246eb57

jorisvandenbossche reviewed Feb 3, 2019

samuelsinayoko added 2 commits February 4, 2019 07:02

Refactor interval index get_loc test

93f75ea

Move the test from indexing/test_loc to index/interval/test_interval.

Fix bug introduced in earlier commit

268db81

target was missing from call to _find_non_overlapping_monotonic_bounds

jorisvandenbossche reviewed Feb 4, 2019

samuelsinayoko added 4 commits February 6, 2019 07:51

Add reminder comment to use raise from for python 3

a5aa1e8

Include key in error message.

d480872

Add larger interval range to test

6ed1080

get_loc should raise KeyError if the supplied key has the wrong type

120e2bc

jorisvandenbossche added this to the 0.24.2 milestone Feb 7, 2019

samuelsinayoko added 4 commits February 10, 2019 17:45

Only return -1 in get_indexer for incorrect values

2c48272

Instead of returning [-1, -1, -1] when the middle value is incorrect type, return [a, -1, b].

Better tests for get_indexer_errors

02127ff

Add mix of invalid and valid values

Fix broken test in test_interval

ad13d9e

Fixes test_with_overlaps test

Fix broken tests in test_concat

0ff356c

interval.get_indexer() should still raise a TypeError in cases where the types are unorderable. This is needed for DataFrame.append for example, which was breaking tests in test_concat.

jreback modified the milestones: 0.24.2, Contributions Welcome Mar 3, 2019

Merge remote-tracking branch 'upstream/master' into 23264

9f6b5c0

jreback closed this Jun 27, 2019
