-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: IntervalIndex.get_loc/get_indexer wrong return value / error #25090
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Revert pandas-dev#24048 change that caused bug.
Codecov Report
@@ Coverage Diff @@
## master #25090 +/- ##
=======================================
Coverage 92.37% 92.37%
=======================================
Files 166 166
Lines 52420 52420
=======================================
Hits 48423 48423
Misses 3997 3997
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25090 +/- ##
==========================================
+ Coverage 91.48% 91.89% +0.41%
==========================================
Files 175 175
Lines 52885 52495 -390
==========================================
- Hits 48380 48241 -139
+ Misses 4505 4254 -251
Continue to review full report at Codecov.
|
Write tests first
When tested with a variable that has the wrong dtype, this raises an exception instead of False
When supplied a variable with the wrong type get_loc should raise a KeyError (not type error). Otherwise things like checking if a variable is in an index will fail.
Hello @samuelsinayoko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-04-10 07:20:10 UTC |
This is enough for making the test pass but it's not the right implementation
pandas/core/indexes/interval.py
Outdated
try: | ||
start, stop = self._find_non_overlapping_monotonic_bounds(key) | ||
except TypeError: | ||
# get loc should raise KeyError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: .get_loc()
.
pandas/core/indexes/interval.py
Outdated
# get loc should raise KeyError | ||
# if key is hashable but | ||
# of an incorrect type | ||
raise KeyError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be raise from
, which is a Python 3 construct. To stay version-agnostic use six.raise_from
.
raise KeyError | |
import six | |
... | |
try: | |
start, stop = self._find_non_overlapping_monotonic_bounds(key) | |
except TypeError as exc: | |
six.raise_from(KeyError('Key is hashable, but of an incorrect type'), exc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is six.raise_from
really necessary? I don't see us doing this anywhere else in the codebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't use python 3 only constructs yet, the existing is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've left the code as is but have improved the reported message as suggested by @rs2
pandas/core/indexes/interval.py
Outdated
) | ||
except TypeError: | ||
# This is probably wrong | ||
# but not sure what I should do here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
¯\_(ツ)_/¯
pandas/core/indexes/interval.py
Outdated
except TypeError: | ||
# This is probably wrong | ||
# but not sure what I should do here | ||
return np.array([-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please comment on the choice of -1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be the same length as target
and of intp
dtype: np.repeat(np.intp(-1), len(target))
pandas/core/indexes/interval.py
Outdated
except TypeError: | ||
# This is probably wrong | ||
# but not sure what I should do here | ||
return np.array([-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be the same length as target
and of intp
dtype: np.repeat(np.intp(-1), len(target))
pandas/core/indexes/interval.py
Outdated
# get loc should raise KeyError | ||
# if key is hashable but | ||
# of an incorrect type | ||
raise KeyError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is six.raise_from
really necessary? I don't see us doing this anywhere else in the codebase?
@@ -886,6 +886,13 @@ def test_symmetric_difference(self, closed, sort): | |||
result = index.symmetric_difference(other, sort=sort) | |||
tm.assert_index_equal(result, expected) | |||
|
|||
def test_interval_range_get_indexer_with_different_input_type(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rename to test_get_indexer_errors
and move to around line 618 where the other get_indexer
tests are?
@@ -886,6 +886,13 @@ def test_symmetric_difference(self, closed, sort): | |||
result = index.symmetric_difference(other, sort=sort) | |||
tm.assert_index_equal(result, expected) | |||
|
|||
def test_interval_range_get_indexer_with_different_input_type(self): | |||
# not sure about this one | |||
index = pd.interval_range(0, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you parametrize over index
and include an non-monotonic/overlapping IntervalIndex
, e.g. pd.IntervalIndex.from_tuples([(1, 3), (2, 4), (0, 2)])
index = pd.interval_range(0, 1) | ||
# behaviour should be the same as Int64Index and return an | ||
# array with values of -1 | ||
assert np.all(index.get_indexer(['gg']) == np.array([-1])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use tm.assert_numpy_array_equal
and make sure your expected is 'intp'
dtype.
pandas/tests/indexing/test_loc.py
Outdated
""" GH25087, test get_loc returns key error for interval indexes""" | ||
idx = pd.interval_range(0, 1.0) | ||
with pytest.raises(KeyError): | ||
idx.get_loc('gg') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of testing here, can you add this as a test case to test_get_loc_value
in pandas/tests/indexes/interval/test_interval.py
:
def test_get_loc_value(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot, I must say I wasn't completely clear about the distinction between indexes and indexing with regards to tests.
I've implemented your suggestion in 93f75ea.
@@ -766,8 +766,13 @@ def get_loc(self, key, method=None): | |||
key = Interval(left, right, key.closed) | |||
else: | |||
key = self._maybe_cast_slice_bound(key, 'left', None) | |||
|
|||
start, stop = self._find_non_overlapping_monotonic_bounds(key) | |||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll also need to do something similar in the else
branch to cover the overlapping/non-monotonic case, e.g. I think something like pd.IntervalIndex.from_tuples([(1, 3), (2, 4), (0, 2)]).get_loc('foo')
will still fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it should be the engine
that should properly raise a KeyError? (eg the int64 engine does that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to look into @jorisvandenbossche's comment on raising the error in the engine itself (especially if that's the behaviour for int64), but I think I've addressed everything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche : There is code in place within the engine
that raises a KeyError
, but strings queries fail before it gets there since the engine is expecting a scalar_t
type (fused type consisting of numeric types) for key
:
pandas/pandas/_libs/intervaltree.pxi.in
Line 104 in 2e38d55
def get_loc(self, scalar_t key): |
I'm not super well versed in Cython. Is there a graceful way to force this to raise a KeyError
within the Cython code? Removing the scalar_t
type gets a step further but still raises a TypeError
as the code expects things to be comparable (probably some perf implications to removing it too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the other engines have the key as object
typed, and then afterwards do a check of that.
But for me fine as well to leave that for now, and do the check here in the level above that. But on the long term would still be good to make the behaviour consistent throughout the different engines.
Thanks! To provide some additional context here: the indexing methods for That being said, this is still something that'd be viable for a 0.24.x change, since the new specs require breaking changes and would need to go in a major release (aiming for 0.25.0). |
pandas/core/indexes/interval.py
Outdated
# get loc should raise KeyError | ||
# if key is hashable but | ||
# of an incorrect type | ||
raise KeyError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't use python 3 only constructs yet, the existing is ok.
Include some non-monotonic/overlapping IntervalIndex. This triggers another bug, due to the fact that self.get_loc(i) is called on an unexpected key.
@@ -766,8 +766,13 @@ def get_loc(self, key, method=None): | |||
key = Interval(left, right, key.closed) | |||
else: | |||
key = self._maybe_cast_slice_bound(key, 'left', None) | |||
|
|||
start, stop = self._find_non_overlapping_monotonic_bounds(key) | |||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it should be the engine
that should properly raise a KeyError? (eg the int64 engine does that)
Move the test from indexing/test_loc to index/interval/test_interval.
target was missing from call to _find_non_overlapping_monotonic_bounds
try: | ||
start, stop = self._find_non_overlapping_monotonic_bounds(key) | ||
except TypeError: | ||
# get_loc should raise KeyError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add here a comment as Tom proposed (TODO(py3): use raise from.
) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can now use raise from (as we are PY3 only)
pandas/core/indexes/interval.py
Outdated
start, stop = self._find_non_overlapping_monotonic_bounds(key) | ||
except TypeError: | ||
# get_loc should raise KeyError | ||
raise KeyError('key is hashable but of incorrect type') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for key errors we typically just pass the key itself as message, so might be good to be consistent with that. Or at least I would include the key in the message, something like: "Key {0} is of the incorrect type".format(key)
pandas/core/indexes/interval.py
Outdated
try: | ||
return self._engine.get_loc(key) | ||
except TypeError: | ||
raise KeyError('No engine for key {!r}'.format(key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to mention "engine" here (that is something internal to pandas, while this error message will be visible for users)
pandas/core/indexes/interval.py
Outdated
self._find_non_overlapping_monotonic_bounds(target) | ||
) | ||
except TypeError: | ||
return np.repeat(np.int(-1), len(target)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jschendel I am not fully sure this will cover all cases (but what was there before also not).
target
can in principle a mixture of valid and invalid keys. So maybe the easiest would be to fall back to the else
branch that iterates through the elements separately in case this raises a TypeError.
That could be done by putting a pass
here, and putting the return
value of three lines below in the else part of a try/except/else
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry getting a bit confused here. You're saying that we should pass on line 833 here and left the code run to the else branch starting on line 851 (non IntervalIndex) where it loops over each element and appends -1 to the list if a KeyError is raised (I would probably add TypeError too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is basically what I meant I think.
So that if you do idx.get_indexer(['a', 1])
(where 1 is a valid key), you get [-0, 1] as result instead of [-1, -1] (assuming that idx.get_loc(1)
would return 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I've just pushed a commit implementing this. I had to make a few tweaks to the "non IntervalIndex" else branch (starting line 849 in core/indexes/interval.py) to make my two new tests pass, but hopefully haven't broken anything. See 2c48272
]) | ||
def test_get_indexer_errors(self, index): | ||
expected = np.array([-1], dtype='intp') | ||
assert tm.assert_numpy_array_equal(index.get_indexer(['gg']), expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you here also test multiple values and a mixture of values? Like get_indexer(['a', 'b']
and get_indexer([1, 'a'])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure will look into this over the weekend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see 02127ff
@@ -435,6 +435,14 @@ def test_get_loc_value(self): | |||
idx = IntervalIndex.from_arrays([0, 2], [1, 3]) | |||
pytest.raises(KeyError, idx.get_loc, 1.5) | |||
|
|||
# GH25087, test get_loc returns key error for interval indexes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this in a new test? (you can leave it in this place, but just put a def test_get_loc_invalid_key(self)
above this line)
Reason is that the other test is commented to be replaced, but this new test we want to keep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have got a few broken tests to fix. Hoping to make the build green in the coming days.
Instead of returning [-1, -1, -1] when the middle value is incorrect type, return [a, -1, b].
Add mix of invalid and valid values
Fixes test_with_overlaps test
interval.get_indexer() should still raise a TypeError in cases where the types are unorderable. This is needed for DataFrame.append for example, which was breaking tests in test_concat.
Any tips on making the tests pass on macOS, windows and Linux py27? |
@samuelsinayoko do those tests fail for you locally? If not, I can take a look later. |
will take another look at this tomorrow
…On Thu, 7 Mar 2019 at 18:00, Tom Augspurger ***@***.***> wrote:
@samuelsinayoko <https://github.com/samuelsinayoko> do those tests fail
for you locally? If not, I can take a look later.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#25090 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AQuwvnphRf-gB8HLdg_PKKyjG_Yfqc15ks5vUVOkgaJpZM4afkZ_>
.
--
*Sam Sinayoko* | Data Scientist
*BMLL Technologies Ltd.*
10th Floor – Portland House
Bressenden Place
London
SW1E 5RS
+44(0)20 3828 9020
|
@TomAugspurger yes the tests pass locally on my linux machine. Not sure what's happening on Windows and OSX, so would be grateful for any help on this. |
@samuelsinayoko can you merge master? Recently dropped Py2 support so should make things easier |
Ok will do |
@samuelsinayoko can you merge mater and update |
closing as stale, if you'd like to continue, pls ping. |
git diff upstream/master -u -- "*.py" | flake8 --diff