BUG: indexing with boolean-like Index, #11119 #11178

Closed
wants to merge 1 commit into from

Contributor

@preddy5 preddy5 commented Sep 23, 2015

closes #11119

@preddy5
Contributor Author

preddy5 commented Sep 23, 2015

@jorisvandenbossche

@max-sixty
Contributor

Does this only work where you supply two items for a DataFrame, because it's checking for ndim=len(key)? Do you want to be checking the length of the axis?

In [11]: df2=pd.concat([df, pd.Series([1,3,12,9],name=True)],axis=1)

In [12]: df2
Out[12]: 
   False  True   True 
0      6      3      1
1      1      9      3
2     13      8     12
3      8      2      9


In [13]: df2[[False]]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-b7f220d1c4ee> in <module>()
----> 1 df2[[False]]

/Users/maximilian/Dropbox/workspace/pandas/pandas/core/frame.py in __getitem__(self, key)
   1906         if isinstance(key, (Series, np.ndarray, Index, list)):
   1907             # either boolean or fancy integer index
-> 1908             return self._getitem_array(key)
   1909         elif isinstance(key, DataFrame):
   1910             return self._getitem_frame(key)

/Users/maximilian/Dropbox/workspace/pandas/pandas/core/frame.py in _getitem_array(self, key)
   1947                 else:
   1948                     raise ValueError('Item wrong length %d instead of %d.' %
-> 1949                                  (len(key), len(self.index)))
   1950             # check_bool_indexer will throw exception if Series key cannot
   1951             # be reindexed to match DataFrame rows

ValueError: Item wrong length 1 instead of 4.

In [17]: df2[[False, True, False]]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-88caa367ce94> in <module>()
----> 1 df2[[False, True, False]]

/Users/maximilian/Dropbox/workspace/pandas/pandas/core/frame.py in __getitem__(self, key)
   1906         if isinstance(key, (Series, np.ndarray, Index, list)):
   1907             # either boolean or fancy integer index
-> 1908             return self._getitem_array(key)
   1909         elif isinstance(key, DataFrame):
   1910             return self._getitem_frame(key)

/Users/maximilian/Dropbox/workspace/pandas/pandas/core/frame.py in _getitem_array(self, key)
   1947                 else:
   1948                     raise ValueError('Item wrong length %d instead of %d.' %
-> 1949                                  (len(key), len(self.index)))
   1950             # check_bool_indexer will throw exception if Series key cannot
   1951             # be reindexed to match DataFrame rows

ValueError: Item wrong length 3 instead of 4.

Whereas if the column names are strings, the same selection works:

In [33]: df2_str[['False','True']]
Out[33]: 
   False  True  True
0      6     3     1
1      1     9     3
2     13     8    12
3      8     2     9
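
As long as `[]` treats an all-boolean list as a row mask, one possible workaround is to match the column labels explicitly and then select by position. This is only a sketch: `df2` is rebuilt here from the output above, and `select_columns_by_label` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

# Rebuild the frame shown above: one column labelled False, two labelled True.
df2 = pd.DataFrame(
    [[6, 3, 1], [1, 9, 3], [13, 8, 12], [8, 2, 9]],
    columns=[False, True, True],
)


def select_columns_by_label(frame, labels):
    """Hypothetical helper: pick the columns whose label is in `labels`."""
    # isin compares label values, so the booleans are treated as labels here.
    mask = frame.columns.isin(labels)
    # Convert to integer positions so .iloc cannot read them as a mask.
    return frame.iloc[:, np.flatnonzero(mask)]


only_false = select_columns_by_label(df2, [False])
print(only_false)
```

Note that the columns come back in the frame's own order, not in the order of `labels`, and duplicated labels such as the two `True` columns are all returned.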

@jreback jreback added the Indexing (Related to indexing on series/frames, not to indexes themselves) and Dtype Conversions (Unexpected or buggy dtype conversions) labels Sep 24, 2015
@jreback jreback changed the title from "issue #11119" to "BUG: indexing with boolean-like Index, #11119" Sep 24, 2015
@jreback
Contributor

jreback commented Sep 24, 2015

What actually is needed is to infer this:

In [22]: pd.lib.infer_dtype(Index([True,False]))
Out[22]: 'boolean'

So right now we don't have a separate boolean Index type; it's just object dtype.
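
In later pandas releases the same inference is available as `pandas.api.types.infer_dtype`. A small sketch, assuming a pandas version that ships that function (the variable names are illustrative):

```python
import pandas as pd
from pandas.api.types import infer_dtype

bool_labels = pd.Index([True, False])
str_labels = pd.Index(["a", "b"])

# infer_dtype inspects the values themselves, so it reports 'boolean'
# regardless of how the Index happens to store them.
print(infer_dtype(bool_labels))  # 'boolean'
print(infer_dtype(str_labels))   # 'string'
```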

So this particular case is only relevant when you:

  • have a boolean indexer
  • have a boolean Index on the same axis (so it must be object dtype), then you infer
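
The two conditions above could be checked together along these lines. This is a sketch only: `is_ambiguous_bool_key` is a hypothetical name, and it uses the public `pandas.api.types.infer_dtype` rather than `pd.lib`.

```python
import pandas as pd
from pandas.api.types import infer_dtype


def is_ambiguous_bool_key(key, axis_labels):
    """Hypothetical check: True if a boolean key could also be read as labels."""
    key_is_bool = infer_dtype(key) == "boolean"
    axis_is_bool = infer_dtype(axis_labels) == "boolean"
    return key_is_bool and axis_is_bool


df = pd.DataFrame([[6, 3], [1, 9]], columns=[False, True], index=["a", "b"])

print(is_ambiguous_bool_key([False, True], df.columns))  # True
print(is_ambiguous_bool_key([False, True], df.index))    # False
```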

@@ -1941,7 +1941,11 @@ def _getitem_array(self, key):
                 warnings.warn("Boolean Series key will be reindexed to match "
                               "DataFrame index.", UserWarning)
             elif len(key) != len(self.index):
                 raise ValueError('Item wrong length %d instead of %d.' %
                 if lib.infer_dtype(self._info_axis)=="boolean":
Contributor Author

@jreback am I doing it right?

Contributor

I think you can move this check to _convert_to_indexer, and raise if there is a problem (e.g. subsume the ValueError there). That makes this code a lot simpler. Right now there is too much special casing.
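
The suggestion amounts to having a single place that validates a boolean key and raises. A minimal standalone sketch of that idea follows; `validate_bool_key` is a hypothetical name, and this is not the actual `_convert_to_indexer` logic.

```python
import numpy as np


def validate_bool_key(key, axis_len):
    """Hypothetical single entry point for boolean-key validation."""
    mask = np.asarray(key)
    if mask.dtype != np.bool_:
        raise TypeError("expected an all-boolean key")
    if len(mask) != axis_len:
        # Same message as the ValueError raised in _getitem_array.
        raise ValueError('Item wrong length %d instead of %d.'
                         % (len(mask), axis_len))
    return mask


print(validate_bool_key([True, False, True, False], 4))
try:
    validate_bool_key([False], 4)
except ValueError as err:
    print(err)
```

Callers would then receive either a validated mask or one consistent exception, instead of each call site repeating the length check.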

@preddy5
Contributor Author

preddy5 commented Sep 29, 2015

@jreback can you help me understand the error? I am unable to reproduce it in my notebook, and it only occurs on Python 2.7.

@jreback
Contributor

jreback commented Sep 29, 2015

It's unrelated, a spurious failure. I restarted. In any event I will look at this, probably next week. You are jumping through lots of hoops here.

@preddy5
Contributor Author

preddy5 commented Oct 7, 2015

@jreback Could you review the PR? Thanks.

@shoyer
Member

shoyer commented Oct 7, 2015

I'm not sure if this change is actually a good idea. Suppose we have a 2x2 DataFrame with columns and rows [True, False]:

In [3]: df = pd.DataFrame([[1, 2], [3, 4]], columns=[True, False], index=[True, False])

In [4]: df
Out[4]:
       True   False
True       1      2
False      3      4

What should df[[False, True]] return? Standard indexing with [] is a mess of fallback rules, so usually it's best to first think about what .loc and .iloc can handle.

Here, note that both .loc and .iloc claim to support boolean arrays as indexers. Given that indexes can include booleans, this is clearly an ambiguous case in our current rules. The current behavior may be a bug, but all the indexers currently seem to use boolean arrays for subsetting rows, not for looking up index values:

In [5]: df[[True, False]]
Out[5]:
      True   False
True      1      2

In [9]: df.loc[[True, False]]
Out[9]:
      True   False
True      1      2

So my suggestion is that we should first consider deprecating using booleans that are not index values with .loc. Then, once this is entirely unambiguous, we can consider appropriate fallback rules for normal indexing.
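
To make the two readings concrete, here is a sketch that spells each one out explicitly instead of relying on the fallback rules: `.iloc` with a boolean list for the mask reading, and `Index.get_indexer` followed by `.iloc` for the label reading. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=[True, False], index=[True, False])

# Mask reading: keep the first row, drop the second.
by_mask = df.iloc[[True, False]]

# Label reading: translate the labels to integer positions first,
# then select those columns by position.
col_positions = df.columns.get_indexer([False, True])
by_label = df.iloc[:, col_positions]

print(by_mask)
print(by_label)
```

Going through integer positions sidesteps the ambiguity, since `.iloc` never interprets integers as labels.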

@jreback
Contributor

jreback commented Oct 7, 2015

Deprecating boolean indexers for .loc is a non-starter, as a) this would break all back-compat, and b) this would make .loc completely non-intuitive and unusable with a boolean array, which is a core feature.

Supporting label indexing is currently a buggy edge case. I'll review soon.
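
For reference, the boolean-array usage of `.loc` being defended here typically looks like the following. The frame and column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

# A boolean Series used as a row mask with .loc, combined with a
# column label in the same call.
selected = df.loc[df["a"] > 1, "b"]
print(selected)
```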

@shoyer
Member

shoyer commented Oct 8, 2015

I agree that this would be a backwards-incompatible change, but a pretty smooth deprecation cycle would be possible. I disagree that it is unintuitive for .loc to not work with boolean arrays, because .loc is explicitly for label-based indexing. We've simply gotten away with it because using booleans for labels is pretty rare.

@jreback
Contributor

jreback commented Oct 8, 2015

@shoyer I don't see any good reason to not allow .loc to accept boolean arrays. This would cause way more issues than it solves. In fact it is quite common to do this. Not sure why you are pushing this way. We want more consistency and unification, not less.

@jreback
Contributor

jreback commented Nov 23, 2015

@pradyu1993 rebase this on master pls.

This needs a fair amount more testing (of other cases of boolean labels).

@jreback
Contributor

jreback commented Jan 11, 2016

Closing. If you want to update according to the comments, pls reopen.

Successfully merging this pull request may close these issues.

Slicing multiple DataFrame columns doesn't work with boolean column names
5 participants