BUG: reindexing a frame fails to pick nan value #8108


Merged: 1 commit, Mar 5, 2015

Conversation

behzadnouri
Contributor

Currently:

>>> df = pd.DataFrame([[1, 2], [3, 5], [7, 11], [9, 23]],
...         index=[2, np.nan, 1, 5], columns=['joe', 'jim'])
>>> df
     joe  jim
 2     1    2
NaN    3    5
 1     7   11
 5     9   23
>>> df.reindex([2, np.nan, 1])
     joe  jim
 2     1    2
NaN  NaN  NaN
 1     7   11

with this patch:

>>> df.reindex([2, np.nan, 1])
     joe  jim
 2     1    2
NaN    3    5
 1     7   11
>>> df.reindex([np.nan, 5, 5, np.nan, 1, 2, np.nan])
     joe  jim
NaN    3    5
 5     9   23
 5     9   23
NaN    3    5
 1     7   11
 2     1    2
NaN    3    5

Also, Float64HashTable cannot look up nan values:

>>> from pandas.hashtable import Float64HashTable
>>> xs = np.array([2.718, 3.14, np.nan, -7, 5, 2, 3])
>>> m = Float64HashTable()
>>> m.map_locations(xs)
>>> m.lookup(xs)
array([ 0,  1, -1,  3,  4,  5,  6])

this is fixed to:

>>> m.map_locations(xs)
>>> m.lookup(xs)
array([0, 1, 2, 3, 4, 5, 6])
>>> np.all(m.lookup(xs) == np.arange(len(xs)))
True 

the same way that Python's built-in lists work:

>>> [2.718, 3.14, np.nan, -7, 5, 2, 3].index(np.nan)
2
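The equality rule the patch introduces can be sketched in plain Python (nan_aware_equal is an illustrative name; the actual change is in khash's C equality macro):

```python
import math

# Illustrative sketch of the patched comparison: two float keys match
# if they are numerically equal OR both are NaN.
def nan_aware_equal(a, b):
    return a == b or (math.isnan(a) and math.isnan(b))

assert nan_aware_equal(float('nan'), float('nan'))  # NaN now matches NaN
assert nan_aware_equal(2.718, 2.718)
assert not nan_aware_equal(2.718, 3.14)
```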

@jreback
Contributor

jreback commented Aug 25, 2014

having NaNs in an index is very odd
why would you do this?

@behzadnouri
Contributor Author

@jreback I do not think it is odd, because a lot of the time you do not really know that your data have null values. You may call set_index on a column with null values, and then the results you get are odd:

>>> df
   joe  jim  jolie
0    2    1      2
1  NaN    3      5
2    1    7     11
3    5    9     23
>>> df.set_index('joe').reindex(df['joe'].iloc[:-1])
     jim  jolie
joe            
 2     1      2
NaN  NaN    NaN
 1     7     11

Also, regardless of nan in the index, kh_float64_hash_equal in pandas.hashtable is wrong, because in Python:

>>> {nan:5}[nan]
5
>>> [2, nan, 1, 5].index(nan)  # even though nan != nan
1

but not in pandas.hashtable.Float64HashTable:

>>> df.index
Float64Index([2.0, nan, 1.0, 5.0], dtype='float64')
>>> m = df.index._engine.mapping
>>> type(m)
<class 'pandas.hashtable.Float64HashTable'>
>>> m.lookup(np.array([2.0, np.nan, 5.0]))
array([ 0, -1,  3])

with this patch:

>>> m.lookup(np.array([2.0, np.nan, 5.0]))
array([0, 1, 3])

@behzadnouri behzadnouri force-pushed the khash-nan branch 2 times, most recently from 6c3a299 to 424d98e Compare August 25, 2014 11:51
@jreback
Contributor

jreback commented Aug 25, 2014

don't merge in master branches to this, rebase and force push instead, see here: https://github.com/pydata/pandas/wiki/Using-Git

@jreback
Contributor

jreback commented Aug 25, 2014

  • this needs tests for other index types, e.g. Index(....dtype=object), and DatetimeIndex
  • needs a test for the set_index case above.

That said, we really have to think about this. The problem with multi-nan in an index is not new, and causes tremendous indexing headaches. It probably should raise if there is more than one nan. Having nans in a column is one thing, but indexing on them is impossible (and in fact it raises if there are 2 or more nans now, IIRC).

@behzadnouri
Contributor Author

It is not about multi-nan, because this code path is only taken when the index is unique. If there are multiple nans, or multiples of anything, the code does not hit this branch.

Consider this more of a fix to Float64HashTable than indexing.

@behzadnouri behzadnouri changed the title BUG: reindexing a frame fails to pick nan values BUG: reindexing a frame fails to pick nan value Aug 25, 2014
@behzadnouri
Contributor Author

The example below shows that object-dtype indexes work fine, even with the exact same values:

>>> df.index
Float64Index([2.0, nan, 1.0, 5.0], dtype='float64')
>>> df.reindex([2, np.nan, 1])
     joe  jim
 2     1    2
NaN  NaN  NaN
 1     7   11
>>> df.index = df.index.astype('object')
>>> df.index
Index([2.0, nan, 1.0, 5.0], dtype='object')
>>> df.reindex([2, np.nan, 1])
     joe  jim
 2     1    2
NaN    3    5
 1     7   11

@behzadnouri behzadnouri force-pushed the khash-nan branch 3 times, most recently from 3d4cbef to e0fc846 Compare August 26, 2014 01:22
@jreback jreback added this to the 0.15.0 milestone Aug 29, 2014
@jreback
Contributor

jreback commented Aug 29, 2014

@cpcloud ok?

@cpcloud
Member

cpcloud commented Aug 30, 2014

I guess so.

@behzadnouri your redefinition of the float64 hash equal function goes against the accepted definition of nan, and it's not clear to me what the use case is. After you call set_index and get an index with nans in it, what do you do with the resulting DataFrame/Series? What operations are made significantly easier or more convenient by allowing this?

Something like df.loc[np.nan:10] or even just df.loc[np.nan] defies my own intuition about nan because being able to index with something means being able to say that something can be equated to other things. nan is by definition not even equal to itself so I'm not sure what is to be gained here.

also, regardless of nan in the index, kh_float64_hash_equal of the pandas.hashmap is wrong, because in python:

No, khash preserves C semantics regarding nan whereas CPython doesn't in many cases, particularly when the computation involves checking whether nan is in a sequence. In CPython, rich comparison checks pointers and then calls objects' __eq__ methods. Let's look at the main loop in the C implementation of index:

    for (i = start; i < stop && i < Py_SIZE(self); i++) {
        int cmp = PyObject_RichCompareBool(self->ob_item[i], v, Py_EQ);
        if (cmp > 0)
            return PyLong_FromSsize_t(i);
        else if (cmp < 0)
            return NULL;
    }

This line

PyObject_RichCompareBool(self->ob_item[i], v, Py_EQ);

short-circuits when two pointers are equal (and returns 1), which means that it assumes id(x) == id(y) implies x == y. This is a completely reasonable assumption, except in the case of nan.
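That short-circuit is easy to observe directly (a quick check using float('nan'); any NaN object behaves the same way):

```python
x = float('nan')
assert [x].index(x) == 0       # same object: the identity check finds it

try:
    [x].index(float('nan'))    # distinct NaN object: == is False, not found
    raise AssertionError("unreachable")
except ValueError:
    pass
```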

It also appears that you made the assumption that nan is a singleton. nan is not a singleton, so you can easily construct cases where you might think you'd get something back but you don't:

In [10]: import numpy as np

In [11]: npnan = np.nan

In [12]: pynan = float('nan')

In [13]: np.isnan(npnan)
Out[13]: True

In [14]: np.isnan(pynan)
Out[14]: True

In [15]: [pynan].index(npnan)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-e93f5bc41667> in <module>()
----> 1 [pynan].index(npnan)

ValueError: nan is not in list

All that said, if you feel really strongly about this and @jreback is okay with it, then merge away!

@behzadnouri
Contributor Author

The whole point is that the code should preserve the data as much as possible, and let the user of the library see and deal with their nan values. A lot of the time you do not know that your data have null values, and that is a valuable red flag which you would not see if the code silently dropped them.

Currently if I have a data-frame like this:

>>> df
   jim  joe  jolie
0    1    2      3
1  NaN    5      6
2    7    8      9

a simple operation like the one below causes loss of data:

>>> df.set_index('jim').reindex(df.loc[::-1, 'jim']).dropna()
     joe  jolie
jim            
7      8      9
1      2      3

whereas the semantically identical operation gives a different result:

>>> df.set_index('jim').iloc[::-1].dropna()
     joe  jolie
jim            
 7     8      9
NaN    5      6
 1     2      3

In pandas it is very convenient and easy to deal with nan/null values; you just call fillna or dropna as appropriate and that's it.

But on the other hand, pandas treats nan/null values in a way that makes you go on a scavenger hunt to find out why you lost a row of data, as in the example above.

Here is another example: this behavior is at odds with how database systems do the GROUP BY operation; for example, see here; quoting from the page:

If the grouping column contains a null value, that row becomes a group in the results. If the grouping column contains more than one null value, the null values are put into a single group. This behavior is defined in the SQL-2003 standard.

The point of grouping null values (even though SELECT NULL = NULL returns NULL in database systems) is that the system carries null values through operations as far as possible, so the user can see at the end that their data have null values. The rationale is that the user knows their NULL values (and what to do with them) far better than the database can guess. For example, see "The NULL Spouse Example" on page 6 of this talk by Bruce Momjian.
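As a sketch of the quoted GROUP BY rule, with NaN standing in for NULL (group_by and the 'NULL' sentinel are illustrative, not pandas or SQL API):

```python
import math

# All NULL keys land in a single group rather than being dropped,
# as per the SQL-2003 behavior quoted above.
def group_by(rows, key):
    groups = {}
    for row in rows:
        k = key(row)
        if isinstance(k, float) and math.isnan(k):
            k = 'NULL'                    # collapse every NaN into one group
        groups.setdefault(k, []).append(row)
    return groups

rows = [(1.0, 'a'), (float('nan'), 'b'), (1.0, 'c'), (float('nan'), 'd')]
g = group_by(rows, key=lambda r: r[0])
assert [r[1] for r in g['NULL']] == ['b', 'd']   # one group for the NULLs
assert [r[1] for r in g[1.0]] == ['a', 'c']
```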

This is another example of how treating one value differently from the rest of the litter can break the code.

IMAO it is better to deviate from the IEEE definition of nan values in favor of what makes more sense from a data-analysis point of view.

@cpcloud
Member

cpcloud commented Aug 30, 2014

@behzadnouri I'm on board with all of your points and I understand the issue.

What I don't understand (and maybe I should've said this from the start) is why you are using set_index on a column that you know may have nans in it? No matter what, indices with nan in them become markedly less useful.

More specifically why do you need to be able to have an Index with nan? Are you doing operations that require you to slice with nan values or index into the DataFrame/Series with nans?

Again, what operations are you doing with/on the Index that you cannot do conveniently in a Series?

If you think a column may have nans in it, then don't use set_index. Even if you could, what are you going to do with the result after you call set_index?

@jreback
Contributor

jreback commented Aug 30, 2014

interjecting my 2c

currently nan groups are dropped FYI (this is a separate issue) - I think mainly because it can lead to having a nan indexer

which is generally to be avoided BECAUSE of a multitude of issues when trying to index (as cpcloud points out) - wesm originally punted on this - in recent years we have allowed this more and more

so I am actually ok with this PR - just let's make sure the semantics are similar to what we do for nan testing (e.g. cpcloud's points)

that said - I wouldn't take SQL as gospel, nor really as a recommendation

they don't have any notion of indexing at all!

it just returns the result of a query (which is not the same!)

and oftentimes DBs serve different purposes than data analysis and therefore do things differently

and generally SQL is neither as flexible nor as idiomatic as pandas, and is generally NOT a good model

@behzadnouri
Contributor Author

again, it is about preserving the data.

Here is an example in which I am setting the index on a column that has nan values:

>>> left
   jim  jolie
0    1      3
1  NaN      6
2    7      9
>>> right
   jim  joe
0    1    2
1  NaN    5
2    7    8
>>> left.join(right.set_index('jim'), on='jim')
   jim  jolie  joe
0    1      3    2
1  NaN      6    5
2    7      9    8
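A plain-Python sketch of the lookup join above: it only works if the nan key in the right table can be found by value (norm is an illustrative helper, not pandas API):

```python
import math

# Collapse all NaN keys to one sentinel so they can be looked up by value.
def norm(k):
    return 'NULL' if isinstance(k, float) and math.isnan(k) else k

right = {norm(1.0): 2, norm(float('nan')): 5, norm(7.0): 8}   # jim -> joe
left = [(1.0, 3), (float('nan'), 6), (7.0, 9)]                # (jim, jolie)

joined = [(jim, jolie, right[norm(jim)]) for jim, jolie in left]
assert [joe for _, _, joe in joined] == [2, 5, 8]   # the NaN row survives
```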

These are toy examples; the real-world case is a 500+ LOC script, where you either have to check df.isnull().any() after each line of code, or hope that if something goes wrong you will see it in the script's output. If the code silently drops your data, you will not see it.
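A sketch of the kind of guard this forces on you (assert_no_nulls is an illustrative helper, not pandas API):

```python
import math

# Fail fast when nulls creep in, instead of discovering a missing row
# in the final output.
def assert_no_nulls(values):
    bad = [i for i, v in enumerate(values)
           if isinstance(v, float) and math.isnan(v)]
    if bad:
        raise ValueError("null values at positions %s" % bad)

assert_no_nulls([1.0, 7.0])        # clean data passes silently
try:
    assert_no_nulls([1.0, float('nan'), 7.0])
    raise AssertionError("unreachable")
except ValueError:
    pass
```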

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 14, 2014
@jreback
Contributor

jreback commented Mar 2, 2015

@behzadnouri can you rebase this.

@behzadnouri
Contributor Author

@jreback rebased

jreback added a commit that referenced this pull request Mar 5, 2015
BUG: reindexing a frame fails to pick nan value
@jreback jreback merged commit 1fab6fc into pandas-dev:master Mar 5, 2015
@jreback
Contributor

jreback commented Mar 5, 2015

@behzadnouri thanks again!

Labels: Indexing, Missing-data