BUG: reindexing a frame fails to pick nan value #8108


Merged: 1 commit, Mar 5, 2015

Conversation

behzadnouri
Contributor

Currently:

>>> df = pd.DataFrame([[1, 2], [3, 5], [7, 11], [9, 23]],
...         index=[2, np.nan, 1, 5], columns=['joe', 'jim'])
>>> df
     joe  jim
 2     1    2
NaN    3    5
 1     7   11
 5     9   23
>>> df.reindex([2, np.nan, 1])
     joe  jim
 2     1    2
NaN  NaN  NaN
 1     7   11

with this patch:

>>> df.reindex([2, np.nan, 1])
     joe  jim
 2     1    2
NaN    3    5
 1     7   11
>>> df.reindex([np.nan, 5, 5, np.nan, 1, 2, np.nan])
     joe  jim
NaN    3    5
 5     9   23
 5     9   23
NaN    3    5
 1     7   11
 2     1    2
NaN    3    5

Also, Float64HashTable cannot look up nan values:

>>> from pandas.hashtable import Float64HashTable
>>> xs = np.array([2.718, 3.14, np.nan, -7, 5, 2, 3])
>>> m = Float64HashTable()
>>> m.map_locations(xs)
>>> m.lookup(xs)
array([ 0,  1, -1,  3,  4,  5,  6])

this is fixed to:

>>> m.map_locations(xs)
>>> m.lookup(xs)
array([0, 1, 2, 3, 4, 5, 6])
>>> np.all(m.lookup(xs) == np.arange(len(xs)))
True 

the same way that Python's built-in lists work:

>>> [2.718, 3.14, np.nan, -7, 5, 2, 3].index(np.nan)
2
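The equality rule the patch introduces can be sketched in plain Python (nan_aware_equal is an illustrative name; the actual change is in khash's C equality macro):

```python
import math

# Illustrative sketch of the patched comparison: two float keys match
# if they are numerically equal OR both are NaN.
def nan_aware_equal(a, b):
    return a == b or (math.isnan(a) and math.isnan(b))

assert nan_aware_equal(float('nan'), float('nan'))  # NaN now matches NaN
assert nan_aware_equal(2.718, 2.718)
assert not nan_aware_equal(2.718, 3.14)
```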

@jreback
Contributor

jreback commented Aug 25, 2014

having NaNs in an index is very odd
why would you do this?

@behzadnouri
Contributor Author

@jreback I do not think it is odd, because a lot of the time you do not really know that your data have null values. You may call set_index on a column with null values, and then the results you get are odd:

>>> df
   joe  jim  jolie
0    2    1      2
1  NaN    3      5
2    1    7     11
3    5    9     23
>>> df.set_index('joe').reindex(df['joe'].iloc[:-1])
     jim  jolie
joe            
 2     1      2
NaN  NaN    NaN
 1     7     11

Also, regardless of nan in the index, kh_float64_hash_equal in pandas.hashtable is wrong, because in Python:

>>> {nan:5}[nan]
5
>>> [2, nan, 1, 5].index(nan)  # even though nan != nan
1

but not in pandas.hashtable.Float64HashTable:

>>> df.index
Float64Index([2.0, nan, 1.0, 5.0], dtype='float64')
>>> m = df.index._engine.mapping
>>> type(m)
<class 'pandas.hashtable.Float64HashTable'>
>>> m.lookup(np.array([2.0, np.nan, 5.0]))
array([ 0, -1,  3])

with this patch:

>>> m.lookup(np.array([2.0, np.nan, 5.0]))
array([0, 1, 3])

@behzadnouri behzadnouri force-pushed the khash-nan branch 2 times, most recently from 6c3a299 to 424d98e Compare August 25, 2014 11:51
@jreback
Contributor

jreback commented Aug 25, 2014

don't merge in master branches to this, rebase and force push instead, see here: https://github.com/pydata/pandas/wiki/Using-Git

@jreback
Contributor

jreback commented Aug 25, 2014

  • this needs tests for other index types, e.g. Index(....dtype=object), and DatetimeIndex
  • needs a test for the set_index case above.

That said, we really have to think about this. The problem with multi-nan in an index is not new, and causes tremendous indexing headaches. It probably should raise if there is more than one nan. Having nans in a column is one thing, but indexing on them is impossible (and in fact it raises if there are 2 or more nans now, IIRC).

@behzadnouri
Contributor Author

It is not about multi-nan, because this code path is only taken when the index is unique. If there are multiple nans, or multiples of anything, the code does not hit this branch.

Consider this more of a fix to Float64HashTable than indexing.

@behzadnouri behzadnouri changed the title BUG: reindexing a frame fails to pick nan values BUG: reindexing a frame fails to pick nan value Aug 25, 2014
@behzadnouri
Contributor Author

The example below shows that object-dtype indexes work fine, even with the exact same values:

>>> df.index
Float64Index([2.0, nan, 1.0, 5.0], dtype='float64')
>>> df.reindex([2, np.nan, 1])
     joe  jim
 2     1    2
NaN  NaN  NaN
 1     7   11
>>> df.index = df.index.astype('object')
>>> df.index
Index([2.0, nan, 1.0, 5.0], dtype='object')
>>> df.reindex([2, np.nan, 1])
     joe  jim
 2     1    2
NaN    3    5
 1     7   11

@behzadnouri behzadnouri force-pushed the khash-nan branch 3 times, most recently from 3d4cbef to e0fc846 Compare August 26, 2014 01:22
@jreback jreback added this to the 0.15.0 milestone Aug 29, 2014
@jreback
Contributor

jreback commented Aug 29, 2014

@cpcloud ok?

@cpcloud
Member

cpcloud commented Aug 30, 2014

I guess so.

@behzadnouri your redefinition of the float64 hash equal function goes against the accepted definition of nan, and it's not clear to me what the use case is. After you call set_index and get an index with nans in it, what do you do with the resulting DataFrame/Series? What operations are made significantly easier or more convenient by allowing this?

Something like df.loc[np.nan:10] or even just df.loc[np.nan] defies my own intuition about nan because being able to index with something means being able to say that something can be equated to other things. nan is by definition not even equal to itself so I'm not sure what is to be gained here.

also, regardless of nan in the index, kh_float64_hash_equal of the pandas.hashmap is wrong, because in python:

No, khash preserves C semantics regarding nan whereas CPython doesn't in many cases, particularly when the computation involves checking whether nan is in a sequence. In CPython, rich comparison checks pointers and then calls objects' __eq__ methods. Let's look at the main loop in the C implementation of index:

    for (i = start; i < stop && i < Py_SIZE(self); i++) {
        int cmp = PyObject_RichCompareBool(self->ob_item[i], v, Py_EQ);
        if (cmp > 0)
            return PyLong_FromSsize_t(i);
        else if (cmp < 0)
            return NULL;
    }

This line

PyObject_RichCompareBool(self->ob_item[i], v, Py_EQ);

short-circuits when two pointers are equal (and returns 1), which means that it assumes id(x) == id(y) implies x == y. This is a completely reasonable assumption, except in the case of nan.
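That short-circuit is easy to observe directly (a quick check using float('nan'); any NaN object behaves the same way):

```python
x = float('nan')
assert [x].index(x) == 0       # same object: the identity check finds it

try:
    [x].index(float('nan'))    # distinct NaN object: == is False, not found
    raise AssertionError("unreachable")
except ValueError:
    pass
```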

It also appears that you made the assumption that nan is a singleton. nan is not a singleton, so you can easily construct cases where you might think you'd get something back but you don't:

In [10]: import numpy as np

In [11]: npnan = np.nan

In [12]: pynan = float('nan')

In [13]: np.isnan(npnan)
Out[13]: True

In [14]: np.isnan(pynan)
Out[14]: True

In [15]: [pynan].index(npnan)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-e93f5bc41667> in <module>()
----> 1 [pynan].index(npnan)

ValueError: nan is not in list

All that said, if you feel really strongly about this and @jreback is okay with it, then merge away!

@behzadnouri
Contributor Author

The whole point is that the code should preserve the data as much as possible, and let the user of the library see and deal with their nan values. A lot of the time you do not know that your data have null values, and that is a valuable red flag which you would not see if the code silently dropped them.

Currently if I have a data-frame like this:

>>> df
   jim  joe  jolie
0    1    2      3
1  NaN    5      6
2    7    8      9

a simple operation like the one below causes loss of data:

>>> df.set_index('jim').reindex(df.loc[::-1, 'jim']).dropna()
     joe  jolie
jim            
7      8      9
1      2      3

whereas the semantically identical operation gives a different result:

>>> df.set_index('jim').iloc[::-1].dropna()
     joe  jolie
jim            
 7     8      9
NaN    5      6
 1     2      3

In pandas it is very convenient and easy to deal with nan/null values; you just call fillna or dropna as appropriate and that's it.

But on the other hand, pandas treats nan/null values in a way that makes you go on a scavenger hunt to find out why you lost a row of data, as in the example above.

Here is another example: this behavior is at odds with how database systems do the GROUP BY operation; for example, see here; quoting from the page:

If the grouping column contains a null value, that row becomes a group in the results. If the grouping column contains more than one null value, the null values are put into a single group. This behavior is defined in the SQL-2003 standard.

The point of grouping null values (even though SELECT NULL = NULL returns NULL in database systems) is that the system carries null values through operations as far as possible, so the user can see at the end that their data have null values. The rationale is that the user knows their NULL values (and what to do with them) far better than the database can guess. For example, see "The NULL Spouse Example" on page 6 of this talk by Bruce Momjian.
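As a sketch of the quoted GROUP BY rule, with NaN standing in for NULL (group_by and the 'NULL' sentinel are illustrative, not pandas or SQL API):

```python
import math

# All NULL keys land in a single group rather than being dropped,
# as per the SQL-2003 behavior quoted above.
def group_by(rows, key):
    groups = {}
    for row in rows:
        k = key(row)
        if isinstance(k, float) and math.isnan(k):
            k = 'NULL'                    # collapse every NaN into one group
        groups.setdefault(k, []).append(row)
    return groups

rows = [(1.0, 'a'), (float('nan'), 'b'), (1.0, 'c'), (float('nan'), 'd')]
g = group_by(rows, key=lambda r: r[0])
assert [r[1] for r in g['NULL']] == ['b', 'd']   # one group for the NULLs
assert [r[1] for r in g[1.0]] == ['a', 'c']
```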

This is another example of how treating one value differently from the rest of the litter can break the code.

IMAO it is better to deviate from the IEEE definition of nan values in favor of what makes more sense from a data-analysis point of view.

@cpcloud
Member

cpcloud commented Aug 30, 2014

@behzadnouri I'm on board with all of your points and I understand the issue.

What I don't understand (and maybe I should've said this from the start) is why you are using set_index on a column that you know may have nans in it? No matter what, indices with nan in them become markedly less useful.

More specifically why do you need to be able to have an Index with nan? Are you doing operations that require you to slice with nan values or index into the DataFrame/Series with nans?

Again, what operations are you doing with/on the Index that you cannot do conveniently in a Series?

If you think a column may have nans in it, then don't use set_index. Even if you could, what are you going to do with the result after you call set_index?

@jreback
Contributor

jreback commented Aug 30, 2014

interjecting my 2c

currently nan groups are dropped FYI (this is a separate issue) - I think mainly because it can lead to having a nan indexer

which is generally to be avoided BECAUSE of a multitude of issues when trying to index (as cpcloud points out) - wesm originally punted on this - in recent years we have allowed this more and more

so I am actually ok with this PR - just let's make sure the semantics are similar to what we do for nan testing (e.g. cpcloud's points)

that said - I wouldn't take SQL as gospel, nor really as a recommendation

they don't have any notion of indexing at all!

it just returns the result of a query (which is not the same!)

and oftentimes DBs serve different purposes than data analysis and therefore do things differently

and generally SQL is neither as flexible nor as idiomatic as pandas, and is generally NOT a good model

@behzadnouri
Contributor Author

again, it is about preserving the data.

Here is an example in which I am setting the index on a column that has nan values:

>>> left
   jim  jolie
0    1      3
1  NaN      6
2    7      9
>>> right
   jim  joe
0    1    2
1  NaN    5
2    7    8
>>> left.join(right.set_index('jim'), on='jim')
   jim  jolie  joe
0    1      3    2
1  NaN      6    5
2    7      9    8
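A plain-Python sketch of the lookup join above: it only works if the nan key in the right table can be found by value (norm is an illustrative helper, not pandas API):

```python
import math

# Collapse all NaN keys to one sentinel so they can be looked up by value.
def norm(k):
    return 'NULL' if isinstance(k, float) and math.isnan(k) else k

right = {norm(1.0): 2, norm(float('nan')): 5, norm(7.0): 8}   # jim -> joe
left = [(1.0, 3), (float('nan'), 6), (7.0, 9)]                # (jim, jolie)

joined = [(jim, jolie, right[norm(jim)]) for jim, jolie in left]
assert [joe for _, _, joe in joined] == [2, 5, 8]   # the NaN row survives
```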

These are toy examples; the real-world case is a 500+ LOC script, where you either have to check df.isnull().any() after each line of code, or hope that if something goes wrong you will see it in the script's output. If the code silently drops your data, you will not see it.
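A sketch of the kind of guard this forces on you (assert_no_nulls is an illustrative helper, not pandas API):

```python
import math

# Fail fast when nulls creep in, instead of discovering a missing row
# in the final output.
def assert_no_nulls(values):
    bad = [i for i, v in enumerate(values)
           if isinstance(v, float) and math.isnan(v)]
    if bad:
        raise ValueError("null values at positions %s" % bad)

assert_no_nulls([1.0, 7.0])        # clean data passes silently
try:
    assert_no_nulls([1.0, float('nan'), 7.0])
    raise AssertionError("unreachable")
except ValueError:
    pass
```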

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 14, 2014
@jreback
Contributor

jreback commented Mar 2, 2015

@behzadnouri can you rebase this.

@behzadnouri
Contributor Author

@jreback rebased

jreback added a commit that referenced this pull request Mar 5, 2015
BUG: reindexing a frame fails to pick nan value
@jreback jreback merged commit 1fab6fc into pandas-dev:master Mar 5, 2015
@jreback
Contributor

jreback commented Mar 5, 2015

@behzadnouri thanks again!

Labels: Indexing, Missing-data