BUG: reindexing a frame fails to pick nan value #8108
Having NaNs in an index is very odd.
@jreback I do not think it is odd. Because a lot of times you do not really know that your data have null values, and you may do
also, regardless of
but not in
with this patch:
6c3a299 to 424d98e
Don't merge master into this branch; rebase and force push instead. See here: https://github.com/pydata/pandas/wiki/Using-Git
That said, we really have to think about this. The problem with multi-nan in an index is not new, and it causes tremendous indexing headaches. It probably should raise if there is more than one nan. Having them in a column is one thing, but indexing is impossible (and in fact it raises if there are 2 or more nans now, IIRC).
It is not about multi-nan because this branch of code only happens if the index is unique. If there are multi-nan or multi-anything the code does not hit this branch. Consider this more of a fix to |
Below shows that objects work fine, even with exact same values:
3d4cbef to e0fc846
@cpcloud ok? |
i guess so. @behzadnouri your redefinition of the Something like
No,

    for (i = start; i < stop && i < Py_SIZE(self); i++) {
        int cmp = PyObject_RichCompareBool(self->ob_item[i], v, Py_EQ);
        if (cmp > 0)
            return PyLong_FromSsize_t(i);
        else if (cmp < 0)
            return NULL;
    }

This line PyObject_RichCompareBool(self->ob_item[i], v, Py_EQ); short-circuits when two pointers are equal (and returns It also appears that you made the assumption that
All that said, if you feel really strongly about this and @jreback is okay with it, then merge away! |
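The identity short-circuit described above can be observed from plain Python. This is a small illustration using only built-in lists and floats; the variable names are made up for the example, and it says nothing about how pandas itself performs the lookup.

```python
# Two NaN objects: equal to nothing (not even themselves), but each is
# still identical to itself as a Python object.
nan = float("nan")
other_nan = float("nan")

values = [1.0, nan, 3.0]

# list.index compares with PyObject_RichCompareBool, which treats
# "same object" as a match before falling back to ==.
same_object_index = values.index(nan)

# A different NaN object is neither identical nor equal, so the
# search fails with ValueError.
try:
    values.index(other_nan)
    found_distinct = True
except ValueError:
    found_distinct = False

print(same_object_index, found_distinct)
```

So whether a list "finds" a nan depends on whether the very same object was stored, not on the floating-point value.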
The whole point is that the code should preserve the data as much as possible, and let the user of the library see and deal with his nan values, because a lot of times you do not know your data have null values, and that is a valuable red-flag which you wouldn't see if the code silently drops it. Currently if I have a data-frame like this:
a simple operation like below causes loss of data:
whereas, semantically same operation gives different result:
In pandas it is very convenient and easy to deal with nan/null values; you just make a call to But on the other hand, it does treat nan/null values in a way that you have to go on a scavenger hunt to find out why you lost one row of data, as in the above example. This is another example. It is at odds with how database systems do
The point of grouping null values (even though This is another example of how treating one value differently from the rest of the litter, can break the code. IMAO it is better to deviate from IEEE definition of |
@behzadnouri I'm on board with all of your points and I understand the issue. What I don't understand (and maybe I should've said this from the start) is why you are using More specifically why do you need to be able to have an Again, what operations are you doing with/on the If you think a column may have |
Interjecting my 2c: currently nan groups are dropped, FYI (this is a separate issue). I think that is mainly because it can lead to having a nan indexer, which is generally to be avoided BECAUSE of a multitude of issues when trying to index (as cpcloud points out). wesm originally punted on this; in recent years we have allowed this more and more, so I am actually ok with this PR. Let's just make sure the semantics are similar to what we do for nan testing (e.g. cpcloud's points). That said, I wouldn't take SQL as gospel, nor really as a recommendation. They don't have any notion of indexing at all! It just returns the result of a query (which is not the same!). Oftentimes DBs serve different purposes than data analysis and therefore do things differently. Generally SQL is neither as flexible nor as idiomatic as pandas and is generally NOT a good model.
Again, it is about preserving the data. Here is an example in which I am setting the index on a column which has nan values:
These are toy examples; Real world case would be a 500+ LOC script, which you either have to check for |
@behzadnouri can you rebase this. |
@jreback rebased |
BUG: reindexing a frame fails to pick nan value
@behzadnouri thanks again! |
currently
with this patch:
Also, Float64HashTable cannot look up nan values:
this is fixed to:
same way that python built-in lists work:
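As a rough sketch of the lookup semantics being asked for (not the Cython Float64HashTable from this patch), a plain-Python table can record the position of nan separately, since nan never compares equal to itself. The class name NanAwareLookup and its get method are hypothetical and exist only for this illustration.

```python
import math


class NanAwareLookup:
    """Hypothetical helper: maps float labels to their first position,
    treating every NaN as the same key."""

    def __init__(self, labels):
        self._positions = {}
        self._nan_position = None
        for pos, label in enumerate(labels):
            if isinstance(label, float) and math.isnan(label):
                # NaN != NaN, so track it outside the dict.
                if self._nan_position is None:
                    self._nan_position = pos
            else:
                # Keep the first position seen for each label.
                self._positions.setdefault(label, pos)

    def get(self, label):
        """Return the position of label, or None if it is absent."""
        if isinstance(label, float) and math.isnan(label):
            return self._nan_position
        return self._positions.get(label)


lookup = NanAwareLookup([1.0, float("nan"), 3.0])
print(lookup.get(float("nan")), lookup.get(3.0), lookup.get(7.0))
```

Unlike a built-in list, which only finds the identical nan object, this sketch finds nan regardless of which nan object is passed in.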