Read hdf returns unexpected values for categorical #39420



@nofarm3 nofarm3 commented Jan 26, 2021

This bug occurs when filtering on a categorical string column with a value that does not exist in the column.
Instead of returning an empty DataFrame, the query returns some records.
The cause is the use of numpy's searchsorted(v, side="left"), which returns the index at which an element should be inserted to maintain order. It does not return 0 when the value is missing, as the code assumed.
I changed it to first check for the value and to use searchsorted only if the value exists. I also added a test for this specific use case.
In the long run, I think we should consider refactoring this area of the code, since one function covers multiple use cases, which makes it more complex to test.

In addition, I moved the logic to a new method to follow the single-responsibility principle and to make it easier to test.
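
To illustrate the behaviour described above, here is a minimal sketch using plain numpy rather than the actual pandas code. The names `categories` and `category_code` are hypothetical and chosen only for this example; they are not taken from the patch.

```python
import numpy as np

# Sorted categories, standing in for the stored categorical metadata (assumption).
categories = np.array(["a", "b", "c"])

# searchsorted returns an insertion point, so a missing value still
# yields a valid-looking index instead of a "not found" marker.
missing_index = int(categories.searchsorted("bb", side="left"))


def category_code(categories, value):
    """Hypothetical helper: return the code of `value`, or -1 if it is absent."""
    # Check membership first; only look up the position when the value exists.
    if value not in categories:
        return -1
    return int(categories.searchsorted(value, side="left"))
```

With the membership check in place, a filter on a missing value maps to a code that matches no rows, so the query can return an empty result.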

…es-for-categorical' into read-hdf-returns-unexpected-values-for-categorical

jreback commented Jan 26, 2021

Note that you don't need to open a new PR; you can simply push to the original one. Since you already did, that's OK.

@jreback jreback added the IO HDF5 read_hdf, HDFStore label Jan 26, 2021
@jreback jreback added the Categorical Categorical Data Type label Jan 26, 2021
@MarcoGorelli MarcoGorelli self-requested a review January 30, 2021 16:03

nofarm3 commented Jan 30, 2021

I think the failures here are not related to my changes.

@nofarm3 nofarm3 requested a review from jreback January 30, 2021 17:27
@jreback jreback added this to the 1.3 milestone Jan 30, 2021
@jreback jreback merged commit 0b16fb3 into pandas-dev:master Jan 30, 2021

jreback commented Jan 30, 2021

thanks @nofarm3

@nofarm3 nofarm3 deleted the read-hdf-returns-unexpected-values-for-categorical branch January 31, 2021 06:26
Development

Successfully merging this pull request may close these issues.

BUG: queries on categorical string columns in read_hdf return unexpected results