BUG: queries on categorical string columns in read_hdf return unexpected results #39189

travispavletich · 2021-01-15T15:57:44Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

h5_path = 'test.h5'
df = pd.DataFrame({'col': ['a', 'b', 's']})
categorical_values = list(sorted(list(df.col.unique()))) # sorted because of https://github.com/pandas-dev/pandas/issues/16623
max_widths = {'col': 1} # will also set this column to be a "data column"
df.col = df.col.astype('category')
df.col.cat.set_categories(categorical_values, inplace=True)
df.to_hdf(h5_path, 'main', mode='a', format='table', append=True, min_itemsize=max_widths)

read_df = pd.read_hdf(h5_path, where='col == "q"') #returns df with index=2, col = s

h5_path = 'test.h5'
df = pd.DataFrame({'col': ['Word', 'Test']})
categorical_values = list(sorted(list(df.col.unique())))
max_widths = {'col': 4}
df.col = df.col.astype('category')
df.col.cat.set_categories(categorical_values, inplace=True)
df.to_hdf(h5_path, 'main', mode='a', format='table', append=True, min_itemsize=max_widths)

read_df = pd.read_hdf(h5_path, where='col == "W"') #returns df with index=0, col=Word
read_df = pd.read_hdf(h5_path, where='col == "T"') # returns empty df (as expected)

Problem description

Using the where clause for on disk hdf queries appears to give incorrect results sometimes. From what I have tested, this appears to only happen for columns that are both string based and categoricals. This is important because the output is completely inaccurate and makes this feature mostly unusable for these column types. I should note that I have not seen issues with querying for values that are present in the dataframe however.

Expected Output

For all read_hdf calls, the expected output is an empty dataframe.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.7.9.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-514.21.2.el7.x86_64
Version : #1 SMP Tue Jun 20 12:24:47 UTC 2017
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.0.post20201006
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

The text was updated successfully, but these errors were encountered:

jake-arkinstall · 2021-01-22T18:51:51Z

Can confirm. When I query my datetime index on a 3 million row table, I get different numbers of results every time.

                               DateTime  ...  wom
index                                    ...     
2020-12-07 00:00:01 2020-12-07 00:00:01  ...    1
2020-12-07 00:04:53 2020-12-07 00:04:53  ...    1
2020-12-07 00:15:45 2020-12-07 00:15:45  ...    1
2020-12-07 00:27:02 2020-12-07 00:27:02  ...    1
2020-12-07 00:32:22 2020-12-07 00:32:22  ...    1
...                                 ...  ...  ...
2021-01-21 15:45:33 2021-01-21 15:45:33  ...    3
2021-01-21 15:48:30 2021-01-21 15:48:30  ...    3
2021-01-21 15:52:57 2021-01-21 15:52:57  ...    3
2021-01-21 15:56:41 2021-01-21 15:56:41  ...    3
2021-01-21 16:00:01 2021-01-21 16:00:01  ...    3

[40328 rows x 167 columns]
>>> pandas.read_hdf(store, "updates", where="index > '2020-12-07'")
                               DateTime  ...  wom
index                                    ...     
2020-12-07 00:00:01 2020-12-07 00:00:01  ...    1
2020-12-07 00:04:53 2020-12-07 00:04:53  ...    1
2020-12-07 00:15:45 2020-12-07 00:15:45  ...    1
2020-12-07 00:27:02 2020-12-07 00:27:02  ...    1
2020-12-07 00:32:22 2020-12-07 00:32:22  ...    1
...                                 ...  ...  ...
2021-01-21 15:45:33 2021-01-21 15:45:33  ...    3
2021-01-21 15:48:30 2021-01-21 15:48:30  ...    3
2021-01-21 15:52:57 2021-01-21 15:52:57  ...    3
2021-01-21 15:56:41 2021-01-21 15:56:41  ...    3
2021-01-21 16:00:01 2021-01-21 16:00:01  ...    3

[39649 rows x 167 columns]
>>> pandas.read_hdf(store, "updates", where="index > '2020-12-07'")
                               DateTime  ...  wom
index                                    ...     
2020-12-07 00:00:01 2020-12-07 00:00:01  ...    1
2020-12-07 00:04:53 2020-12-07 00:04:53  ...    1
2020-12-07 00:15:45 2020-12-07 00:15:45  ...    1
2020-12-07 00:27:02 2020-12-07 00:27:02  ...    1
2020-12-07 00:32:22 2020-12-07 00:32:22  ...    1
...                                 ...  ...  ...
2021-01-21 15:07:41 2021-01-21 15:07:41  ...    3
2021-01-21 15:08:21 2021-01-21 15:08:21  ...    3
2021-01-21 15:09:09 2021-01-21 15:09:09  ...    3
2021-01-21 15:10:01 2021-01-21 15:10:01  ...    3
2021-01-21 15:10:49 2021-01-21 15:10:49  ...    3

[22559 rows x 167 columns]
>>> pandas.read_hdf(store, "updates", where="index > '2020-12-07'")
                               DateTime  ...  wom
index                                    ...     
2020-12-07 00:00:01 2020-12-07 00:00:01  ...    1
2020-12-07 00:04:53 2020-12-07 00:04:53  ...    1
2020-12-07 00:15:45 2020-12-07 00:15:45  ...    1
2020-12-07 00:27:02 2020-12-07 00:27:02  ...    1
2020-12-07 00:32:22 2020-12-07 00:32:22  ...    1
...                                 ...  ...  ...
2021-01-21 15:45:33 2021-01-21 15:45:33  ...    3
2021-01-21 15:48:30 2021-01-21 15:48:30  ...    3
2021-01-21 15:52:57 2021-01-21 15:52:57  ...    3
2021-01-21 15:56:41 2021-01-21 15:56:41  ...    3
2021-01-21 16:00:01 2021-01-21 16:00:01  ...    3

[41686 rows x 167 columns]
>>> pandas.read_hdf(store, "updates", where="index > '2020-12-07'")
                               DateTime  ...  wom
index                                    ...     
2020-12-07 00:00:01 2020-12-07 00:00:01  ...    1
2020-12-07 00:04:53 2020-12-07 00:04:53  ...    1
2020-12-07 00:15:45 2020-12-07 00:15:45  ...    1
2020-12-07 00:27:02 2020-12-07 00:27:02  ...    1
2020-12-07 00:32:22 2020-12-07 00:32:22  ...    1
...                                 ...  ...  ...
2021-01-21 15:45:33 2021-01-21 15:45:33  ...    3
2021-01-21 15:48:30 2021-01-21 15:48:30  ...    3
2021-01-21 15:52:57 2021-01-21 15:52:57  ...    3
2021-01-21 15:56:41 2021-01-21 15:56:41  ...    3
2021-01-21 16:00:01 2021-01-21 16:00:01  ...    3

[41492 rows x 167 columns]

Also strange is that I can't read this HDF with the filename anymore. I need to use a HDFStore - trying to use pandas.read_hdf(filename, key, ...) gives an error saying that it's unable to write - but I'm trying to read.

>>> df = pandas.read_hdf("data.hdf5", "updates", start=-54975)
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/pandas/io/pytables.py", line 425, in read_hdf
    return store.select(
  File "/usr/lib/python3.9/site-packages/pandas/io/pytables.py", line 845, in select
    return it.get_result()
  File "/usr/lib/python3.9/site-packages/pandas/io/pytables.py", line 1914, in get_result
    self.close()
  File "/usr/lib/python3.9/site-packages/pandas/io/pytables.py", line 1890, in close
    self.store.close()
  File "/usr/lib/python3.9/site-packages/pandas/io/pytables.py", line 717, in close
    self._handle.close()
  File "/usr/lib/python3.9/site-packages/tables/file.py", line 2723, in close
    self.root._f_close()
  File "/usr/lib/python3.9/site-packages/tables/group.py", line 933, in _f_close
    self._g_close_descendents()
  File "/usr/lib/python3.9/site-packages/tables/group.py", line 897, in _g_close_descendents
    node_manager.close_subtree(self._v_pathname)
  File "/usr/lib/python3.9/site-packages/tables/file.py", line 538, in close_subtree
    self._close_nodes(paths, cache.pop)
  File "/usr/lib/python3.9/site-packages/tables/file.py", line 513, in _close_nodes
    node._f_close()
  File "/usr/lib/python3.9/site-packages/tables/table.py", line 2961, in _f_close
    self.flush()
  File "/usr/lib/python3.9/site-packages/tables/table.py", line 2898, in flush
    rowsadded = self.flush_rows_to_index(_lastrow=True)
  File "/usr/lib/python3.9/site-packages/tables/table.py", line 2521, in flush_rows_to_index
    rowsadded = self._add_rows_to_index(
  File "/usr/lib/python3.9/site-packages/tables/table.py", line 2550, in _add_rows_to_index
    index.append_last_row(
  File "/usr/lib/python3.9/site-packages/tables/index.py", line 734, in append_last_row
    larr, arr, idx = self.initial_append(xarr, nrows, reduction)
  File "/usr/lib/python3.9/site-packages/tables/index.py", line 626, in initial_append
    self._v_attrs.is_csi = False
  File "/usr/lib/python3.9/site-packages/tables/attributeset.py", line 466, in __setattr__
    nodefile._check_writable()
  File "/usr/lib/python3.9/site-packages/tables/file.py", line 2167, in _check_writable
    raise FileModeError("the file is not writable")
tables.exceptions.FileModeError: the file is not writable

Also on pandas 1.2.0, tables 3.6.1, but on python 3.9.1.

nofarm3 · 2021-01-22T19:08:42Z

take

travispavletich added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 15, 2021

github-actions bot assigned nofarm3 Jan 22, 2021

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021

BUG: fix case of a category value which isn't exists (pandas-dev#39189)

bb2d803

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021

BUG: add UT to conver_value for this use case (pandas-dev#39189)

f9be625

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021

BUG: change style with pre-commit (pandas-dev#39189)

aa90441

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021

BUG: add a whatsnew record (pandas-dev#39189)

e8ca3fc

nofarm3 mentioned this issue Jan 23, 2021

BUG: read_hdf bad filtering in case of categorical string columns #39351

Closed

4 tasks

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021

BUG: check for tests (pandas-dev#39189)

0cb8ad7

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021

BUG: remove spaces (pandas-dev#39189)

8284e0b

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021

BUG: remove whatsnew (pandas-dev#39189)

9773aaa

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021

BUG: remove tests(pandas-dev#39189)

4281ef0

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021

BUG: add whats new (pandas-dev#39189)

7178757

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021

BUG: check tests (pandas-dev#39189)

ca9420e

simonjayhawkins added Categorical Categorical Data Type IO HDF5 read_hdf, HDFStore and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 26, 2021

simonjayhawkins added this to the Contributions Welcome milestone Jan 26, 2021

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: update tests (pandas-dev#39189)

f61b7c5

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: update after precommit (pandas-dev#39189)

8c3b3b6

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: update after precommit (pandas-dev#39189)

4e3bce2

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: fix case of a category value which isn't exists (pandas-dev#39189)

558a585

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: add UT to conver_value for this use case (pandas-dev#39189)

adfe600

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: change style with pre-commit (pandas-dev#39189)

63815c7

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: add a whatsnew record (pandas-dev#39189)

74c687a

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: check for tests (pandas-dev#39189)

3023fc0

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: remove spaces (pandas-dev#39189)

f917ba9

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: remove whatsnew (pandas-dev#39189)

0abe192

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: remove tests(pandas-dev#39189)

1b959ee

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: add whats new (pandas-dev#39189)

4de349f

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: check tests (pandas-dev#39189)

d7a3ef6

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: update tests (pandas-dev#39189)

eb8cd5a

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: update after precommit (pandas-dev#39189)

235d05e

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021

BUG: update after precommit (pandas-dev#39189)

73541ff

nofarm3 mentioned this issue Jan 26, 2021

Read hdf returns unexpected values for categorical #39420

Merged

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 30, 2021

BUG: change test location (pandas-dev#39189)

017e47a

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 30, 2021

BUG: remove import (pandas-dev#39189)

d67ff95

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 30, 2021

BUG: remove import (pandas-dev#39189)

37eef60

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 30, 2021

BUG: remove list() before sorted() (pandas-dev#39189)

b3565af

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 30, 2021

BUG: remove list() in sorted() (pandas-dev#39189)

5af7c04

jreback modified the milestones: Contributions Welcome, 1.3 Jan 30, 2021

jreback closed this as completed in #39420 Jan 30, 2021

orionbelt mentioned this issue Feb 25, 2024

BUG: queries on categorical string columns in HDFStore.select() return unexpected results #57608

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

BUG: queries on categorical string columns in read_hdf return unexpected results #39189

BUG: queries on categorical string columns in read_hdf return unexpected results #39189

travispavletich commented Jan 15, 2021 •

edited

Loading

INSTALLED VERSIONS

jake-arkinstall commented Jan 22, 2021

Uh oh!

nofarm3 commented Jan 22, 2021

Uh oh!

Uh oh!

BUG: queries on categorical string columns in read_hdf return unexpected results #39189

BUG: queries on categorical string columns in read_hdf return unexpected results #39189

Comments

travispavletich commented Jan 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jake-arkinstall commented Jan 22, 2021

Uh oh!

nofarm3 commented Jan 22, 2021

Uh oh!

travispavletich commented Jan 15, 2021 •

edited

Loading

Output of `pd.show_versions()`