Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: queries on categorical string columns in read_hdf return unexpected results #39189

Closed
2 of 3 tasks
travispavletich opened this issue Jan 15, 2021 · 2 comments · Fixed by #39420
Closed
2 of 3 tasks
Assignees
Labels
Bug Categorical Categorical Data Type IO HDF5 read_hdf, HDFStore
Milestone

Comments

@travispavletich
Copy link

travispavletich commented Jan 15, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

h5_path = 'test.h5'
df = pd.DataFrame({'col': ['a', 'b', 's']})
categorical_values = list(sorted(list(df.col.unique()))) # sorted because of https://github.com/pandas-dev/pandas/issues/16623
max_widths = {'col': 1} # will also set this column to be a "data column"
df.col = df.col.astype('category')
df.col.cat.set_categories(categorical_values, inplace=True)
df.to_hdf(h5_path, 'main', mode='a', format='table', append=True, min_itemsize=max_widths)

read_df = pd.read_hdf(h5_path, where='col == "q"') #returns df with index=2, col = s 
h5_path = 'test.h5'
df = pd.DataFrame({'col': ['Word', 'Test']})
categorical_values = list(sorted(list(df.col.unique())))
max_widths = {'col': 4}
df.col = df.col.astype('category')
df.col.cat.set_categories(categorical_values, inplace=True)
df.to_hdf(h5_path, 'main', mode='a', format='table', append=True, min_itemsize=max_widths)

read_df = pd.read_hdf(h5_path, where='col == "W"') #returns df with index=0, col=Word
read_df = pd.read_hdf(h5_path, where='col == "T"') # returns empty df (as expected)

Problem description

Using the where clause for on disk hdf queries appears to give incorrect results sometimes. From what I have tested, this appears to only happen for columns that are both string based and categoricals. This is important because the output is completely inaccurate and makes this feature mostly unusable for these column types. I should note that I have not seen issues with querying for values that are present in the dataframe however.

Expected Output

For all read_hdf calls, the expected output is an empty dataframe.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.7.9.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-514.21.2.el7.x86_64
Version : #1 SMP Tue Jun 20 12:24:47 UTC 2017
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.0.post20201006
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@travispavletich travispavletich added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 15, 2021
@jake-arkinstall
Copy link

Can confirm. When I query my datetime index on a 3 million row table, I get different numbers of results every time.

                               DateTime  ...  wom
index                                    ...     
2020-12-07 00:00:01 2020-12-07 00:00:01  ...    1
2020-12-07 00:04:53 2020-12-07 00:04:53  ...    1
2020-12-07 00:15:45 2020-12-07 00:15:45  ...    1
2020-12-07 00:27:02 2020-12-07 00:27:02  ...    1
2020-12-07 00:32:22 2020-12-07 00:32:22  ...    1
...                                 ...  ...  ...
2021-01-21 15:45:33 2021-01-21 15:45:33  ...    3
2021-01-21 15:48:30 2021-01-21 15:48:30  ...    3
2021-01-21 15:52:57 2021-01-21 15:52:57  ...    3
2021-01-21 15:56:41 2021-01-21 15:56:41  ...    3
2021-01-21 16:00:01 2021-01-21 16:00:01  ...    3

[40328 rows x 167 columns]
>>> pandas.read_hdf(store, "updates", where="index > '2020-12-07'")
                               DateTime  ...  wom
index                                    ...     
2020-12-07 00:00:01 2020-12-07 00:00:01  ...    1
2020-12-07 00:04:53 2020-12-07 00:04:53  ...    1
2020-12-07 00:15:45 2020-12-07 00:15:45  ...    1
2020-12-07 00:27:02 2020-12-07 00:27:02  ...    1
2020-12-07 00:32:22 2020-12-07 00:32:22  ...    1
...                                 ...  ...  ...
2021-01-21 15:45:33 2021-01-21 15:45:33  ...    3
2021-01-21 15:48:30 2021-01-21 15:48:30  ...    3
2021-01-21 15:52:57 2021-01-21 15:52:57  ...    3
2021-01-21 15:56:41 2021-01-21 15:56:41  ...    3
2021-01-21 16:00:01 2021-01-21 16:00:01  ...    3

[39649 rows x 167 columns]
>>> pandas.read_hdf(store, "updates", where="index > '2020-12-07'")
                               DateTime  ...  wom
index                                    ...     
2020-12-07 00:00:01 2020-12-07 00:00:01  ...    1
2020-12-07 00:04:53 2020-12-07 00:04:53  ...    1
2020-12-07 00:15:45 2020-12-07 00:15:45  ...    1
2020-12-07 00:27:02 2020-12-07 00:27:02  ...    1
2020-12-07 00:32:22 2020-12-07 00:32:22  ...    1
...                                 ...  ...  ...
2021-01-21 15:07:41 2021-01-21 15:07:41  ...    3
2021-01-21 15:08:21 2021-01-21 15:08:21  ...    3
2021-01-21 15:09:09 2021-01-21 15:09:09  ...    3
2021-01-21 15:10:01 2021-01-21 15:10:01  ...    3
2021-01-21 15:10:49 2021-01-21 15:10:49  ...    3

[22559 rows x 167 columns]
>>> pandas.read_hdf(store, "updates", where="index > '2020-12-07'")
                               DateTime  ...  wom
index                                    ...     
2020-12-07 00:00:01 2020-12-07 00:00:01  ...    1
2020-12-07 00:04:53 2020-12-07 00:04:53  ...    1
2020-12-07 00:15:45 2020-12-07 00:15:45  ...    1
2020-12-07 00:27:02 2020-12-07 00:27:02  ...    1
2020-12-07 00:32:22 2020-12-07 00:32:22  ...    1
...                                 ...  ...  ...
2021-01-21 15:45:33 2021-01-21 15:45:33  ...    3
2021-01-21 15:48:30 2021-01-21 15:48:30  ...    3
2021-01-21 15:52:57 2021-01-21 15:52:57  ...    3
2021-01-21 15:56:41 2021-01-21 15:56:41  ...    3
2021-01-21 16:00:01 2021-01-21 16:00:01  ...    3

[41686 rows x 167 columns]
>>> pandas.read_hdf(store, "updates", where="index > '2020-12-07'")
                               DateTime  ...  wom
index                                    ...     
2020-12-07 00:00:01 2020-12-07 00:00:01  ...    1
2020-12-07 00:04:53 2020-12-07 00:04:53  ...    1
2020-12-07 00:15:45 2020-12-07 00:15:45  ...    1
2020-12-07 00:27:02 2020-12-07 00:27:02  ...    1
2020-12-07 00:32:22 2020-12-07 00:32:22  ...    1
...                                 ...  ...  ...
2021-01-21 15:45:33 2021-01-21 15:45:33  ...    3
2021-01-21 15:48:30 2021-01-21 15:48:30  ...    3
2021-01-21 15:52:57 2021-01-21 15:52:57  ...    3
2021-01-21 15:56:41 2021-01-21 15:56:41  ...    3
2021-01-21 16:00:01 2021-01-21 16:00:01  ...    3

[41492 rows x 167 columns]

Also strange is that I can't read this HDF with the filename anymore. I need to use a HDFStore - trying to use pandas.read_hdf(filename, key, ...) gives an error saying that it's unable to write - but I'm trying to read.

>>> df = pandas.read_hdf("data.hdf5", "updates", start=-54975)
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/pandas/io/pytables.py", line 425, in read_hdf
    return store.select(
  File "/usr/lib/python3.9/site-packages/pandas/io/pytables.py", line 845, in select
    return it.get_result()
  File "/usr/lib/python3.9/site-packages/pandas/io/pytables.py", line 1914, in get_result
    self.close()
  File "/usr/lib/python3.9/site-packages/pandas/io/pytables.py", line 1890, in close
    self.store.close()
  File "/usr/lib/python3.9/site-packages/pandas/io/pytables.py", line 717, in close
    self._handle.close()
  File "/usr/lib/python3.9/site-packages/tables/file.py", line 2723, in close
    self.root._f_close()
  File "/usr/lib/python3.9/site-packages/tables/group.py", line 933, in _f_close
    self._g_close_descendents()
  File "/usr/lib/python3.9/site-packages/tables/group.py", line 897, in _g_close_descendents
    node_manager.close_subtree(self._v_pathname)
  File "/usr/lib/python3.9/site-packages/tables/file.py", line 538, in close_subtree
    self._close_nodes(paths, cache.pop)
  File "/usr/lib/python3.9/site-packages/tables/file.py", line 513, in _close_nodes
    node._f_close()
  File "/usr/lib/python3.9/site-packages/tables/table.py", line 2961, in _f_close
    self.flush()
  File "/usr/lib/python3.9/site-packages/tables/table.py", line 2898, in flush
    rowsadded = self.flush_rows_to_index(_lastrow=True)
  File "/usr/lib/python3.9/site-packages/tables/table.py", line 2521, in flush_rows_to_index
    rowsadded = self._add_rows_to_index(
  File "/usr/lib/python3.9/site-packages/tables/table.py", line 2550, in _add_rows_to_index
    index.append_last_row(
  File "/usr/lib/python3.9/site-packages/tables/index.py", line 734, in append_last_row
    larr, arr, idx = self.initial_append(xarr, nrows, reduction)
  File "/usr/lib/python3.9/site-packages/tables/index.py", line 626, in initial_append
    self._v_attrs.is_csi = False
  File "/usr/lib/python3.9/site-packages/tables/attributeset.py", line 466, in __setattr__
    nodefile._check_writable()
  File "/usr/lib/python3.9/site-packages/tables/file.py", line 2167, in _check_writable
    raise FileModeError("the file is not writable")
tables.exceptions.FileModeError: the file is not writable

Also on pandas 1.2.0, tables 3.6.1, but on python 3.9.1.

@nofarm3
Copy link
Contributor

nofarm3 commented Jan 22, 2021

take

nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 23, 2021
@simonjayhawkins simonjayhawkins added Categorical Categorical Data Type IO HDF5 read_hdf, HDFStore and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 26, 2021
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 26, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 30, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 30, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 30, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 30, 2021
nofarm3 pushed a commit to nofarm3/pandas that referenced this issue Jan 30, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.3 Jan 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type IO HDF5 read_hdf, HDFStore
Projects
None yet
5 participants