Commit 7a9cc23

BUG: Fix pandas-dev#57608: queries on categorical string columns in HDFStore.select() return unexpected results

In Selection.__init__() (pandas/io/pytables.py), self.terms.evaluate() did not return the correct value for the where condition. The cause was in BinOp.convert_value() (pandas/core/computation/pytables.py), where searchsorted() did not return the correct index when matching the where condition against the metadata (categories table). Replacing searchsorted() with np.where() resolves the issue.
1 parent bc24e84 commit 7a9cc23
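
The difference between the two lookups named in the commit message can be sketched outside pandas. This is a minimal illustration, not the pandas code path itself: `numpy.ndarray.searchsorted` performs a binary search and therefore assumes sorted input, while the categories used in the new test (`name`, `longname`, `verylongname`) are kept in declaration order. The helper names `lookup_searchsorted` and `lookup_where` are hypothetical and exist only for this example.

```python
import numpy as np

# Categories in declaration order, as in the new test; NOT lexicographically sorted.
metadata = np.array(["name", "longname", "verylongname"])


def lookup_searchsorted(categories, value):
    """Old approach: binary search, only meaningful if `categories` is sorted."""
    return int(categories.searchsorted(value, side="left"))


def lookup_where(categories, value):
    """New approach: index of the first exact match, independent of ordering."""
    return int(np.where(categories == value)[0][0])


# Print both lookups side by side for every category.
for value in metadata:
    print(value, lookup_searchsorted(metadata, value), lookup_where(metadata, value))
```

On a sorted category array both helpers agree. They can diverge only when the categories are not in sorted order, which is consistent with the wrong-row results this commit targets.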

3 files changed, +22 -1 lines changed

doc/source/whatsnew/v3.0.0.rst

+1
@@ -742,6 +742,7 @@ I/O
 - Bug in :meth:`read_stata` where the missing code for double was not recognised for format versions 105 and prior (:issue:`58149`)
 - Bug in :meth:`set_option` where setting the pandas option ``display.html.use_mathjax`` to ``False`` has no effect (:issue:`59884`)
 - Bug in :meth:`to_excel` where :class:`MultiIndex` columns would be merged to a single row when ``merge_cells=False`` is passed (:issue:`60274`)
+- Bug in :meth:`HDFStore.select` causing queries on categorical string columns to return unexpected results (:issue:`57608`)
 
 Period
 ^^^^^^

pandas/core/computation/pytables.py

+2 -1
@@ -239,7 +239,8 @@ def stringify(value):
             if conv_val not in metadata:
                 result = -1
             else:
-                result = metadata.searchsorted(conv_val, side="left")
+                # Find the index of the first match of conv_val in metadata
+                result = np.where(metadata == conv_val)[0][0]
             return TermValue(result, result, "integer")
         elif kind == "integer":
             try:

pandas/tests/io/pytables/test_store.py

+19
@@ -1106,3 +1106,22 @@ def test_store_bool_index(tmp_path, setup_path):
     df.to_hdf(path, key="a")
     result = read_hdf(path, "a")
     tm.assert_frame_equal(expected, result)
+
+
+@pytest.mark.parametrize("model", ['name', 'longname', 'verylongname'])
+def test_select_categorical_string_columns(tmp_path, model):
+    # Corresponding to BUG: 57608
+
+    path = tmp_path / "test.h5"
+
+    models = pd.api.types.CategoricalDtype(categories=['name', 'longname', 'verylongname'])
+    df = pd.DataFrame({'modelId': ['name', 'longname', 'longname'],
+                       'value': [1, 2, 3]}).astype({'modelId': models, 'value': int})
+
+    with pd.HDFStore(path, 'w') as store:
+        store.append('df', df, data_columns=['modelId'])
+
+    with pd.HDFStore(path, 'r') as store:
+        result = store.select('df', 'modelId == model')
+        expected = df[df['modelId'] == model]
+        tm.assert_frame_equal(result, expected)
