BUG: Parquet roundtrip fails with numerical categorical dtype #60491

adrienpacifico · 2024-12-04T12:06:09Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df = df.astype({'A':'category'})
>>> print(df.dtypes)
A    category
B       int64
dtype: object
>>> df.to_parquet('test.parquet')
>>> df_roundtrip = pd.read_parquet('test.parquet')
>>> print(df_roundtrip.dtypes)
A    int64
B    int64
dtype: object
>>> assert df_roundtrip.dtypes.equals(df.dtypes)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

Issue Description

The Parquet roundtrip does not preserve dtypes: a categorical column with integer categories comes back as int64.

Expected Behavior

df_roundtrip should have the same dtypes as df.

Hot-Fix

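import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
# Cast to str first so the categories are strings, which survive the roundtrip.
# Note that the categories are then '1', '2', '3' rather than integers.
df = df.astype({'A': 'str'}).astype({'A': 'category'})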
print(df.dtypes)
df.to_parquet('test.parquet')
df_roundtrip = pd.read_parquet('test.parquet')
print(df_roundtrip.dtypes)

assert df_roundtrip.dtypes.equals(df.dtypes)

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.12.2
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:16:51 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8103
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 2.2.3
numpy : 2.0.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.0
Cython : None
sphinx : None
IPython : 8.24.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.2
dataframe-api-compat : None
fastparquet : 2024.11.0
fsspec : 2024.10.0
html5lib : 1.1
hypothesis : 6.122.1
gcsfs : 2024.10.0
jinja2 : 3.1.4
lxml.etree : 5.3.0
matplotlib : 3.9.3
numba : 0.60.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
pandas_gbq : 0.24.0
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 18.1.0
pyreadstat : 1.2.8
pytest : 8.3.4
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.10.0
scipy : 1.14.1
sqlalchemy : 2.0.36
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.11.0
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : 0.23.0
tzdata : 2024.2
qtpy : 2.4.2
pyqt5 : None

ppsmoraes · 2024-12-05T01:10:28Z

I've tried using the fastparquet engine, and it seems to work. Whatever the problem is, it lies with the way the pyarrow engine reads the Parquet file.

Here is the code example:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df = df.astype({'A':'category'})
print(df.dtypes)
# A    category
# B       int64
# dtype: object
df.to_parquet('test.parquet')

df_roundtrip = pd.read_parquet('test.parquet')
print(df_roundtrip.dtypes)
# A    int64
# B    int64
# dtype: object

df_roundtrip_fp = pd.read_parquet('test.parquet', engine='fastparquet')
print(df_roundtrip_fp.dtypes)
# A    category
# B       int64
# dtype: object

result = df_roundtrip.equals(df)
print(result)
# False

result_fp = df_roundtrip_fp.equals(df)
print(result_fp)
# True

rhshadrach · 2024-12-20T15:17:21Z

Agreed, categorical-integers (and other dtypes) should roundtrip alongside categorical-strings. Further investigations and PRs to fix are welcome!

tsafacjo · 2025-02-01T23:32:19Z

Can I take it?

myenugula · 2025-03-29T18:48:09Z

take

myenugula · 2025-04-05T14:52:56Z

It seems that the core issue is within PyArrow, not pandas.

When a DataFrame with a numeric categorical column is saved to Parquet and read back using the PyArrow engine (the default), the categorical dtype is lost and the column is converted back to a regular numeric type.
The issue is in how PyArrow handles dictionary-encoded columns during Parquet roundtrips. PyArrow recognizes dictionary-encoded string columns as categorical, but doesn't preserve the categorical information for numeric dictionary-encoded columns when reading from Parquet.

The key behavior difference is in how the PyArrow engine handles dictionary encoding during read:

For string categoricals: Dictionary encoding is preserved as categorical dtype
For numeric categoricals: Dictionary encoding is unpacked to numeric type, losing categorical metadata
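
Until this is fixed upstream, one possible workaround is to remember the categorical dtypes before writing and re-apply them after reading. The sketch below is an assumption-laden illustration, not a pandas API: `restore_categoricals` is a hypothetical helper, and the Parquet read is simulated by casting column A back to int64, so the example runs without a Parquet engine.

```python
import pandas as pd


def restore_categoricals(df_read: pd.DataFrame, original_dtypes: pd.Series) -> pd.DataFrame:
    """Re-apply categorical dtypes that were lost in a roundtrip.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    categorical_dtypes = {
        col: dtype
        for col, dtype in original_dtypes.items()
        if isinstance(dtype, pd.CategoricalDtype) and col in df_read.columns
    }
    return df_read.astype(categorical_dtypes)


df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}).astype({"A": "category"})

# Stand-in for the frame returned by pd.read_parquet(...) with the pyarrow
# engine: column A has come back as plain int64.
df_read = df.astype({"A": "int64"})

df_fixed = restore_categoricals(df_read, df.dtypes)
print(df_fixed.dtypes)
```

Unlike the str-based hot-fix, this keeps the categories as integers, at the cost of carrying the original dtypes alongside the file.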

adrienpacifico added the Bug and Needs Triage labels Dec 4, 2024

adrienpacifico changed the title ~~BUG: parquet roundtrip does not work with numerical categorical dtype~~ BUG: Parquet roundtrip fails with numerical categorical dtype Dec 4, 2024

rhshadrach removed the Needs Triage label Dec 20, 2024

rhshadrach added the Categorical and IO Parquet labels Dec 20, 2024

github-actions bot assigned myenugula Mar 29, 2025
