ENH: support parquet's enum type using Categorical when (de)serializing #58799

Closed
1 of 3 tasks
kiaradlf opened this issue May 21, 2024 · 3 comments
Labels: Categorical, Enhancement, IO Parquet

Comments

@kiaradlf

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The Parquet format supports an ENUM type, which appears to be a fairly good match for pandas' Categorical dtype.
As such, it would be nice if pandas' to_parquet and read_parquet methods took this into account, so that this information is not lost in (de)serialization cycles.

Feature Description

Have to_parquet and read_parquet automatically convert between Pandas Categorical and Parquet ENUM.

Alternative Solutions

Doing custom casts back to categorical after deserializing.
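
A minimal sketch of such a manual cast, assuming the set of allowed values is known in advance. The column name `color` and the category list are hypothetical, and the DataFrame stands in for the result of a `read_parquet` call:

```python
import pandas as pd

# Stand-in for a frame returned by pd.read_parquet, where the
# enum-like column arrives as plain strings (hypothetical data).
df = pd.DataFrame({"color": ["red", "green", "red"]})

# Declare the allowed values explicitly so that categories which do not
# occur in this particular file are still part of the dtype.
color_dtype = pd.CategoricalDtype(categories=["red", "green", "blue"])
df["color"] = df["color"].astype(color_dtype)

print(df["color"].dtype)
```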

Additional Context

as suggested in #25448 (comment)

@kiaradlf added the Enhancement and Needs Triage labels May 21, 2024
@rhshadrach
Member

cc @jorisvandenbossche for any thoughts.

@rhshadrach added the Categorical and IO Parquet labels May 24, 2024
@jorisvandenbossche
Member

For reference, the ENUM logical type is described as:

ENUM annotates the binary primitive type and indicates that the value was converted from an enumerated type in another data model (e.g. Thrift, Avro, Protobuf). Applications using a data model lacking a native enum type should interpret ENUM annotated field as a UTF-8 encoded string.

So as a starter, I don't think it would be a good match for our categorical dtype. The ENUM seems to annotate a column for which the actual values are stored as variable length binary data. The categorical dtype in pandas is under the hood represented as an array of integer indices (pointing to a set of unique categories). Such integers are much more efficient to store than the materialized binary data.
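
To illustrate the representation described above, here is a small sketch (the example values are made up) showing the integer codes and the unique categories behind a pandas Categorical:

```python
import pandas as pd

# Hypothetical example values.
cat = pd.Categorical(["red", "green", "red"])

# The unique categories are stored once...
print(list(cat.categories))
# ...and each element is stored as a small integer index into them.
print(list(cat.codes), cat.codes.dtype)
```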

Secondly, pandas uses PyArrow (and the Parquet C++ implementation that PyArrow provides bindings for) to read and write Parquet files, but AFAIK Parquet C++ does not really support the ENUM logical type: on read it is accepted but simply read as normal binary data, and from Python I don't think you can actually write it.

So certainly for writing I wouldn't use ENUM, and I think with pyarrow it's also not actually possible. On the reading side, read_parquet will read it as general binary data. But if you directly want to read it as categorical dtype in pandas, pyarrow does support reading binary data as "dictionary encoded" in arrow, which will then translate to categorical dtype in pandas. See the read_dictionary keyword in https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html. You can pass this keyword to pd.read_parquet and it will be passed through to pyarrow.

@rhshadrach
Member

Thanks @jorisvandenbossche - closing.

@rhshadrach removed the Needs Triage label May 27, 2024

3 participants