ENH: support parquet's enum type using Categorical when (de)serializing #58799

kiaradlf · 2024-05-21T07:35:44Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

The Parquet format supports an ENUM type, which appears to be a fairly good match for pandas' Categorical dtype.
As such, it would be nice if pandas' to_parquet and read_parquet methods took this into account, so that this information is not lost in (de)serialization cycles.

Feature Description

Have to_parquet and read_parquet automatically convert between Pandas Categorical and Parquet ENUM.

Alternative Solutions

Doing custom casts back to categorical after deserializing.

Additional Context

as suggested in #25448 (comment)


rhshadrach · 2024-05-24T03:15:23Z

cc @jorisvandenbossche for any thoughts.

jorisvandenbossche · 2024-05-24T07:41:04Z

For reference, the ENUM logical type is described as:

ENUM annotates the binary primitive type and indicates that the value was converted from an enumerated type in another data model (e.g. Thrift, Avro, Protobuf). Applications using a data model lacking a native enum type should interpret ENUM annotated field as a UTF-8 encoded string.

So as a starter, I don't think it would be a good match for our categorical dtype. The ENUM seems to annotate a column for which the actual values are stored as variable length binary data. The categorical dtype in pandas is under the hood represented as an array of integer indices (pointing to a set of unique categories). Such integers are much more efficient to store than the materialized binary data.
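To make the representation difference concrete, here is a small illustration (the example values are made up) of how a pandas categorical keeps the unique categories once and stores integer codes per row:

```python
import pandas as pd

# Hypothetical example values.
colors = pd.Series(["red", "green", "red", "blue", "red"], dtype="category")

# The unique categories are stored once ...
print(colors.cat.categories.tolist())

# ... and each row is a small integer index into that list.
print(colors.cat.codes.tolist())
print(colors.cat.codes.dtype)
```

An ENUM-annotated Parquet column, by contrast, is described in the spec as annotating binary values, with no comparable index array at the logical-type level.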

But also, secondly, pandas uses PyArrow (and the Parquet C++ implementation that pyarrow provides bindings for) to read/write parquet files, but AFAIK Parquet C++ does not really support the ENUM logical type (on read, it will support it but it just reads it as normal binary data; and from python you can't actually write it I think).

So certainly for writing I wouldn't use ENUM, and I think with pyarrow it's also not actually possible. On the reading side, read_parquet will read it as general binary data. But if you directly want to read it as categorical dtype in pandas, pyarrow does support reading binary data as "dictionary encoded" in arrow, which will then translate to categorical dtype in pandas. See the read_dictionary keyword in https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html. You can pass this keyword to pd.read_parquet and it will be passed through to pyarrow.

rhshadrach · 2024-05-27T12:58:12Z

Thanks @jorisvandenbossche - closing.

kiaradlf added the Enhancement and Needs Triage labels May 21, 2024

rhshadrach added the Categorical and IO Parquet labels May 24, 2024

kiaradlf mentioned this issue May 24, 2024

[Parquet][Python] writing/reading parquet enum types from pyarrow apache/arrow#41811

Open

rhshadrach closed this as completed May 27, 2024

rhshadrach removed the Needs Triage label May 27, 2024
