The Parquet format supports an ENUM type, which appears to be a fairly good match for pandas' Categorical dtype.
As such, it would be nice if pandas' to_parquet and read_parquet methods took this into account, so that this information would not be lost across (de)serialization round trips.
Feature Description
Have to_parquet and read_parquet automatically convert between Pandas Categorical and Parquet ENUM.
Alternative Solutions
Doing custom casts to categorical after deserializing.
For reference, the ENUM logical type is described as:
ENUM annotates the binary primitive type and indicates that the value was converted from an enumerated type in another data model (e.g. Thrift, Avro, Protobuf). Applications using a data model lacking a native enum type should interpret ENUM annotated field as a UTF-8 encoded string.
So for a start, I don't think ENUM would be a good match for our categorical dtype. ENUM seems to annotate a column whose actual values are stored as variable-length binary data. The categorical dtype in pandas is represented under the hood as an array of integer indices (pointing to a set of unique categories). Such integers are much more efficient to store than the materialized binary data.
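To illustrate the point about the in-memory representation: a small sketch showing that a pandas Categorical stores integer codes plus a separate array of unique categories (the values below are just an illustrative example).

```python
import pandas as pd

# A categorical series with repeated string values
s = pd.Series(["low", "high", "low", "medium"], dtype="category")

# Under the hood, pandas stores small integer codes per row...
print(s.cat.codes.tolist())      # [1, 0, 1, 2]

# ...pointing into one array of unique categories (sorted by default)
print(list(s.cat.categories))    # ['high', 'low', 'medium']
```

This is why dictionary encoding on the Parquet/Arrow side is a closer fit for Categorical than the ENUM annotation, which describes how the binary values themselves should be interpreted.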
But also, secondly, pandas uses pyarrow (and the Parquet C++ implementation that pyarrow provides bindings for) to read/write Parquet files, and AFAIK Parquet C++ does not really support the ENUM logical type (on read it is accepted, but the values are simply read as plain binary data; and from Python I don't think you can actually write it).
So certainly for writing I wouldn't use ENUM, and I think with pyarrow it's also not actually possible. On the reading side, read_parquet will read such a column as general binary data. But if you want to read it directly as categorical dtype in pandas, pyarrow does support reading binary data as "dictionary encoded" in Arrow, which then translates to categorical dtype in pandas. See the read_dictionary keyword in https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html. You can pass this keyword to pd.read_parquet and it will be passed through to pyarrow.
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Additional Context
as suggested in #25448 (comment)