Add support for orc format #790
@@ -168,6 +168,7 @@
 ICEBERG_SCHEMA = b"iceberg.schema"
 # The PARQUET: in front means that it is Parquet specific, in this case the field_id
 PYARROW_PARQUET_FIELD_ID_KEY = b"PARQUET:field_id"
+PYARROW_ORC_FIELD_ID_KEY = b"iceberg.id"
 PYARROW_FIELD_DOC_KEY = b"doc"
 LIST_ELEMENT_NAME = "element"
 MAP_KEY_NAME = "key"
@@ -627,6 +628,8 @@ def expression_to_pyarrow(expr: BooleanExpression) -> pc.Expression:
 def _get_file_format(file_format: FileFormat, **kwargs: Dict[str, Any]) -> ds.FileFormat:
     if file_format == FileFormat.PARQUET:
         return ds.ParquetFileFormat(**kwargs)
+    elif file_format == FileFormat.ORC:
+        return ds.OrcFileFormat()
     else:
         raise ValueError(f"Unsupported file format: {file_format}")
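For context, a minimal usage sketch (not code from this PR) of how the formats returned by `_get_file_format` plug into `pyarrow.dataset`; the file path is illustrative:

```python
import pyarrow.dataset as ds

# Unlike ds.ParquetFileFormat, ds.OrcFileFormat takes no scan-option kwargs,
# which is why the ORC branch above drops **kwargs.
fmt = ds.OrcFileFormat()
table = ds.dataset("data/00000-0-demo.orc", format=fmt).to_table()
```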
@@ -799,11 +802,12 @@ def primitive(self, primitive: pa.DataType) -> T:
 
 
 def _get_field_id(field: pa.Field) -> Optional[int]:
-    return (
-        int(field_id_str.decode())
-        if (field.metadata and (field_id_str := field.metadata.get(PYARROW_PARQUET_FIELD_ID_KEY)))
-        else None
-    )
+    if field.metadata and (field_id_str := field.metadata.get(PYARROW_ORC_FIELD_ID_KEY)):
+        return int(field_id_str.decode())
+    elif field.metadata and (field_id_str := field.metadata.get(PYARROW_PARQUET_FIELD_ID_KEY)):
+        return int(field_id_str.decode())
+    else:
+        return None
 
 
 class _HasIds(PyArrowSchemaVisitor[bool]):
@@ -912,6 +916,9 @@ def primitive(self, primitive: pa.DataType) -> PrimitiveType:
                     return TimestamptzType()
                 elif primitive.tz is None:
                     return TimestampType()
+            if primitive.unit == "ns":
+                if primitive.tz == "UTC":
+                    return TimestamptzType()
Comment on lines +919 to +921

Nanosecond timestamps were added in format version 3, which is still under development and not formally adopted. PyIceberg does not support nanoseconds yet, so I think we should not add the conversion here (and it should be converted to a separate …).

I think you added this because Arrow reads ORC timestamps as nanoseconds, since ORC's timestamp type always contains nanosecond information, but we want to read it as microseconds. It seems the Java side currently just treats ORC's timestamp type as ….

We could probably fix this at the Arrow schema level. For example, we can add an additional conversion for … (iceberg-python/pyiceberg/io/pyarrow.py, lines 987 to 989 in f782c54).

Changing the physical schema also ensures that the actual timestamp data is read with us unit, as required by TimestampType (iceberg-python/pyiceberg/io/pyarrow.py, lines 1002 to 1008 in f782c54).

However, I also feel it is not the ultimate solution, because we assume the unit is microsecond. When … @MehulBatra @Fokko, would love to hear your thoughts on this. Please correct me if I make any mistakes about ORC's behavior.

Yes @HonahX, you got it right: Arrow is reading the ORC timestamp unit as nanoseconds.
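A minimal sketch of the schema-level conversion suggested above, assuming we simply rewrite `timestamp[ns]` fields as `timestamp[us]` before the schema is converted to an Iceberg schema (the helper name is hypothetical, and nested types would need a recursive variant):

```python
import pyarrow as pa

def _coerce_ns_to_us(schema: pa.Schema) -> pa.Schema:
    # Replace any top-level timestamp[ns] field with timestamp[us],
    # keeping the timezone, nullability, and field metadata intact.
    fields = []
    for field in schema:
        if pa.types.is_timestamp(field.type) and field.type.unit == "ns":
            field = pa.field(
                field.name,
                pa.timestamp("us", tz=field.type.tz),
                nullable=field.nullable,
                metadata=field.metadata,
            )
        fields.append(field)
    return pa.schema(fields)
```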
         elif pa.types.is_binary(primitive) or pa.types.is_large_binary(primitive):
             return BinaryType()
         elif pa.types.is_fixed_size_binary(primitive):
@@ -972,8 +979,11 @@ def _task_to_table(
     name_mapping: Optional[NameMapping] = None,
 ) -> Optional[pa.Table]:
     _, _, path = PyArrowFileIO.parse_location(task.file.file_path)
-    arrow_format = ds.ParquetFileFormat(pre_buffer=True, buffer_size=(ONE_MEGABYTE * 8))
     with fs.open_input_file(path) as fin:
+        if task.file.file_format == FileFormat.PARQUET:
+            arrow_format = ds.ParquetFileFormat(pre_buffer=True, buffer_size=(ONE_MEGABYTE * 8))
+        if task.file.file_format == FileFormat.ORC:
+            arrow_format = ds.OrcFileFormat()  # currently ORC doesn't support any fragment scan options
         fragment = arrow_format.make_fragment(fin)
         physical_schema = fragment.physical_schema
         file_schema = pyarrow_to_schema(physical_schema, name_mapping)
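As a sketch of this per-file read path (assumptions: a local `.orc` path stands in for the opened file object, and this snippet is illustration, not PR code):

```python
import pyarrow.dataset as ds

arrow_format = ds.OrcFileFormat()
fragment = arrow_format.make_fragment("data/sample.orc")  # path is illustrative
physical_schema = fragment.physical_schema  # carries the iceberg.id field metadata
table = fragment.to_table()
```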
We may want to add a doc somewhere to mention that ORC read support requires pyarrow >= 13.0.0, since the ORC metadata is exposed in the pyarrow schema starting with 13.0.0 (apache/arrow#35304).

Also, shall we check PYARROW_PARQUET_FIELD_ID_KEY first, since Parquet is the default file format?

Seems like the right approach. I will promote Parquet to the top, then ORC, and will make the necessary doc changes when releasing ORC read; it will definitely help users get started. Thanks for the feedback.
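A hypothetical version guard along those lines (not part of this PR; the function name is made up):

```python
import pyarrow as pa

def _check_pyarrow_orc_support() -> None:
    # ORC field metadata is only exposed in the pyarrow schema from 13.0.0.
    major = int(pa.__version__.split(".")[0])
    if major < 13:
        raise ValueError(
            f"Reading ORC files requires pyarrow>=13.0.0, found {pa.__version__}"
        )
```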