Add support for orc format #790
Conversation
MehulBatra commented Jun 3, 2024 (edited)
- Ability to read ORC file format based Iceberg tables
- Ability to create ORC file format based Iceberg tables
- Unit tests
- Integration tests for reads
- Integration tests for writes
Hi @Fokko and @HonahX, I would love some guidance on the following:

I believe we need to make a change in this write_file method to support ORC writes, since, as the link shows, it is what ultimately gets called on the user's behalf. Please correct me if I am going in the wrong direction: iceberg-python/pyiceberg/io/pyarrow.py, line 1788 in a110368
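For illustration, here is a rough sketch (not the PR's actual code) of how a write_file-style function might dispatch on the configured file format; the real write_file in pyiceberg has a richer signature, and the file_format argument here stands in for the table's format property:

```python
# Hedged sketch only: route an Arrow table to an ORC or Parquet writer
# depending on the table's configured data file format.
import pyarrow as pa
import pyarrow.orc as orc
import pyarrow.parquet as pq

def write_data_file(arrow_table: pa.Table, file_path: str, file_format: str) -> None:
    # file_format would be derived from the table's format property;
    # Parquet remains the default when nothing is configured.
    if file_format.lower() == "orc":
        orc.write_table(arrow_table, file_path)
    else:
        pq.write_table(arrow_table, file_path)
```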
Hi @MehulBatra. Thanks for taking this! It looks like a great start.
Yes, I think this is the right place to add the ORC write logic. As you have noticed, the data file format is controlled by the table property write.format.default. We can add the property in iceberg-python/pyiceberg/table/__init__.py, lines 206 to 211 in c4feda5,
and document it here: iceberg-python/mkdocs/docs/configuration.md, lines 53 to 63 in 94e8a98.
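For context, a minimal sketch of what setting that property could look like from the user's side (the catalog name and table identifier are made up for illustration):

```python
# Hedged sketch: create a table whose data files default to ORC by
# setting the write.format.default table property.
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField

schema = Schema(NestedField(field_id=1, name="id", field_type=LongType(), required=True))
catalog = load_catalog("default")  # hypothetical catalog name
table = catalog.create_table(
    "db.orc_table",  # hypothetical identifier
    schema=schema,
    properties={"write.format.default": "orc"},
)
```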
The ORC write logic itself can go next to the Parquet write path in iceberg-python/pyiceberg/io/pyarrow.py, lines 1674 to 1698 in e61ef57 (we can make statistics collection a follow-up feature, since most statistics fields are optional).
Thanks @HonahX for the feedback, I will take all of this into account moving forward!
I've added some comments for the read side.
We may try to merge the read support first and make write support a separate PR. WDYT?
if (field.metadata and (field_id_str := field.metadata.get(PYARROW_PARQUET_FIELD_ID_KEY)))
else None
)
if field.metadata and (field_id_str := field.metadata.get(PYARROW_ORC_FIELD_ID_KEY)):
We may want to add a doc somewhere to mention that ORC read support requires pyarrow >= 13.0.0, since the ORC metadata is only exposed in the pyarrow schema starting with 13.0.0: apache/arrow#35304
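Until that doc lands, a guard like this sketch could fail fast (the message and placement are illustrative, not part of the PR):

```python
# Hedged sketch of a runtime guard for the pyarrow >= 13.0.0 requirement.
import pyarrow as pa

if tuple(int(part) for part in pa.__version__.split(".")[:2]) < (13, 0):
    raise NotImplementedError("ORC read support requires pyarrow >= 13.0.0")
```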
Also, shall we check PYARROW_PARQUET_FIELD_ID_KEY first, since Parquet is the default file format?
That seems like the right approach: I will check Parquet first, then ORC, and will make the necessary doc changes when releasing ORC reads; it will definitely help users get started. Thanks for the feedback!
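A minimal sketch of that ordering (the helper name is made up, and the two key values are assumptions based on how pyiceberg and the Iceberg spec label field IDs):

```python
# Hedged sketch: resolve the Iceberg field ID from pyarrow field metadata,
# trying the Parquet key first (the default format) and falling back to ORC.
from typing import Optional

import pyarrow as pa

PYARROW_PARQUET_FIELD_ID_KEY = b"PARQUET:field_id"  # assumed value
PYARROW_ORC_FIELD_ID_KEY = b"iceberg.id"  # assumed value

def _get_field_id(field: pa.Field) -> Optional[int]:
    if not field.metadata:
        return None
    for key in (PYARROW_PARQUET_FIELD_ID_KEY, PYARROW_ORC_FIELD_ID_KEY):
        if field_id_str := field.metadata.get(key):
            return int(field_id_str)
    return None
```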
if primitive.unit == "ns":
    if primitive.tz == "UTC":
        return TimestamptzType()
Nanosecond timestamps were added in format version 3, which is still under development and not formally adopted. PyIceberg does not support nanoseconds yet, so I think we should not add the conversion here (in the future it should map to a separate TimestampNanoType: https://iceberg.apache.org/spec/#primitive-types).

I think you added this because Arrow reads ORC timestamps as nanoseconds, since ORC's timestamp type always carries nanosecond information, but we want to read them as microseconds.
It seems the Java side currently just treats ORC's timestamp type as a us-unit one.
We could probably fix this at the Arrow schema level. For example, we could add an additional conversion for physical_schema here to change the unit of the Arrow timestamps from ns to us:
iceberg-python/pyiceberg/io/pyarrow.py, lines 987 to 989 in f782c54:

fragment = arrow_format.make_fragment(fin)
physical_schema = fragment.physical_schema
file_schema = pyarrow_to_schema(physical_schema, name_mapping)
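A minimal sketch of such a conversion, assuming a hypothetical helper applied right after physical_schema is obtained (nested types are left out for brevity):

```python
# Hedged sketch: downcast ns-unit timestamp fields in a pyarrow schema to
# us-unit before converting it to an Iceberg schema.
import pyarrow as pa

def _downcast_ns_timestamps(schema: pa.Schema) -> pa.Schema:
    fields = []
    for field in schema:
        if pa.types.is_timestamp(field.type) and field.type.unit == "ns":
            # Keep the timezone (if any); only the unit changes.
            field = field.with_type(pa.timestamp("us", tz=field.type.tz))
        fields.append(field)
    return pa.schema(fields, metadata=schema.metadata)
```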
Changing the physical schema also ensures that the actual timestamp data is read with us unit, as required by TimestampType:
iceberg-python/pyiceberg/io/pyarrow.py, lines 1002 to 1008 in f782c54:

fragment_scanner = ds.Scanner.from_fragment(
    fragment=fragment,
    schema=physical_schema,
    # This will push down the query to Arrow.
    # But in case there are positional deletes, we have to apply them first
    filter=pyarrow_filter if not positional_deletes else None,
    columns=[col.name for col in file_project_schema.columns],
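Putting the two together, the hypothetical helper above would be threaded through both call sites (fragment, name_mapping, and ds come from the quoted code):

```python
# Hedged sketch: thread the downcast schema through both the schema
# conversion and the fragment scanner so data is materialized as us.
physical_schema = _downcast_ns_timestamps(fragment.physical_schema)
file_schema = pyarrow_to_schema(physical_schema, name_mapping)
fragment_scanner = ds.Scanner.from_fragment(
    fragment=fragment,
    schema=physical_schema,  # us-unit timestamps now
)
```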
However, I also feel this is not the ultimate solution, because we assume the unit is microseconds. Once TimestampNanoType is in, we may need some additional steps to ensure we read the data using the correct unit.
@MehulBatra @Fokko Would love to hear your thoughts on this. Please correct me if I make any mistakes about ORC's behavior.
Yes @HonahX, you got it right: Arrow reads the ORC timestamp unit as ns, which is why I added a schema conversion at the primitive-type level even though I was aware that we will support nanoseconds in the coming version 3. Your suggestion makes sense until we support version 3 altogether, so I will try to incorporate the changes to read ns as us.

@Fokko, what do you think? Are we good with this for the time being?
@HonahX Works for me, and I believe it will also help the community get unblocked at least on the read side; meanwhile we can grind on the write support. I have already started on the write-support changes and will raise a separate PR for them, attached to #20.