Performance question for to_arrow, to_pandas, to_duckdb #1032
Comments
Hi @jkleinkauff, that's indeed an interesting observation. I have some follow-up questions to help us understand it better.
As a point of comparison, I just ran a scan using to_arrow against a table made up of 63 Parquet files of about 5.5 MB each. It returned in 6 seconds. I'd imagine a table with fewer files would take less time to return (although the limit function here should ensure that we aren't even reading the parquet files past the first one). Your observation that limit 1~100 took similar times makes sense to me as well: if you have 100+ MB files, you are going to have to download the same amount of data regardless in order to return the limited result.
There's a nontrivial cost in reading metadata files in Iceberg.
This will show the number of manifest files: https://py.iceberg.apache.org/api/#manifests
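For reference, a minimal sketch of checking the manifest count through the inspect API linked above (the catalog configuration and table name are placeholders):

```python
from pyiceberg.catalog import load_catalog

# Placeholder catalog/table names; adjust to your environment.
catalog = load_catalog("default")
table = catalog.load_table("bronze.curitiba_starts_june")

manifests = table.inspect.manifests()  # returns a pyarrow Table
print(manifests.num_rows, "manifest file(s)")
print(manifests.to_pandas())
```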
Hey, thank you for taking the time to answer me!

I mean, probably it's just my network? Even with limit=1, it seems the scan is returning both files (just an observation, maybe it's intended):

```python
df = table.scan(limit=1)
# pa_table = df.to_arrow()
[print(task.file.file_path) for task in df.plan_files()]
# s3://xxx/xxx/curitiba_starts_june/data/00000-0-6984da88-fe64-4765-9137-739072becfb1.parquet
# s3://xxx/xxx/curitiba_starts_june/data/00000-0-1de29b8f-2e8c-4543-9663-f769d53b17b7.parquet
```

Output of table.inspect.manifests().to_pandas():

```
   content                                      path  length  partition_spec_id  ...  added_delete_files_count  existing_delete_files_count  deleted_delete_files_count  partition_summaries
0        0  s3://xxx/bronze/curitiba_starts_june...    10433                  0  ...                         0                            0                           0                   []
1        0  s3://xxx/bronze/curitiba_starts_june...    10430                  0  ...                         0                            0                           0                   []

[2 rows x 12 columns]
```

```python
table = catalog.load_table("bronze.curitiba_starts_june")
df = table.scan(limit=100)
# print(table.inspect.manifests().to_pandas())
pa_table = df.to_arrow()
```

I can also share the files or a direct link to my files. Thank you!
Okay, this doesn't look like an issue with reading many metadata files. I wonder if the time is being spent reading the parquet files themselves. I think comparing against a raw parquet read will give us some insights about read performance in pyiceberg. For reading raw parquet files, you can do something like this:
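(The original snippet didn't survive this copy; a minimal sketch of timing a raw parquet read with pyarrow, with the S3 path as a placeholder and credentials assumed to be available in the environment, might look like this.)

```python
import time

import pyarrow.parquet as pq

# Placeholder: point this at one of the table's data files.
path = "s3://xxx/bronze/curitiba_starts_june/data/<data-file>.parquet"

start = time.time()
tbl = pq.read_table(path)
print(f"pyarrow read {tbl.num_rows} rows in {time.time() - start:.1f} s")
```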
Hi @kevinjqliu, thank you for your time! These are my findings. I also included the read_parquet method from awswrangler; I don't know why, but it's by far the fastest method (the call I used looks roughly like the sketch after this list):

- reading raw parquet file with awswrangler
- reading raw parquet file with pyarrow
- reading entire iceberg table, without limits
- reading iceberg table, with limit of 1
- reading iceberg table with duckdb
- reading iceberg table with duckdb, with limit of 1
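(A sketch of the awswrangler comparison; the path is a placeholder, and wr.s3.read_parquet returns a pandas DataFrame.)

```python
import time

import awswrangler as wr

# Placeholder: same data file as in the pyarrow comparison above.
path = "s3://xxx/bronze/curitiba_starts_june/data/<data-file>.parquet"

start = time.time()
df = wr.s3.read_parquet(path)
print(f"awswrangler read {len(df)} rows in {time.time() - start:.1f} s")
```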
Thanks for looking into the different scenarios. It looks like the results vary depending on the engine.

Read Path

I took a deeper look into the read path for PyIceberg and discovered a missing optimization. Given an Iceberg table with 2 files, the important note is that the unit of work is per FileScanTask, which means per parquet file read. The limit is checked only after each future completes, so at least 1 parquet file will be read entirely before the limit is applied; the limit is never pushed down to the parquet file read level, even when a limit of 1 is requested. The potential optimization here is to push the limit down to the parquet file read level.

Unify implementations

As a side note, there might be an opportunity to unify the implementation details of the different read paths.
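To make the missing optimization concrete, here is a rough sketch (not PyIceberg's actual code) of what pushing the limit down to a single parquet file read could look like, using pyarrow's batched reader:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def read_file_with_limit(path: str, limit: int) -> pa.Table:
    """Stop reading a parquet file once `limit` rows have been collected,
    instead of materializing the whole file first."""
    pf = pq.ParquetFile(path)
    batches = []
    rows = 0
    for batch in pf.iter_batches():
        batches.append(batch)
        rows += batch.num_rows
        if rows >= limit:
            break  # skip the remaining batches/row groups
    return pa.Table.from_batches(batches, schema=pf.schema_arrow).slice(0, limit)
```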
@kevinjqliu that's awesome! Thank you so much!
@jkleinkauff Without profiling it's hard to say for sure, but I can make a bet based on my experience with object storage clients, and fsspec implementations in particular:

This is one thing that will have a big impact, but there are likely other differences too since, as you can see, awswrangler is not using the fsspec s3fs implementation and has its own wrappers for I/O over S3. There's also a lot of stuff I see in awswrangler that is similar to some of the optimizations pyarrow makes in its native blob storage filesystem implementations; maybe the fsspec s3fs implementation supports some of these too, but I'm less familiar with the current state of that project. At least historically, a lot of the defaults were tuned for random access, and not really the best for maximizing throughput in columnar workloads; pyiceberg probably should not be using all defaults.

I believe pyiceberg also provides FileIO implementations that use pyarrow native filesystems instead of fsspec. I am curious what you would find if you benchmark that. There may also be defaults there that pyiceberg should be tuning, but I suspect they will still be better tuned for the kinds of reads pyiceberg is doing.
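If you want to try the pyarrow-backed FileIO, a sketch of forcing it through catalog properties (the catalog name and S3 settings here are placeholders, not values from this thread) might look like:

```python
from pyiceberg.catalog import load_catalog

# "py-io-impl" selects the FileIO implementation; the other properties
# are placeholders for whatever your catalog needs.
catalog = load_catalog(
    "default",
    **{
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.region": "us-east-1",
    },
)
table = catalog.load_table("bronze.curitiba_starts_june")
pa_table = table.scan(limit=100).to_arrow()
```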
I was also surprised by the performance difference. It's hard for me to say unless I look into the implementation details (in awswrangler/s3/_read_parquet.py). There's definitely room for optimizations on the PyIceberg side. If you look at another engine like daft, which is optimized for reading parquet on S3, that's a good target for potential performance gains. On the PyIceberg side, there's a future opportunity to integrate with iceberg-rust, which might speed up reading files.
There are two FileIO implementations, fsspec and pyarrow. In the case above, I believe pyarrow is used, since it's preferred over fsspec (source). It looks like the pyarrow default buffer size is 1 MB (iceberg-python/pyiceberg/io/pyarrow.py, line 417 at f05b1ae).
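If that 1 MB default turns out to matter, it may be tunable through the FileIO properties. A sketch of such an experiment, assuming a buffer-size property (the exact property key is an assumption; check pyiceberg.io for the current name):

```python
from pyiceberg.catalog import load_catalog

# Hypothetical tuning experiment: bump the read buffer from the 1 MB
# default to 8 MB. The "buffer-size" property key is an assumption.
catalog = load_catalog(
    "default",
    **{
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "buffer-size": str(8 * 1024 * 1024),
    },
)
table = catalog.load_table("bronze.curitiba_starts_june")
pa_table = table.scan(limit=100).to_arrow()
```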
@sungwy @kevinjqliu I'm really enjoying this discussion and learning a ton from it. I would love to keep it going, but feel free to close it since this is not an issue. Thank you folks!
Thanks for reporting this. I learned a lot from exploring this thread, and we have some solid improvements coming up. Please let us know if anything else comes up!
Question
Hey, thanks for this very convenient library.
This is not a bug; I just want to better understand something.
I have a question regarding the performance (i.e., the time it takes to query the table) of these methods.
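(The original code snippet didn't survive this copy; based on the rest of the thread, it was roughly the following, with the catalog configuration and table name as placeholders.)

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # placeholder catalog configuration
table = catalog.load_table("bronze.curitiba_starts_june")

df = table.scan(limit=100)
pa_table = df.to_arrow()  # this is the slow step
```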
The code above will run fine. My question is about the last command: the to_arrow() transformation takes around 60 s (give or take) to execute. I believe this is mostly because of the network itself?
The execution time stays roughly the same with different row limits (1, 10, 100).
Querying the same table in MotherDuck, using iceberg_scan, is faster:

When running the same query locally, without MotherDuck but still using duckdb, the execution time is similar to what pyiceberg takes; actually, it is a bit slower. That's why I think this is mostly a network "issue". Can you help me understand what's happening? Thank you!
Table Data
The table has two parquet files (110 MB and 127 MB).