[feature] Table Scan should take into account the table's sort order #1637
Comments
@kevinjqliu I would like to work on this, but I need a bit of clarification on what exactly to do. |
hey @iyad-f sure thing. Iceberg has the concept of sort order https://iceberg.apache.org/spec/#sorting I think there are two components to this.
Let me know if that's clear. Happy to chat more |
Keep in mind that the sort order is not a global sort order; it applies at the file level. This is very nice if you do joins (using the sort-merge strategy), since it can avoid an additional sort. It also plays a huge role in compression: with run-length encoding, sorted values can be encoded very efficiently. Since we have all the transforms in, I think it would be a good time to check if we can implement sort order on the write side of things. |
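To illustrate the compression point above, here is a minimal sketch using only the Python standard library. It is not PyIceberg or Parquet code; `run_length_encode` is a hypothetical helper written for this example, and the column values are made up. It only shows why sorted data yields fewer, longer runs.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# Made-up column values, purely for illustration.
unsorted_column = ["b", "a", "b", "c", "a", "b", "a", "c"]
sorted_column = sorted(unsorted_column)

unsorted_runs = run_length_encode(unsorted_column)
sorted_runs = run_length_encode(sorted_column)

# Fewer runs means fewer (value, count) pairs to store.
print(len(unsorted_runs), "runs unsorted vs", len(sorted_runs), "runs sorted")
```

In the unsorted column no two neighbours are equal, so every value becomes its own run; after sorting, each distinct value collapses into a single run.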
If this is still unassigned I'll take it. seems fun :) |
sure @gabeiglio feel free to tag me for review :) |
IIUC this is already happening, since _InclusiveMetricsEvaluator checks the lower and upper bounds of each file, so we would only be looking for a way to optimize the data scan at the Arrow level? @kevinjqliu |
It does, but I don't think sort order is currently applied on the read side at the metadata level. I haven't dug into the process yet, so bear with me. I would assume that setting a sort order on a table would allow us to skip even more data files. Currently the evaluator looks at each data file and evaluates its lower/upper bounds. But if the column is sorted, we can apply a binary search. I'm not sure how this is done on the Spark/Java side, but I'm definitely interested to learn more.
Yeah, I wonder if PyArrow already allows us to pass in a sort order. |
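A rough sketch of the binary-search idea from the comment above, in plain Python. Everything here is hypothetical: `FileBounds`, `linear_prune`, and `bisect_prune` are not PyIceberg APIs. It also assumes the files' value ranges do not overlap and that the list is ordered by lower bound, which a file-level sort order alone does not guarantee.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class FileBounds:
    """Hypothetical stand-in for a data file's column statistics."""
    path: str
    lower: int
    upper: int


def linear_prune(files, lo, hi):
    """Check every file's bounds against the range [lo, hi]."""
    return [f for f in files if f.upper >= lo and f.lower <= hi]


def bisect_prune(files, lo, hi):
    """Binary-search variant.

    ASSUMPTION: files are ordered by lower bound and their ranges do not
    overlap, so the upper bounds are ascending as well.
    """
    uppers = [f.upper for f in files]
    lowers = [f.lower for f in files]
    start = bisect_left(uppers, lo)   # first file whose upper bound >= lo
    stop = bisect_right(lowers, hi)   # one past the last file whose lower bound <= hi
    return files[start:stop]


files = [
    FileBounds("a.parquet", 0, 9),
    FileBounds("b.parquet", 10, 19),
    FileBounds("c.parquet", 20, 29),
    FileBounds("d.parquet", 30, 39),
]

matches = bisect_prune(files, 12, 25)
print([f.path for f in matches])
```

Both functions return the same files for this input. Note that the sketch rebuilds the `uppers` and `lowers` lists on each call, which is itself linear; a real saving would need those bounds kept in sorted form ahead of time.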
Feature Request / Improvement
From slack: https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1739184019493269
This should probably be in plan_files (iceberg-python/pyiceberg/table/__init__.py, lines 1485 to 1547 in d47970b).