TableScan filters to PyArrow DNF format #1130
Codecov Report
@@            Coverage Diff             @@
##             main    #1130      +/-   ##
==========================================
+ Coverage   81.42%   81.53%   +0.11%
==========================================
  Files          78       78
  Lines        4392     4392
  Branches      796      796
==========================================
+ Hits         3576     3581       +5
+ Misses        639      630       -9
- Partials      177      181       +4
see 1 file with indirect coverage changes
Thanks @jdye64! Left a few comments; the changes generally look good to me.
LGTM
Playing around with the example a bit, we end up getting filterable_exprs that look something like:
('df.a', 'in', ['Int64(1)', 'Int64(2)', 'Int64(3)', 'Int64(4)', 'Int64(5)'])
When passing this down to the reader, we'd see the following issues:
df.a: the reader doesn't understand this and is probably expecting just a. I can split on "." on the Python side, but was curious whether there's something better than the canonical name that would give the column name.
'Int64(1)': doesn't map to anything. Is it possible to return the literal values in this case? If not, we would probably need to handle this on the Python side.
I can easily make the change for
I need some pieces from #1085 to complete this, so waiting on that to merge first.
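The Python-side handling mentioned above (dropping the table qualifier from df.a and turning 'Int64(1)' back into a literal) could look roughly like the sketch below. The helper names strip_qualifier, parse_literal and normalize_filter are hypothetical and not part of this PR. It assumes literals arrive as strings of the form TypeName(value), only converts integer and float types, and would mis-handle column names that themselves contain a dot.

```python
import re

# Hypothetical helpers for illustration only; none of these names exist in the PR.
# Assumes literals are rendered as "<TypeName>(<value>)", e.g. "Int64(1)".
_LITERAL_RE = re.compile(r"^(?P<type>[A-Za-z0-9]+)\((?P<value>.*)\)$")


def strip_qualifier(name):
    """Drop a leading table qualifier: 'df.a' -> 'a'."""
    return name.rsplit(".", 1)[-1]


def parse_literal(text):
    """Convert 'Int64(1)' -> 1 and 'Float64(1.5)' -> 1.5; leave anything else as-is."""
    match = _LITERAL_RE.match(text)
    if match is None:
        return text
    type_name, raw = match.group("type"), match.group("value")
    if type_name.startswith(("Int", "UInt")):
        return int(raw)
    if type_name.startswith("Float"):
        return float(raw)
    # Other types (strings, dates, ...) are not handled in this sketch.
    return raw


def normalize_filter(expr):
    """Rewrite one (column, op, values) tuple into plain names and literals."""
    column, op, values = expr
    return (strip_qualifier(column), op, [parse_literal(v) for v in values])


raw_filter = ("df.a", "in", ["Int64(1)", "Int64(2)", "Int64(3)", "Int64(4)", "Int64(5)"])
clean_filter = normalize_filter(raw_filter)
```

Returning real literal values from the Rust side, as asked above, would make the string parsing unnecessary; the sketch only covers the fallback case.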
Might also be worth adding a few tests on the Rust side to ensure things are working as expected.
@ayushdg I had Rust tests before and would have loved to keep them, but now that we are using the GIL in these methods there seems to be a bug in PyO3 that prevents me from doing so. The issue only occurs when using the Python GIL in Rust tests. Take a look at this FAQ page: https://pyo3.rs/v0.18.3/faq. I tried all of their recommendations, but none of them worked; I kept hitting linker errors, so the project fails to build if I include the tests. Long story short, I don't see a way we can add those Rust tests until that issue is resolved on the PyO3 side.
filters
    .iter()
    .for_each(|f| match PyTableScan::_expand_dnf_filter(f, py) {
        Ok(mut expanded_dnf_filter) => filtered_exprs.append(&mut expanded_dnf_filter),
Just leaving this in as a note for follow-up work:
The approach of appending to the filtered list of exprs wouldn't return a valid DNF for more complex cases; right now it is limited to exprs that are ANDed with each other.
E.g. a in (1,2,3,4,5) or b=5 would not get expanded by this logic.
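To make the limitation in the note above concrete: in the DNF form that PyArrow's filters argument accepts, a flat list of tuples is a single AND group, while a list of lists is an OR of AND groups. The sketch below is a small pure-Python evaluator written only to show the difference; the matches function is hypothetical, handles just the "in" and "=" operators, and is not how PyArrow evaluates filters.

```python
# Flat list: every predicate must hold (a in (1,2,3,4,5) AND b = 5).
and_only = [("a", "in", [1, 2, 3, 4, 5]), ("b", "=", 5)]

# List of lists: any inner group may hold (a in (1,2,3,4,5) OR b = 5).
with_or = [[("a", "in", [1, 2, 3, 4, 5])], [("b", "=", 5)]]


def matches(row, dnf):
    """Evaluate a DNF filter against one row given as a dict (illustrative only)."""
    # Treat a flat list of tuples as a single AND group.
    groups = dnf if dnf and isinstance(dnf[0], list) else [dnf]

    def check(column, op, value):
        if op == "in":
            return row[column] in value
        if op in ("=", "=="):
            return row[column] == value
        raise ValueError(f"operator not covered by this sketch: {op}")

    # OR across groups, AND within each group.
    return any(all(check(*pred) for pred in group) for group in groups)


row = {"a": 9, "b": 5}
flat_result = matches(row, and_only)   # a is not in the list, so the AND fails
nested_result = matches(row, with_or)  # b = 5 satisfies the second OR group
```

Appending every expanded predicate to one flat list, as the current logic does, always produces the first shape, which is why an OR between predicates cannot be expressed yet.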
This PR adds support for getting the filters of a LogicalPlan::TableScan instance in the DNF format that PyArrow expects for performing predicate pushdown. Note that this PR only adds support for Expr::InList for now.
An example of calling this functionality from Python goes like this