Skip to content

TableScan filters to PyArrow DNF format #1130

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
May 8, 2023
Merged

Conversation

jdye64
Copy link
Collaborator

@jdye64 jdye64 commented May 2, 2023

This PR adds the support to get the filters for a LogicalPlan::TableScan instance in the DNF format that is expected by PyArrow for performing predicate pushdown. Note this PR only adds support for Expr::InList for now.

An example of calling this functionality from Python goes like this

# Rust table_scan instance handle obtained from LogicalPlan instance, "rel"
table_scan = rel.table_scan()
dnf_filters = table_scan.getDNFFilters()

# dnf_filters is of type `PyFilteredResult` from Rust. Contains the filters that are not supported by PyArrow DNF
# and also ones that can.
none_filterable_exprs = dnf_filters.io_unfilterable_exprs()
filterable_exprs = dnf_filters.filtered_exprs() # Tuple of (str, str, [str]) Ex: ('id', 'in', ['Int32(1)', 'Int32(2)'])

@codecov-commenter
Copy link

codecov-commenter commented May 2, 2023

Codecov Report

Merging #1130 (33e022a) into main (3859629) will increase coverage by 0.11%.
The diff coverage is n/a.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##             main    #1130      +/-   ##
==========================================
+ Coverage   81.42%   81.53%   +0.11%     
==========================================
  Files          78       78              
  Lines        4392     4392              
  Branches      796      796              
==========================================
+ Hits         3576     3581       +5     
+ Misses        639      630       -9     
- Partials      177      181       +4     

see 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ayushdg ayushdg requested a review from sarahyurick May 2, 2023 17:12
Copy link
Collaborator

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jdye64 ! Left a few comments, changes generally look good to me.

@jdye64 jdye64 requested review from sarahyurick and ayushdg May 2, 2023 21:20
Copy link
Collaborator

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Playing around with the example a bit, we end up getting filterable_exprs that looks something like:

('df.a', 'in', ['Int64(1)', 'Int64(2)', 'Int64(3)', 'Int64(4)', 'Int64(5)'])

when passing this down to the reader we'd see the following issues:
df.a the reader doesn't understand this and is probably expecting a. I can just split by . on the python side but was curious if there's something better than canonical name that would give the column name.

'Int64(1)' Doesn't map to anything. Is it possible to return the literal values in this case? If not we would probably need to handle this on the python side.

@jdye64
Copy link
Collaborator Author

jdye64 commented May 3, 2023

Playing around with the example a bit, we end up getting filterable_exprs that looks something like:

('df.a', 'in', ['Int64(1)', 'Int64(2)', 'Int64(3)', 'Int64(4)', 'Int64(5)'])

when passing this down to the reader we'd see the following issues: df.a the reader doesn't understand this and is probably expecting a. I can just split by . on the python side but was curious if there's something better than canonical name that would give the column name.

'Int64(1)' Doesn't map to anything. Is it possible to return the literal values in this case? If not we would probably need to handle this on the python side.

I can easily make the change for df.a to a. I had purposely introduced the "Int64(" part for literals thinking that would assist you in determining if it was a string or a integer type. Since Rust can't handle generic types like Python I cannot directly return a "wildcard" but I can return either a PyExpr instance their OR and probably better can return a PyAny object. Let me give that a try here in a few.

@jdye64
Copy link
Collaborator Author

jdye64 commented May 4, 2023

I need some pieces from #1085 to complete this so waiting on that to merge first.

Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might also be worth adding a few tests on the rust side to ensure things are working as expected.

@jdye64
Copy link
Collaborator Author

jdye64 commented May 5, 2023

@ayushdg had Rust tests before. Would have loved to have kept it but now that we using the GIL in these methods it seems there is a bug in PyO3 that is preventing me for doing so. Once again this issue only occurs when using the Python GIL in Rust tests. Take a look at this FAQ page. I tried all of their recommendations but none of them worked. Still kept hitting linker errors so the project fails to build if I include them. https://pyo3.rs/v0.18.3/faq

Long story short I don't see a way we can add those Rust tests until that issue is resolved on the PyO3 side.

Comment on lines +138 to +141
filters
.iter()
.for_each(|f| match PyTableScan::_expand_dnf_filter(f, py) {
Ok(mut expanded_dnf_filter) => filtered_exprs.append(&mut expanded_dnf_filter),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leaving this in as a note for followup work:
The approach of appending to the filtered list of expr's wouldn't return a dnf for more complex cases and right now is only limited to expr's that are and with each other.
eg: a in (1,2,3,4,5) or b=5 would not get expanded by this logic.

@ayushdg ayushdg merged commit e25abb9 into dask-contrib:main May 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants