TableScan filters to PyArrow DNF format #1130

jdye64 · 2023-05-02T15:30:09Z

This PR adds support for getting the filters of a LogicalPlan::TableScan instance in the DNF format that PyArrow expects for predicate pushdown. Note that this PR only supports Expr::InList for now.

An example of calling this functionality from Python:

# Rust table_scan instance handle obtained from LogicalPlan instance, "rel"
table_scan = rel.table_scan()
dnf_filters = table_scan.getDNFFilters()

# dnf_filters is of type `PyFilteredResult` from Rust. It contains both the filters that
# PyArrow DNF cannot express and the ones that it can.
none_filterable_exprs = dnf_filters.io_unfilterable_exprs()
filterable_exprs = dnf_filters.filtered_exprs() # Tuple of (str, str, [str]) Ex: ('id', 'in', ['Int32(1)', 'Int32(2)'])

codecov-commenter · 2023-05-02T15:40:10Z

Codecov Report

Merging #1130 (33e022a) into main (3859629) will increase coverage by 0.11%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #1130      +/-   ##
==========================================
+ Coverage   81.42%   81.53%   +0.11%     
==========================================
  Files          78       78              
  Lines        4392     4392              
  Branches      796      796              
==========================================
+ Hits         3576     3581       +5     
+ Misses        639      630       -9     
- Partials      177      181       +4

see 1 file with indirect coverage changes

sarahyurick

Thanks @jdye64 ! Left a few comments, changes generally look good to me.

dask_planner/src/sql/logical/table_scan.rs

sarahyurick

LGTM

ayushdg

Playing around with the example a bit, we end up getting filterable_exprs that looks something like:

('df.a', 'in', ['Int64(1)', 'Int64(2)', 'Int64(3)', 'Int64(4)', 'Int64(5)'])

when passing this down to the reader we'd see the following issues:
df.a: the reader doesn't understand this and is probably expecting a. I can just split on . on the Python side, but I was curious whether there's something better than the canonical name that would give the column name.

'Int64(1)': this doesn't map to anything. Is it possible to return the literal values in this case? If not, we would probably need to handle this on the Python side.
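For reference, the Python-side fallback mentioned above could look roughly like the sketch below. The helper names (`strip_qualifier`, `parse_literal`, `normalize_filter`) are hypothetical and not part of dask-sql or PyArrow, and the handling of literal strings other than the `Int64(1)` form shown above (for example `Utf8("x")` or `Boolean(true)`) is an assumption about how they would be printed.

```python
import re

# Matches strings such as 'Int64(1)'. Spellings other than the Int64 form
# seen in this thread are assumptions.
_LITERAL = re.compile(r"^(?P<type>[A-Za-z]+\d*)\((?P<value>.*)\)$")


def strip_qualifier(column):
    """Drop a leading table qualifier: 'df.a' -> 'a'.

    Caveat: a column whose own name contains '.' would be truncated.
    """
    return column.rsplit(".", 1)[-1]


def parse_literal(text):
    """Turn a stringified literal into a plain Python value when possible."""
    match = _LITERAL.match(text)
    if match is None:
        return text  # not in the expected shape; leave untouched
    type_name, value = match.group("type"), match.group("value")
    if value == "NULL":
        return None
    if type_name.startswith(("Int", "UInt")):
        return int(value)
    if type_name.startswith("Float"):
        return float(value)
    if type_name == "Boolean":
        return value == "true"
    if type_name in ("Utf8", "LargeUtf8"):
        return value.strip('"')
    return text  # unknown type; leave untouched


def normalize_filter(flt):
    """Normalize one (column, op, [values]) tuple for the reader."""
    column, op, values = flt
    return (strip_qualifier(column), op, [parse_literal(v) for v in values])


raw = ("df.a", "in", ["Int64(1)", "Int64(2)", "Int64(3)"])
print(normalize_filter(raw))
```

Doing this in Python works, but it re-parses information the Rust side already has, which is why returning real literal values from Rust would be preferable.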

jdye64 · 2023-05-03T15:27:02Z

Playing around with the example a bit, we end up getting filterable_exprs that looks something like:
('df.a', 'in', ['Int64(1)', 'Int64(2)', 'Int64(3)', 'Int64(4)', 'Int64(5)'])
when passing this down to the reader we'd see the following issues:
df.a: the reader doesn't understand this and is probably expecting a. I can just split on . on the Python side, but I was curious whether there's something better than the canonical name that would give the column name.

'Int64(1)': this doesn't map to anything. Is it possible to return the literal values in this case? If not, we would probably need to handle this on the Python side.

I can easily make the change from df.a to a. I had purposely introduced the "Int64(" part for literals, thinking that would help you determine whether it was a string or an integer type. Since Rust can't handle generic types the way Python does, I cannot directly return a "wildcard", but I can return either a PyExpr instance there or, probably better, a PyAny object. Let me give that a try here in a bit.

jdye64 · 2023-05-04T12:38:02Z

I need some pieces from #1085 to complete this so waiting on that to merge first.

ayushdg

Might also be worth adding a few tests on the rust side to ensure things are working as expected.

dask_planner/src/sql/logical/table_scan.rs

jdye64 · 2023-05-05T15:57:52Z

@ayushdg I had Rust tests before. I would have loved to keep them, but now that we are using the GIL in these methods, there seems to be a bug in PyO3 that prevents me from doing so. Again, this issue only occurs when using the Python GIL in Rust tests. Take a look at this FAQ page. I tried all of their recommendations, but none of them worked; I kept hitting linker errors, so the project fails to build if I include the tests. https://pyo3.rs/v0.18.3/faq

Long story short, I don't see a way we can add those Rust tests until that issue is resolved on the PyO3 side.

ayushdg · 2023-05-08T19:48:12Z

dask_planner/src/sql/logical/table_scan.rs

+        filters
+            .iter()
+            .for_each(|f| match PyTableScan::_expand_dnf_filter(f, py) {
+                Ok(mut expanded_dnf_filter) => filtered_exprs.append(&mut expanded_dnf_filter),


Just leaving this in as a note for follow-up work:
The approach of appending to the filtered list of exprs wouldn't return a DNF for more complex cases; right now it is limited to exprs that are ANDed with each other.
e.g. a in (1,2,3,4,5) or b=5 would not get expanded by this logic.
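To make the limitation concrete, here is a small pure-Python sketch of how a DNF filter is interpreted: the outer list is an OR over groups, and each inner list is an AND over (column, op, value) tuples, which is the nested list-of-lists shape PyArrow accepts for `filters`. The `matches` evaluator and the sample row are illustrative only and are not how PyArrow or dask-sql evaluate filters internally.

```python
import operator

# Comparison operators in the spelling used by PyArrow-style DNF tuples.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def matches(row, dnf):
    """Return True if `row` satisfies `dnf` (outer list = OR, inner list = AND)."""
    return any(
        all(OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in dnf
    )


# What flat appending produces: a single AND group.
flat_and = [[("a", "in", [1, 2, 3, 4, 5]), ("b", "==", 5)]]

# What `a in (1,2,3,4,5) or b=5` needs: two OR-ed groups.
true_dnf = [[("a", "in", [1, 2, 3, 4, 5])], [("b", "==", 5)]]

row = {"a": 9, "b": 5}
print(matches(row, flat_and))  # False: the row fails the `a in (...)` term
print(matches(row, true_dnf))  # True: the row satisfies the `b == 5` group
```

A row that should pass the OR predicate is dropped when both terms are appended into one AND group, so pushing down the flat list for an OR expression would silently filter out valid rows.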

jdye64 added 3 commits April 12, 2023 20:29

Merge remote-tracking branch 'upstream/main'

fa18499

Merge remote-tracking branch 'upstream/main'

d6b470c

Add methods for converting table_scan filters into pyarrow DNF format

82fd9de

jdye64 requested review from ayushdg, charlesbluca and galipremsagar as code owners May 2, 2023 15:30

ayushdg requested a review from sarahyurick May 2, 2023 17:12

ayushdg reviewed May 2, 2023

dask_planner/src/sql/logical/table_scan.rs (outdated, resolved)

dask_planner/src/sql/logical/table_scan.rs (outdated, resolved)

jdye64 added 2 commits May 2, 2023 14:08

Add check for not in / negated and also explicit checks for expr identifying column variants

0112804

Add checks to ensure InList expr types are of column or literal variant type

18e713c

sarahyurick reviewed May 2, 2023

dask_planner/src/sql/logical/table_scan.rs (resolved)

dask_planner/src/sql/logical/table_scan.rs (outdated, resolved)

Updated to use Expr::InList in rust test

d617538

jdye64 requested review from sarahyurick and ayushdg May 2, 2023 21:20

Merge branch 'main' into tbl_scan_dnf

d6b3891

sarahyurick approved these changes May 2, 2023

ayushdg reviewed May 3, 2023

Merge branch 'main' into tbl_scan_dnf

128ad18

jdye64 added 3 commits May 4, 2023 13:55

stage commit

2e1178d

merge with upstream/main

da263c3

Logic for converting most used ScalarValue types to PyObject(s)

8843417

ayushdg requested changes May 4, 2023

dask_planner/src/sql/logical/table_scan.rs (outdated, resolved)

dask_planner/src/sql/logical/table_scan.rs (outdated, resolved)

jdye64 and others added 2 commits May 5, 2023 12:19

Add exception if ScalarValue is not of a certain type

26384b5

Merge branch 'main' into tbl_scan_dnf

33e022a

ayushdg mentioned this pull request May 8, 2023

Extend attempt_predicate_pushdown to accept additional filters as arguments #1138

Closed

ayushdg reviewed May 8, 2023

ayushdg approved these changes May 8, 2023

ayushdg merged commit e25abb9 into dask-contrib:main May 8, 2023

ayushdg mentioned this pull request May 9, 2023

[do-not-merge] Pass additional dnf filters in table_scan #1140

Closed
