-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add support for excluding the index from Parquet files (GH20768) #22266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add support for excluding the index from Parquet files (GH20768) #22266
Conversation
Hello @dargueta! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 21, 2018 at 18:00 Hours UTC |
Good start! Going to need tests as well as a cc @jreback |
I won't have much time until later today but yeah, I'll finish that off! |
@gfyoung I'm not sure I completely understand what's going on in the unit tests. I think I've written a test that handles the expected cases for both engines but I'm not sure. I've never contributed to Pandas before. 😆 |
Do you have any idea what's causing these seemingly unrelated tests to fail on Travis? Like, my code should not be failing because "snappy compression is unavailable." The failing appveyor test seems to be a bit more relevant but I don't fully understand it. It seems to not like the multi-index when read back even though I'm explicitly excluding it from consideration:
These tests pass just fine on CircleCI, which is weird. |
Just glancing at the Travis failures on 3.6 they seem related - looks like there is a difference in the name of the index before / after ( |
So why isn't it failing on CircleCI? Isn't it the same code? Also, my changes should not be causing |
@dargueta : Good question! Not quite. The coverage for each build / platform is slightly different. That's why we use multiple platforms, for better or worse. 🙂 |
Okay, good to know. Is there a straightforward way to ignore the index when loading it back, considering we're deliberately not writing it? |
Remind me: why can't you pass in an argument for |
Ignore that, I misunderstood the question. I think I fixed the issue, but other tests are still failing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already looking good!
Concerning the failures related to index names: one possible cause is the write_index
keyword for fastparquet. Now you added that in the fastparquet.write
call with a default of True
. However, its default is not True in fastparquet, so this actually changes behaviour: https://github.com/dask/fastparquet/blob/8db811cc4701d5ae100a8e5c95685daec9e24c6b/fastparquet/writer.py#L793-L795
Only not fully sure what the best solution is here. We could restore the current behaviour (not write index by default for a default index, i.e. write_index=None
), but it would also be nice to have it consistent between the different engines.
The problem is, I suppose we could have three values, |
Codecov Report
@@ Coverage Diff @@
## master #22266 +/- ##
==========================================
+ Coverage 92.17% 92.18% +<.01%
==========================================
Files 169 169
Lines 50778 50781 +3
==========================================
+ Hits 46807 46810 +3
Misses 3971 3971
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests seem to be passing now!
Last of the (so far) requested fixes are in! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is there a None
option here?
|
I c. This makes fp not idempotent though, which is not a great situation here. I wonder if should make |
Do you mean that it would not round-trip faithfully? |
is there some magic metadata that it does? The problem is that this round-trip is only good for fp-fp? I think we should standardize on the pandas side with |
That's technically an abrupt backwards-incompatible change for anyone using Personally, for minimal disruption I believe we should release this feature with |
@jreback and @jorisvandenbossche is there anything else you think needs to be addressed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
@jreback any further comments?
We can always discuss later if we want to change the default of index
to always be True or to something else (eg use the fastparquet logic ourselves)
can you rebase |
@jreback rebase is done |
@dargueta Thanks a lot! |
git diff upstream/master -u -- "*.py" | flake8 --diff