ENH: Add support for excluding the index from Parquet files (GH20768) #22266

Merged
merged 17 commits into from
Sep 21, 2018

dargueta
Contributor

@dargueta dargueta commented Aug 9, 2018

@pep8speaks

pep8speaks commented Aug 9, 2018

Hello @dargueta! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 21, 2018 at 18:00 Hours UTC

@gfyoung gfyoung added the Enhancement and IO Parquet (parquet, feather) labels Aug 9, 2018
@gfyoung
Member

gfyoung commented Aug 9, 2018

Good start! Going to need tests as well as a whatsnew entry (potentially a mini-section).

cc @jreback

@dargueta
Contributor Author

dargueta commented Aug 9, 2018

> Going to need tests as well as a whatsnew entry (potentially a mini-section).

I won't have much time until later today but yeah, I'll finish that off!

@dargueta
Contributor Author

dargueta commented Aug 11, 2018

@gfyoung I'm not sure I completely understand what's going on in the unit tests. I think I've written a test that handles the expected cases for both engines but I'm not sure. I've never contributed to Pandas before. 😆

@dargueta
Contributor Author

dargueta commented Aug 13, 2018

Do you have any idea what's causing these seemingly unrelated tests to fail on Travis? Like, my code should not be failing because "snappy compression is unavailable."

The failing appveyor test seems to be a bit more relevant but I don't fully understand it. It seems to not like the multi-index when read back even though I'm explicitly excluding it from consideration:

E       DataFrame.index classes are not equivalent
E       [left]:  MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
E                  labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])
E       [right]: RangeIndex(start=0, stop=8, step=1)
pandas\util\testing.py:1076: AssertionError

These tests pass just fine on CircleCI, which is weird.

@WillAyd
Member

WillAyd commented Aug 13, 2018

Just glancing at the Travis failures on 3.6, they seem related: it looks like the name of the index differs before and after the round trip (None vs. index).

@dargueta
Contributor Author

So why isn't it failing on CircleCI? Isn't it the same code?

Also, my changes should not be causing TestPythonParser.test_no_header_prefix to suddenly start failing, right? That's another one failing on Travis.

@gfyoung
Member

gfyoung commented Aug 13, 2018

@dargueta : Good question! Not quite. The coverage for each build / platform is slightly different. That's why we use multiple platforms, for better or worse. 🙂

@dargueta
Contributor Author

Okay, good to know. Is there a straightforward way to ignore the index when loading it back, considering we're deliberately not writing it?

@gfyoung
Member

gfyoung commented Aug 13, 2018

Remind me: why can't you pass in an argument for read_kwargs?

@dargueta
Contributor Author

dargueta commented Aug 13, 2018

I meant for testing, not actual production.

This feature was intended to be only for writing, but I suppose I could add it for reading? Not sure how well that'd be supported by the Parquet libraries but I can check. Unless I'm misunderstanding your question?

Ignore that, I misunderstood the question. I think I fixed the issue, but other tests are still failing.
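
For illustration only, and not necessarily the fix applied in this PR: one way to handle the test expectation discussed above is to compare the frame that was read back against the original with its index dropped, rather than against the original itself. The sketch below uses no Parquet engine; `read_back` is a hand-built stand-in for what a write with the index excluded followed by a read would presumably return.

```python
import pandas as pd

# Frame with a MultiIndex shaped like the one in the failing test.
index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]]
)
df = pd.DataFrame({"x": list(range(8))}, index=index)

# Stand-in (assumption) for the result of writing without the index and
# reading the file back: same columns, fresh RangeIndex.
read_back = pd.DataFrame({"x": list(range(8))})

# Compare against the original with its index dropped, not against df itself.
expected = df.reset_index(drop=True)
pd.testing.assert_frame_equal(read_back, expected)
```

Comparing `read_back` against `df` directly would raise an index mismatch of the kind shown in the AppVeyor output.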

Member

@jorisvandenbossche jorisvandenbossche left a comment

Already looking good!

Concerning the failures related to index names: one possible cause is the write_index keyword for fastparquet. Now you added that in the fastparquet.write call with a default of True. However, its default is not True in fastparquet, so this actually changes behaviour: https://github.com/dask/fastparquet/blob/8db811cc4701d5ae100a8e5c95685daec9e24c6b/fastparquet/writer.py#L793-L795

I'm just not fully sure what the best solution is here. We could restore the current behaviour (not writing the index by default when it is a default index, i.e. write_index=None), but it would also be nice to have it consistent between the different engines.

@dargueta
Contributor Author

dargueta commented Aug 14, 2018

> I'm just not fully sure what the best solution is here. We could restore the current behaviour (not writing the index by default when it is a default index, i.e. write_index=None)

The problem is, pyarrow always writes the index. Different engines already have different behavior. So technically this change alters the existing fastparquet behavior.

I suppose we could have three values, None to use the default behavior for the engine, and then True or False for explicit control.
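
A minimal sketch of that three-valued option, assuming a hypothetical helper `index_kwargs` (not a pandas function, and not the merged code). It only builds the keyword arguments each engine would receive: `preserve_index` is a parameter of pyarrow's `Table.from_pandas`, and `write_index` is a parameter of fastparquet's `write`.

```python
from typing import Any, Dict, Optional


def index_kwargs(engine: str, index: Optional[bool] = None) -> Dict[str, Any]:
    """Translate a tri-state ``index`` option into per-engine writer kwargs.

    Hypothetical helper for illustration; not part of pandas.
    """
    if index is not None and not isinstance(index, bool):
        raise ValueError("index must be None, True, or False")
    if engine == "pyarrow":
        # None: pass nothing, so pyarrow keeps its own default behaviour.
        return {} if index is None else {"preserve_index": index}
    if engine == "fastparquet":
        # fastparquet accepts None itself and then decides based on the index.
        return {"write_index": index}
    raise ValueError(f"unknown engine: {engine!r}")
```

With this shape, a call such as `df.to_parquet(path, index=False)` would map to `preserve_index=False` or `write_index=False` depending on the engine, while leaving `index` unset changes nothing for existing users.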

@codecov

codecov bot commented Aug 16, 2018

Codecov Report

Merging #22266 into master will increase coverage by <.01%.
The diff coverage is 90%.

@@            Coverage Diff             @@
##           master   #22266      +/-   ##
==========================================
+ Coverage   92.17%   92.18%   +<.01%     
==========================================
  Files         169      169              
  Lines       50778    50781       +3     
==========================================
+ Hits        46807    46810       +3     
  Misses       3971     3971
Flag        Coverage Δ
#multiple   90.59% <90%> (ø) ⬆️
#single     42.32% <20%> (-0.01%) ⬇️

Impacted Files         Coverage Δ
pandas/core/frame.py   97.19% <ø> (ø) ⬆️
pandas/io/parquet.py   73.72% <90%> (+0.68%) ⬆️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c113db...7dc53a1. Read the comment docs.

Member

@jorisvandenbossche jorisvandenbossche left a comment

Tests seem to be passing now!

@dargueta
Contributor Author

Last of the (so far) requested fixes are in!

Contributor

@jreback jreback left a comment

Why is there a None option here?

@dargueta
Contributor Author

dargueta commented Aug 20, 2018

> Why is there a None option here?

fastparquet doesn't write the index if it's an integer sequence 0-n, but pyarrow always writes the index by default. To preserve backwards compatibility, there's a None option that maintains the original engine-dependent behavior.
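
The engine-dependent rule described here can be modelled in a few lines. This is a sketch based only on the description in this thread, not on fastparquet's source; the function names are hypothetical, and an index is represented simply as a sequence of labels plus an optional name.

```python
from typing import Hashable, Optional, Sequence


def is_default_index(labels: Sequence[Hashable], name: Optional[str] = None) -> bool:
    """True when the labels are 0, 1, ..., n-1 and the index has no name."""
    return name is None and list(labels) == list(range(len(labels)))


def should_write_index(
    labels: Sequence[Hashable],
    name: Optional[str] = None,
    index: Optional[bool] = None,
) -> bool:
    """Model of the tri-state option with fastparquet-style None handling."""
    if index is None:
        # Engine decides: skip a default, unnamed integer index.
        return not is_default_index(labels, name)
    # Explicit True or False always wins.
    return index
```

Under this model, a frame with labels `[0, 1, 2]` and no index name is written without its index unless the caller passes `index=True`.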

@jreback
Contributor

jreback commented Aug 20, 2018

> fastparquet doesn't write the index if it's an integer sequence 0-n, but pyarrow always writes the index by default. To preserve backwards compatibility, there's a None option that maintains the original engine-dependent behavior.

I see.

This makes fp not idempotent though, which is not a great situation here. I wonder if we should make index=True/False the only options, deprecating None to preserve back compat with fp. Of course, this would make the deprecation warning show on practically every use.

@jorisvandenbossche
Member

> This makes fp not idempotent though, which is not a great situation here

Do you mean that it would not round-trip faithfully?
It still does, because fastparquet only skips writing the index when it is a default, unnamed index. So when reading such a Parquet file, which has no explicit index, the index that gets created is no different from the original.

@jreback
Contributor

jreback commented Aug 20, 2018

> Do you mean that it would not round-trip faithfully?

Is there some magic metadata that it writes? The problem is that this round trip is only good for fp-to-fp?

I think we should standardize on the pandas side with index=True

@dargueta
Contributor Author

dargueta commented Aug 20, 2018

> I think we should standardize on the pandas side with index=True

That's technically an abrupt backwards-incompatible change for anyone using fastparquet to export dataframes for use by another consumer such as a database.

Personally, for minimal disruption I believe we should release this feature with None as the default, with an explicit note that True will be the default in the next backwards-incompatible release.
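
One way such a transition is commonly signalled in Python is a `FutureWarning` raised while the old default is still honoured. The sketch below is hypothetical: `resolve_index` is not a pandas function, and the thread leaves the question of changing the default for later discussion.

```python
import warnings
from typing import Optional


def resolve_index(index: Optional[bool] = None) -> Optional[bool]:
    """Hypothetical transition helper: keep None working, but warn about it."""
    if index is None:
        warnings.warn(
            "index=None keeps the engine-specific behaviour for now; "
            "a future release may default to index=True. "
            "Pass index=True or index=False explicitly.",
            FutureWarning,
            stacklevel=2,
        )
    return index
```

The trade-off raised earlier applies: a warning like this fires on every call that omits `index`, which is practically every existing use.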

@dargueta
Contributor Author

@jreback and @jorisvandenbossche is there anything else you think needs to be addressed?

Member

@jorisvandenbossche jorisvandenbossche left a comment

This looks good to me.

@jreback any further comments?

We can always discuss later whether we want to change the default of index to always be True or to something else (e.g., use the fastparquet logic ourselves).

@jorisvandenbossche jorisvandenbossche changed the title from "ENH20768 Add support for excluding the index from Parquet files" to "ENH: Add support for excluding the index from Parquet files (GH20768)" Sep 5, 2018
@jreback
Contributor

jreback commented Sep 15, 2018

Can you rebase?

@dargueta
Contributor Author

@jreback rebase is done

@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Sep 21, 2018
@jorisvandenbossche jorisvandenbossche merged commit bdb7a16 into pandas-dev:master Sep 21, 2018
@jorisvandenbossche
Member

@dargueta Thanks a lot!

@dargueta dargueta deleted the parquet-index-support branch September 21, 2018 21:16
Successfully merging this pull request may close these issues.

to_parquet method should accept index=False option
7 participants