ENH: Add support for excluding the index from Parquet files (GH20768) #22266

dargueta · 2018-08-09T19:32:05Z

- closes to_parquet method should accept index=False option #20768
- tests added / passed
- passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
- whatsnew entry

pep8speaks · 2018-08-09T19:32:09Z

Hello @dargueta! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 21, 2018 at 18:00 Hours UTC

pandas/core/frame.py

gfyoung · 2018-08-09T20:05:29Z

Good start! Going to need tests as well as a whatsnew entry (potentially a mini-section).

cc @jreback

dargueta · 2018-08-09T20:34:59Z

> Going to need tests as well as a whatsnew entry (potentially a mini-section).

I won't have much time until later today but yeah, I'll finish that off!

dargueta · 2018-08-11T07:13:28Z

@gfyoung I'm not sure I completely understand what's going on in the unit tests. I think I've written a test that handles the expected cases for both engines but I'm not sure. I've never contributed to Pandas before. 😆

pandas/tests/io/test_parquet.py

doc/source/whatsnew/v0.24.0.txt

pandas/tests/io/test_parquet.py

dargueta · 2018-08-13T05:33:03Z

Do you have any idea what's causing these seemingly unrelated tests to fail on Travis? Like, my code should not be failing because "snappy compression is unavailable."

The failing appveyor test seems to be a bit more relevant but I don't fully understand it. It seems to not like the multi-index when read back even though I'm explicitly excluding it from consideration:

E       DataFrame.index classes are not equivalent
E       [left]:  MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
E                  labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])
E       [right]: RangeIndex(start=0, stop=8, step=1)
pandas\util\testing.py:1076: AssertionError

These tests pass just fine on CircleCI, which is weird.

WillAyd · 2018-08-13T19:34:49Z

Just glancing at the Travis failures on 3.6, they seem related: it looks like the name of the index differs before and after the round trip (`None` vs `index`).

dargueta · 2018-08-13T19:43:55Z

So why isn't it failing on CircleCI? Isn't it the same code?

Also, my changes should not be causing TestPythonParser.test_no_header_prefix to suddenly start failing, right? That's another one failing on Travis.

gfyoung · 2018-08-13T19:45:39Z

@dargueta : Good question! Not quite. The coverage for each build / platform is slightly different. That's why we use multiple platforms, for better or worse. 🙂

dargueta · 2018-08-13T19:49:48Z

Okay, good to know. Is there a straightforward way to ignore the index when loading it back, considering we're deliberately not writing it?

gfyoung · 2018-08-13T19:51:44Z

Remind me: why can't you pass in an argument for read_kwargs?

dargueta · 2018-08-13T20:12:04Z

~~I meant for testing, not actual production.~~

This feature was intended to be only for writing, but I suppose I could add it for reading? Not sure how well that'd be supported by the Parquet libraries but I can check. Unless I'm misunderstanding your question?

Ignore that, I misunderstood the question. I think I fixed the issue, but other tests are still failing.
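
For the test side, here is a minimal sketch of one way to build the expected value when the index is deliberately not written: drop the index from the input frame and compare against that. The frame below is made up, shaped like the MultiIndex in the AppVeyor failure above; it is not taken from the PR's actual test code.

```python
import pandas as pd

# Hypothetical frame with a two-level MultiIndex, mirroring the failing case.
index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]]
)
df = pd.DataFrame({"value": range(8)}, index=index)

# With index=False nothing about the index reaches the file, so a reader can
# only hand back a default RangeIndex. Build the expected frame to match
# instead of comparing against the original MultiIndex.
expected = df.reset_index(drop=True)
```

Comparing the frame read back from disk against `expected` rather than `df` avoids the "index classes are not equivalent" assertion.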

jorisvandenbossche

Already looking good!

Concerning the failures related to index names: one possible cause is the write_index keyword for fastparquet. Now you added that in the fastparquet.write call with a default of True. However, its default is not True in fastparquet, so this actually changes behaviour: https://github.com/dask/fastparquet/blob/8db811cc4701d5ae100a8e5c95685daec9e24c6b/fastparquet/writer.py#L793-L795

Only not fully sure what the best solution is here. We could restore the current behaviour (not write index by default for a default index, i.e. write_index=None), but it would also be nice to have it consistent between the different engines.

doc/source/whatsnew/v0.24.0.txt

pandas/tests/io/test_parquet.py

dargueta · 2018-08-14T19:15:48Z

> Only not fully sure what the best solution is here. We could restore the current behaviour (not write index by default for a default index, i.e. write_index=None)

The problem is, pyarrow always writes the index. Different engines already have different behavior. So technically this change alters the existing fastparquet behavior.

I suppose we could have three values, None to use the default behavior for the engine, and then True or False for explicit control.
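
A rough sketch of what that three-valued argument could look like when it is translated into engine keyword arguments. The helper name `engine_index_kwargs` is hypothetical and only for illustration; `preserve_index` (pyarrow's `Table.from_pandas`) and `write_index` (fastparquet's `write`) are the keywords mentioned in this thread.

```python
from typing import Any, Dict, Optional


def engine_index_kwargs(engine: str, index: Optional[bool]) -> Dict[str, Any]:
    """Map a tri-state ``index`` flag to engine-specific keyword arguments.

    Hypothetical helper: None defers to the engine, True/False is explicit.
    """
    if engine not in ("pyarrow", "fastparquet"):
        raise ValueError("unknown engine: {!r}".format(engine))
    if index is None:
        # Pass nothing, so each engine keeps its own default behavior.
        return {}
    if engine == "pyarrow":
        return {"preserve_index": index}
    return {"write_index": index}


print(engine_index_kwargs("pyarrow", False))
print(engine_index_kwargs("fastparquet", None))
```

Returning an empty dict for `None` is the part that preserves backwards compatibility: the engine never sees the argument at all.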

codecov · 2018-08-16T05:47:48Z

Codecov Report

Merging #22266 into master will increase coverage by <.01%.
The diff coverage is 90%.

@@            Coverage Diff             @@
##           master   #22266      +/-   ##
==========================================
+ Coverage   92.17%   92.18%   +<.01%     
==========================================
  Files         169      169              
  Lines       50778    50781       +3     
==========================================
+ Hits        46807    46810       +3     
  Misses       3971     3971

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.59% <90%> (ø)` | ⬆️ |
| #single | `42.32% <20%> (-0.01%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/frame.py | `97.19% <ø> (ø)` | ⬆️ |
| pandas/io/parquet.py | `73.72% <90%> (+0.68%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1c113db...7dc53a1.

jorisvandenbossche

Tests seem to be passing now!

doc/source/io.rst

pandas/core/frame.py

pandas/tests/io/test_parquet.py

dargueta · 2018-08-17T17:08:30Z

Last of the (so far) requested fixes are in!

jreback

why is there a None option here?

doc/source/io.rst

doc/source/whatsnew/v0.24.0.txt

pandas/core/frame.py

pandas/io/parquet.py

dargueta · 2018-08-20T22:45:34Z

> why is there a None option here?

fastparquet doesn't write the index if it's an integer sequence 0-n, but pyarrow always writes the index by default. To preserve backwards compatibility, there's a None option that maintains the original engine-dependent behavior.

jreback · 2018-08-20T22:51:30Z

> fastparquet doesn't write the index if it's an integer sequence 0-n, but pyarrow always writes the index by default. To preserve backwards compatibility, there's a None option that maintains the original engine-dependent behavior.

I see.

This makes fp not idempotent though, which is not a great situation here. I wonder if we should make index=True/False the only option, deprecating None to preserve back compat with fp. Of course, this would make the deprecation warning show on practically every use.

jorisvandenbossche · 2018-08-20T22:55:47Z

> This makes fp not idempotent though, which is not a great situation here

Do you mean that it would not round-trip faithfully?
Because it still does: fastparquet only skips writing the index when it is a default, unnamed index. So when reading such a Parquet file without an explicit index, the created index will be no different.
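
To make "default, unnamed index" concrete, here is a rough approximation of the check as described in this thread. `is_default_index` is a hypothetical helper, not fastparquet's actual code, and the exact rule in fastparquet may differ.

```python
import pandas as pd


def is_default_index(df: pd.DataFrame) -> bool:
    """Approximation: an unnamed RangeIndex counting 0, 1, ..., n-1."""
    idx = df.index
    return (
        isinstance(idx, pd.RangeIndex)
        and idx.name is None
        and idx.start == 0
        and idx.step == 1
    )


plain = pd.DataFrame({"a": [1, 2, 3]})
labelled = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])

# Dropping a default index and letting the reader recreate one loses nothing.
rebuilt = plain.reset_index(drop=True)
print(is_default_index(plain), rebuilt.index.equals(plain.index))
print(is_default_index(labelled))
```

When the check is true, the index a reader creates from scratch equals the one that was skipped, which is why the round trip stays faithful in that case.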

jreback · 2018-08-20T22:58:02Z

> Do you mean that it would not round-trip faithfully?

Is there some magic metadata that it writes? The problem is that this round trip is only good for fp to fp?

I think we should standardize on the pandas side with index=True

dargueta · 2018-08-20T23:01:01Z

> I think we should standardize on the pandas side with index=True

That's technically an abrupt backwards-incompatible change for anyone using fastparquet to export dataframes for use by another consumer such as a database.

Personally, for minimal disruption I believe we should release this feature with None as the default, with an explicit note that True will be the default in the next backwards-incompatible release.

dargueta · 2018-08-24T21:58:51Z

@jreback and @jorisvandenbossche is there anything else you think needs to be addressed?

jorisvandenbossche

This looks good to me.

@jreback any further comments?

We can always discuss later if we want to change the default of index to always be True or to something else (eg use the fastparquet logic ourselves)

doc/source/io.rst

pandas/io/parquet.py

jreback · 2018-09-15T12:38:18Z

Can you rebase?

dargueta · 2018-09-19T23:39:38Z

@jreback rebase is done

jorisvandenbossche · 2018-09-21T08:18:09Z

@dargueta Thanks a lot!


gfyoung added the Enhancement and IO Parquet (parquet, feather) labels Aug 9, 2018

gfyoung reviewed Aug 9, 2018

pandas/core/frame.py (outdated)

gfyoung reviewed Aug 11, 2018

pandas/tests/io/test_parquet.py (outdated)

gfyoung reviewed Aug 11, 2018

doc/source/whatsnew/v0.24.0.txt (outdated)

chris-b1 reviewed Aug 11, 2018

pandas/tests/io/test_parquet.py (outdated)

jorisvandenbossche reviewed Aug 14, 2018

jorisvandenbossche reviewed Aug 16, 2018

doc/source/io.rst (outdated)

doc/source/io.rst (outdated)

pandas/core/frame.py (outdated)

pandas/tests/io/test_parquet.py (outdated)

jreback requested changes Aug 20, 2018

doc/source/io.rst

doc/source/io.rst

doc/source/whatsnew/v0.24.0.txt (outdated)

pandas/core/frame.py

pandas/io/parquet.py

jorisvandenbossche approved these changes Sep 5, 2018

jorisvandenbossche changed the title ~~ENH20768 Add support for excluding the index from Parquet files~~ ENH: Add support for excluding the index from Parquet files (GH20768) Sep 5, 2018

jreback reviewed Sep 5, 2018

doc/source/io.rst (outdated)

pandas/io/parquet.py

Diego Argueta and others added 17 commits September 19, 2018 16:08

Add support for excluding the index from Parquet files

847598b

Update whatsnew

cb01127

Test index omission?

3bec3c2

PR feedback

377cda5

Add tests for custom indexes and a multiindex.

ec58c1a

Forgot to put preserve_index=index in one place

46209e5

Use engine fixture to test both implementations.

45b864d

Fix tests: Remove indexes in expected value.

5768b53

Move explanation of new argument to io.rst

f8bcf60

Don't validate the index if we're not writing it.

e629ae8

Test bugfixes and PR feedback.

f3ddae0

Allow using engine's default behavior.

d26fea8

Document behavior change.

e54e5f1

Code cleanup, PR feedback.

46a4324

PR feedback for documentation

90361b6

add versionadded

759da77

PR feedback about rephrasing

7dc53a1

jorisvandenbossche added this to the 0.24.0 milestone Sep 21, 2018

jorisvandenbossche merged commit bdb7a16 into pandas-dev:master Sep 21, 2018

dargueta deleted the parquet-index-support branch September 21, 2018 21:16

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

ENH: Add support for excluding the index from Parquet files (GH20768) (…

bf598b2

…pandas-dev#22266)
