Skip to content

BUG: Patch read_csv NA values behaviour #14751

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Nov 26, 2016

Patches the following behaviour when na_values is passed in as a dictionary:

  1. Prevent aliasing in case na_values was defined in a broader scope.
  2. Respect column indices as keys when doing NA conversions.

Closes #14203.

@codecov-io
Copy link

codecov-io commented Nov 26, 2016

Current coverage is 85.27% (diff: 100%)

No coverage report found for master at 033d345.

Powered by Codecov. Last update 033d345...0bd7531

@gfyoung gfyoung force-pushed the csv-na-values-patching branch 2 times, most recently from 8ebba67 to b9b0367 Compare November 26, 2016 22:31
@gfyoung
Copy link
Member Author

gfyoung commented Nov 27, 2016

Appveyor keeps failing because the build is taking too long, but I have no clue why that is the case (e.g. this build seemed to do just fine). I took a look at Travis to see if the builds much longer, but that is not the case.

@gfyoung gfyoung force-pushed the csv-na-values-patching branch 2 times, most recently from a2a5b63 to f086bac Compare November 27, 2016 04:24
@sinhrks sinhrks added Bug IO CSV read_csv, to_csv labels Nov 27, 2016
@gfyoung gfyoung force-pushed the csv-na-values-patching branch 2 times, most recently from 5bd7c10 to 0bd7531 Compare December 1, 2016 23:05
@gfyoung
Copy link
Member Author

gfyoung commented Dec 6, 2016

@jreback , @sinhrks : Everything is green, so ready to merge if there are no concerns.

clean_na_values = {}
clean_na_fvalues = {}

if isinstance(self.na_values, dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be better to have a single function in parsers.pyx which does this (e.g. returns cleaned na_value, na_fvalue). rather than have duplicate code in python and c parser.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm...perhaps, but the logic is very coupled with the class attributes of the C and Python parsers. In addition, the way the logic is applied (in bulk for Python, per iteration in C), merging the two would not be straightforward.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(in bulk for Python, per iteration in C),

what does this mean?

this should be done once

Copy link
Member Author

@gfyoung gfyoung Dec 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at the _get_na_list method here for the C engine. It gets called on every iteration to extract the relevant na_values for the particular column in question.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's only called on column conversion (once).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There really is no difference. All we do is get the na_values for the given column, "clean it" and then use them to parse the rest of the column. In Python, we get the na_values for all relevant columns, clean them, and then use stored dictionary later on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just trying to reduce the amount of code
having a common method for all parsers is better than having specific things for each

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I perfectly understand. I just haven't found a clean way to do it yet, and it seems like it would be best to do as a follow-up if that works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh ok

let's do that then

lgtm

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. I'll ping again when everything goes green.

@gfyoung gfyoung force-pushed the csv-na-values-patching branch from 0bd7531 to cac422c Compare December 15, 2016 16:09
@jreback jreback modified the milestones: 0.20.0, 0.19.2 Dec 15, 2016
@gfyoung gfyoung changed the title Patch read_csv NA values behaviour BUG: Patch read_csv NA values behaviour Dec 15, 2016
@gfyoung
Copy link
Member Author

gfyoung commented Dec 16, 2016

@jreback : Everything is passing. Ready to merge if there are no other concerns.

@jreback
Copy link
Contributor

jreback commented Dec 16, 2016

thanks!

@jreback jreback closed this in dd8cba2 Dec 16, 2016
@gfyoung gfyoung deleted the csv-na-values-patching branch December 17, 2016 01:08
ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
Patches the following behaviour when `na_values` is passed in as a
dictionary:    1. Prevent aliasing in case `na_values` was defined in
a broader scope.  2. Respect column indices as keys when doing NA
conversions.    Closes pandas-dev#14203.

Author: gfyoung <[email protected]>

Closes pandas-dev#14751 from gfyoung/csv-na-values-patching and squashes the following commits:

cac422c [gfyoung] BUG: Respect column indices for dict-like na_values
1439c27 [gfyoung] BUG: Prevent aliasing of dict na_values
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Dec 24, 2016
Patches the following behaviour when `na_values` is passed in as a
dictionary:    1. Prevent aliasing in case `na_values` was defined in
a broader scope.  2. Respect column indices as keys when doing NA
conversions.    Closes pandas-dev#14203.

Author: gfyoung <[email protected]>

Closes pandas-dev#14751 from gfyoung/csv-na-values-patching and squashes the following commits:

cac422c [gfyoung] BUG: Respect column indices for dict-like na_values
1439c27 [gfyoung] BUG: Prevent aliasing of dict na_values

(cherry picked from commit dd8cba2)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants