DOC: Clarify how date_parser is called (GH9376) #9377

Merged
merged 1 commit into from
Feb 1, 2015

Conversation

cmeeren
Contributor

@cmeeren cmeeren commented Jan 30, 2015

closes #9376

@jreback
Contributor

jreback commented Jan 30, 2015

I can buy this. @jorisvandenbossche

@jreback jreback added this to the 0.16.0 milestone Jan 30, 2015
@jorisvandenbossche
Member

Yep, looking good. Although in reality it is still a little more complex, as there are actually three steps that are tried: the first and last are the ones you mentioned (vectorized with the columns as input, and scalar with rows), but there is also a vectorized call on the columns concatenated into one column. Adding that as well might make it too complex, though?

@cmeeren Do you also want to add a similar note to the tutorial docs? Somewhere here: http://pandas.pydata.org/pandas-docs/stable/io.html#date-parsing-functions That section could also use a real example I think (of a custom defined function, not of one imported from the io.date_converters module)

@cmeeren
Contributor Author

cmeeren commented Jan 30, 2015

@jorisvandenbossche Just to be clear concerning the second try, where you say it concatenates the columns into one column: If you use two columns, one with values [2013, 2013, 2013] and one with values [1, 2, 3], will the second try pass the single argument [2013, 2013, 2013, 1, 2, 3]?

@cmeeren
Contributor Author

cmeeren commented Jan 30, 2015

I have now mentioned the second way of calling date_parser (assuming my guess in the previous comment was correct) and added a description to the tutorial docs. I have not touched the example, since I have little experience with the Sphinx-IPython combo.

@jreback
Contributor

jreback commented Jan 30, 2015

@cmeeren actually, I think you should mention pd.to_datetime() first. If you want to specify a format, read_csv does not currently have this implemented (it can, however, infer the format if infer_datetime_format=True, which is False by default).

So if you have a single column that needs a format specification, it is MUCH more performant to use pd.to_datetime() AFTER parsing. IOW, you should NOT use date_parser in that case. (In reality read_csv should just do this, but that is an open issue.)
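
A minimal sketch of what converting after parsing could look like. The CSV content, the column names, and the `%d/%m/%Y` format are made up for illustration; only `pd.read_csv` and `pd.to_datetime` come from the discussion.

```python
import io

import pandas as pd

# Made-up CSV content with a day-first date column (illustration only).
csv_text = "date,value\n30/01/2015,1\n31/01/2015,2\n"

# Read the date column as plain strings, with no date_parser involved.
df = pd.read_csv(io.StringIO(csv_text))

# Convert the whole column in one vectorized call with an explicit format.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
```

The conversion happens once on the full column, rather than through a parser function handed to read_csv.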

@cmeeren
Contributor Author

cmeeren commented Jan 30, 2015

@jreback are you suggesting that I mention pd.to_datetime() as a possible function to use for date_parser?

@jreback
Contributor

jreback commented Jan 30, 2015

no! this is in lieu of using date_parser entirely

@cmeeren
Contributor Author

cmeeren commented Jan 31, 2015

I don't really understand the role of pd.to_datetime() in this scenario. Why do you suggest that I mention it in the documentation for date_parser? Could you give an example of how pd.to_datetime() might be used instead of date_parser?

@jorisvandenbossche
Member

The concatenation is like this in your example: ["2013 1", "2013 2", "2013 3"].

@jreback In many cases you are right about to_datetime, and I also think read_csv should take a date_format argument (#2586)
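
The three calling attempts discussed in this thread can be sketched roughly as below. This is a plain-Python illustration, not the pandas implementation: `try_date_parser`, `parse_joined`, and `parse_scalar` are hypothetical names, and the real code works on arrays rather than lists.

```python
from datetime import datetime


def try_date_parser(date_parser, *columns):
    # Hypothetical helper, not pandas code: tries the three call styles in order.
    # 1. Vectorized: each column is passed as its own argument.
    try:
        return date_parser(*columns)
    except Exception:
        pass
    # 2. Vectorized: columns are joined row-wise with a space into one list,
    #    e.g. ["2013 1", "2013 2", "2013 3"].
    joined = [" ".join(row) for row in zip(*columns)]
    try:
        return date_parser(joined)
    except Exception:
        pass
    # 3. Scalar: one call per row, with one string argument per column.
    return [date_parser(*row) for row in zip(*columns)]


years = ["2013", "2013", "2013"]
months = ["1", "2", "3"]


def parse_joined(values):
    # Only accepts a single list of "year month" strings (attempt 2).
    return [datetime.strptime(v, "%Y %m") for v in values]


def parse_scalar(year, month):
    # Only accepts one year string and one month string (attempt 3).
    return datetime(int(year), int(month), 1)


from_joined = try_date_parser(parse_joined, years, months)
from_scalar = try_date_parser(parse_scalar, years, months)
```

Both parsers end up with the same three dates; they just succeed at different attempts.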

@jorisvandenbossche
Member

@jreback finishing my comment: indeed, in many cases to_datetime will be better, but I think you specifically want to use date_parser when you have multiple columns that have to be combined.

BTW, you can use pd.to_datetime as a function for date_parser, no? (if you still want to do it with a one-liner)

@jreback
Contributor

jreback commented Jan 31, 2015

so the simple heuristic is this:

  • if you have multiple columns that need parsing, use parse_dates=[[....]].
  • try to infer the format with read_csv(..., infer_datetime_format=True)
  • if you have a format, then use date_parser=lambda x: pd.to_datetime(x, format=.....)
  • if you have a really non-standard format, finally use date_parser=.....

so a naked date_parser is ALWAYS the last resort (as unless it can handle a vectorized input, it's in Python space).
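
To illustrate the difference between a parser that handles vectorized input and one that stays in Python space, here is a small sketch. The format string, the sample values, and the names `parse_with_format` and `parse_one` are made up for illustration, and the parsers are called directly instead of through read_csv.

```python
from datetime import datetime

import numpy as np
import pandas as pd

FMT = "%d/%m/%Y %H:%M"  # made-up format for illustration


def parse_with_format(values):
    # Vectorized: the whole array is converted in a single pd.to_datetime call.
    # In the pandas version discussed here, this is the kind of function that
    # would be passed as date_parser; that call is not exercised in this sketch.
    return pd.to_datetime(values, format=FMT)


def parse_one(value):
    # Scalar: handles one string at a time, so it has to be looped in Python.
    return datetime.strptime(value, FMT)


raw = np.array(["30/01/2015 13:05", "31/01/2015 07:30"], dtype=object)

vectorized = parse_with_format(raw)
row_by_row = pd.DatetimeIndex([parse_one(v) for v in raw])
```

Both routes give the same timestamps; the vectorized one makes a single call for the whole array, while the scalar one makes one Python call per row.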

@jreback jreback added the IO CSV (read_csv, to_csv) and Datetime (Datetime data dtype) labels Jan 31, 2015
@jreback
Contributor

jreback commented Jan 31, 2015

@cmeeren so what I think we should do is update the doc-string a bit (what you have is prob good). then add a section to the docs in the date parsing section giving the relative list as above.

@jorisvandenbossche
Member

@jreback yes, that is a nice overview of steps to follow!

The full docs on the date parsing can use an overhaul. It is now scattered a bit:

where the first is not adjacent to the other three. I would have just one section with some subsections.

@cmeeren If you want, you can certainly try to tackle this! Otherwise, we just merge this as is (as it is correct information) and open a new issue for it.

@cmeeren
Contributor Author

cmeeren commented Jan 31, 2015

First I'll correct things based on the recent feedback here. The information I added as it currently stands is not correct (specifically regarding the second "concatenation" call).

@cmeeren
Contributor Author

cmeeren commented Jan 31, 2015

I added the list provided by @jreback. Please check the PR diff now and see if the information is good.

an exception is raised, the next one is tried:

1. ``date_parser`` is first called with one or more arrays as arguments,
as defined using `parse_dates` (e.g., ``date_parser(['2013', '2013']``, ``['1', '2'])``)
Member

Here the ```` in the middle of the date_parser(..) example should be removed I think

@jorisvandenbossche
Member

This looks good to me!

Apart from the one small comment, can you also squash your commits into one?

@cmeeren
Contributor Author

cmeeren commented Feb 1, 2015

I addressed the comment and I think I managed to squash the commits now.

@jorisvandenbossche
Member

Thanks a lot!

jorisvandenbossche added a commit that referenced this pull request Feb 1, 2015
DOC: Clarify how date_parser is called (GH9376)
@jorisvandenbossche jorisvandenbossche merged commit ef48c6f into pandas-dev:master Feb 1, 2015
@jreback jreback mentioned this pull request Mar 5, 2015
Labels
Datetime (Datetime data dtype), Docs, IO CSV (read_csv, to_csv)
3 participants