Inconsistent date parsing of to_datetime #42908

MarcoGorelli (Member) commented Aug 5, 2021

carries on from #35428

MarcoGorelli changed the title from "Pr/arw2019/to datetime inconsistent parsing" to "Inconsistent date parsing of to_datetime" Aug 5, 2021
MarcoGorelli marked this pull request as draft August 5, 2021 20:07
jreback (Contributor) commented Aug 5, 2021

@MarcoGorelli this looks neat. prob need to troll all of the issues and add every test you can find :->

jreback added the Datetime (Datetime data dtype) label Aug 5, 2021
MarcoGorelli force-pushed the pr/arw2019/to_datetime-inconsistent-parsing branch from f79d8f7 to 0744ced August 6, 2021 08:56
MarcoGorelli marked this pull request as ready for review August 6, 2021 10:49
MarcoGorelli marked this pull request as draft August 6, 2021 10:58
MarcoGorelli marked this pull request as ready for review August 6, 2021 13:41
MarcoGorelli requested a review from mroeschke August 6, 2021 13:42
mroeschke (Member) left a comment

Nice looks good. Just needs a whatsnew note for 1.4 (probably good to have a dedicated section demoing this behavior), and maybe the to_datetime docstring can explain when a warning will be raised.

MarcoGorelli (Member, Author) commented

Sure, added, here's the docstring:

[image: screenshot of the rendered to_datetime docstring]

Unfortunately, this doesn't solve the case of datetime strings, just date strings. E.g.:

    In [2]: pd.to_datetime(['31-12-2021'])
    <ipython-input-2-a6d9926683b2>:1: UserWarning: Parsing '31-12-2021' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
      pd.to_datetime(['31-12-2021'])
    Out[2]: DatetimeIndex(['2021-12-31'], dtype='datetime64[ns]', freq=None)

    In [3]: pd.to_datetime(['31-12-2021 12:00:31'])
    Out[3]: DatetimeIndex(['2021-12-31 12:00:31'], dtype='datetime64[ns]', freq=None)

because those cases go straight to dateutil:

    if does_string_look_like_time(date_string):
        # use current datetime as default, not pass _DEFAULT_DATETIME
        dt = du_parse(date_string, dayfirst=dayfirst,
                      yearfirst=yearfirst, **kwargs)
        return dt
    dt, _ = _parse_delimited_date(date_string, dayfirst)

and this PR only addresses what reaches _parse_delimited_date.
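
The split described above can be sketched without pandas. What follows is a minimal, stdlib-only illustration and not pandas' actual implementation: `looks_like_time` and `parse_delimited` are hypothetical stand-ins for `does_string_look_like_time` and `_parse_delimited_date`, and both regular expressions are assumptions chosen for the example. Only the warning text is taken from the session above.

```python
import re
import warnings
from datetime import datetime

# Warning text as it appears in the IPython session above.
PARSING_WARNING_MSG = (
    "Parsing '{date_string}' in {format} format. Provide format "
    "or specify infer_datetime_format=True for consistent parsing."
)

# Assumed patterns, for illustration only (not the ones pandas uses).
_TIME_RE = re.compile(r"\d{1,2}:\d{2}")
_DELIMITED_RE = re.compile(r"(\d{1,2})[-/.](\d{1,2})[-/.](\d{4})")


def looks_like_time(date_string):
    """Hypothetical check: does the string carry a time component?"""
    return _TIME_RE.search(date_string) is not None


def parse_delimited(date_string, dayfirst=False):
    """Parse 'A-B-YYYY', warning when the requested order cannot be honoured."""
    match = _DELIMITED_RE.fullmatch(date_string)
    if match is None:
        raise ValueError(f"not a delimited date: {date_string!r}")
    first, second, year = (int(group) for group in match.groups())
    day, month = (first, second) if dayfirst else (second, first)
    if month > 12:
        # The requested order is impossible, so use the other one
        # and tell the user which format was actually applied.
        day, month = month, day
        used = "MM/DD/YYYY" if dayfirst else "DD/MM/YYYY"
        warnings.warn(
            PARSING_WARNING_MSG.format(date_string=date_string, format=used),
            stacklevel=2,
        )
    return datetime(year, month, day)


for text in ["31-12-2021", "31-12-2021 12:00:31"]:
    if looks_like_time(text):
        # In pandas this branch hands the string to dateutil, so the
        # consistency check in the other branch never runs.
        print(f"{text!r}: fallback path, no consistency check")
    else:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            parsed = parse_delimited(text)
        print(f"{text!r}: delimited path -> {parsed}, warnings: {len(caught)}")
```

In this sketch the string with a time component never reaches the branch that compares the parsed order against `dayfirst`, which mirrors why `In [3]` above produced no warning.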

> Just needs a whatsnew note for 1.4 (probably good to have a dedicated section demoing this behavior),

As in, in the "notable bug fixes" section? If so, I've added it there, here's how it looks:

[image: screenshot of the rendered whatsnew entry]

jreback (Contributor) commented Aug 8, 2021

    if does_string_look_like_time(date_string):
        # use current datetime as default, not pass _DEFAULT_DATETIME
        dt = du_parse(date_string, dayfirst=dayfirst,
                      yearfirst=yearfirst, **kwargs)
        return dt
    dt, _ = _parse_delimited_date(date_string, dayfirst)

yeah would be nice to warn in this as well

MarcoGorelli (Member, Author) commented Aug 8, 2021

> yeah would be nice to warn in this as well

Agreed, just not sure how to do it - all I can think of is

        dt = du_parse(date_string, default=_DEFAULT_DATETIME,
                      dayfirst=dayfirst, yearfirst=yearfirst, **kwargs)
        if dayfirst and not re.search(rf'(?<!\d){dt.day}(?!\d)', date_string).start() < re.search(rf'(?<!\d){dt.month}(?!\d)', date_string).start():
            warnings.warn(
                PARSING_WARNING_MSG.format(
                    date_string=date_string,
                    format='MM/DD/YYYY'
                ),
                stacklevel=4,
            )
        elif not dayfirst and re.search(rf'(?<!\d){dt.day}(?!\d)', date_string).start() < re.search(rf'(?<!\d){dt.month}(?!\d)', date_string).start():
            warnings.warn(
                PARSING_WARNING_MSG.format(
                    date_string=date_string,
                    format='DD/MM/YYYY'
                ),
                stacklevel=4,
            )

would that be OK?

As in, parse date_string, and check for the first occurrence of the parsed day and month properties

EDIT

The above wouldn't work, as it's possible to specify a date as both 01 and 1 in the date string.


So, I can't think (at the moment) how to do this - any suggestions?
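
The zero-padding problem from the edit above is easy to reproduce in isolation. The sketch below uses only the standard library; `first_standalone_position` is a hypothetical helper that wraps the same lookaround pattern as the proposal, and the day and month values are hard-coded rather than taken from a real parse.

```python
import re


def first_standalone_position(number, text):
    """Return the index where `number` first appears with no digit on either side, or None."""
    match = re.search(rf"(?<!\d){number}(?!\d)", text)
    return None if match is None else match.start()


# Two-digit day and month: both are found, so their order can be compared.
print(first_standalone_position(31, "31-12-2021"))  # 0
print(first_standalone_position(12, "31-12-2021"))  # 3

# Zero-padded input: the parsed day is the integer 1, but the string holds "01".
# The lookbehind rejects that "1" because it follows a digit, so nothing matches,
# and calling .start() on the result (as the proposal does) would raise AttributeError.
print(first_standalone_position(1, "01-02-2021"))  # None
```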

jreback (Contributor) commented Aug 8, 2021

well the bigger problem here is that we really should warn anytime du_parse is called (if we are not guessing a format) and is actually parsing, i.e. this is the fallback case.

I am happy to merge this and to do followups, but ideally want to warn if at all possible that something is amiss (even with false positives).


notable_bug_fix1
^^^^^^^^^^^^^^^^
Inconsistent date string parsing
Contributor

can you also add to the documentation itself, e.g. somewhere in timeseries.rst is appropriate

.. warning::

    dayfirst=True is not strict, but will prefer to parse
    with day first (this is a known bug, based on dateutil behavior).
Contributor

i wouldn't say this is a known bug unless you can point to an authoritative reference.

MarcoGorelli (Member, Author) commented Aug 22, 2021

I didn't add this, it's from #7599, I just put it into a warning block. Reckon it should be removed?

Member

Yeah +1 to remove this line unless we can link the GitHub issue number.

jreback added this to the 1.4 milestone Aug 8, 2021
jreback added the IO CSV (read_csv, to_csv) label Aug 8, 2021
MarcoGorelli marked this pull request as ready for review August 22, 2021 18:18
MarcoGorelli marked this pull request as draft August 22, 2021 18:40
MarcoGorelli marked this pull request as ready for review August 22, 2021 18:46
MarcoGorelli (Member, Author) commented

> does this show any warnings during csv parsing? (i think it should in the same cases).

yup, I've added the same suite of tests for that

> well the bigger problem here is that we really should warn anytime du_parse is called (if we are not guessing a format) and is actually parsing, i.e. this is the fallback case.

I gave that a go but it's a lot more complicated and there's a ton more warnings to catch. I've pushed that work to a branch (MarcoGorelli:warn-on-du-parse), but for now I've kept this PR to just the delimited date case

mroeschke (Member) left a comment

Just a small comment about removing the comment about the dateutil bug and then feel free to merge.

mroeschke merged commit 36e4165 into pandas-dev:master Aug 26, 2021
mroeschke (Member) commented

Thanks @MarcoGorelli. Happy to have a follow up with warnings around du_parse

attack68 (Contributor) commented

@MarcoGorelli did this also close #43164, where the pandas.options seem ignored in this regard?

MarcoGorelli deleted the pr/arw2019/to_datetime-inconsistent-parsing branch August 27, 2021 08:27
MarcoGorelli (Member, Author) commented

I just tried that and got the same output you got (that said, I haven't looked into pandas.options much and don't know what that option's meant to do, I hadn't seen it before)

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
* added warnings when parse inconsistent with dayfirst arg

* improved error message

* TST: added tests

* removed trailing whitespaces

* removed pytest.warns

* wip

* revert

* set stacklevel, assert warning messages

* okwarning in user guide

* 🎨

* catch warnings

* fixup

* add to to_datetime docstring, add whatsnew note

* wip

* wip

* wip

* wip

* fixup test

* more fixups

* fixup

* revert to b4bb5b3

* document in timeseries.rst

* add tests for read_csv

* check expected_inconsistent in tests

* fixup docs

* remove note about dateutil bug

Co-authored-by: arw2019 <[email protected]>
MarcoGorelli pushed a commit to MarcoGorelli/pandas that referenced this pull request Dec 13, 2022
MarcoGorelli added a commit that referenced this pull request Dec 14, 2022
…can't do anything about it (#50232)

* Revert "Inconsistent date parsing of to_datetime (#42908)"

This reverts commit 36e4165.

* post-merge fixup

* add test

* whatsnew

Co-authored-by: MarcoGorelli <>
Labels: Datetime (Datetime data dtype), IO CSV (read_csv, to_csv)
5 participants