-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Inconsistent date parsing of to_datetime #42908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent date parsing of to_datetime #42908
Conversation
@MarcoGorelli this looks neat. prob need to troll all of the issues and add every test you can find :-> |
f79d8f7
to
0744ced
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice looks good. Just needs a whatsnew note for 1.4 (probably good to have a dedicated section demoing this behavior), and maybe the to_datetime
docstring can explain when a warning will be raised.
Sure, added, here's the docstring: Unfortunately, this doesn't solve the case of datetime strings, just date strings. E.g.: In [2]: pd.to_datetime(['31-12-2021'])
<ipython-input-2-a6d9926683b2>:1: UserWarning: Parsing '31-12-2021' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
pd.to_datetime(['31-12-2021'])
Out[2]: DatetimeIndex(['2021-12-31'], dtype='datetime64[ns]', freq=None)
In [3]: pd.to_datetime(['31-12-2021 12:00:31'])
Out[3]: DatetimeIndex(['2021-12-31 12:00:31'], dtype='datetime64[ns]', freq=None) because those cases go straight to dateutil: pandas/pandas/_libs/tslibs/parsing.pyx Lines 238 to 244 in 08d296f
and this PR only addresses what reaches
As in, in the "notable bug fixes" section? If so, I've added it there, here's how it looks: |
pandas/pandas/_libs/tslibs/parsing.pyx Lines 238 to 244 in 08d296f
yeah would be nice to warn in this as well |
Agreed, just not sure how to do it - all I can think of is dt = du_parse(date_string, default=_DEFAULT_DATETIME,
dayfirst=dayfirst, yearfirst=yearfirst, **kwargs)
if dayfirst and not re.search(rf'(?<!\d){dt.day}(?!\d)', date_string).start() < re.search(rf'(?<!\d){dt.month}(?!\d)', date_string).start():
warnings.warn(
PARSING_WARNING_MSG.format(
date_string=date_string,
format='MM/DD/YYYY'
),
stacklevel=4,
)
elif not dayfirst and re.search(rf'(?<!\d){dt.day}(?!\d)', date_string).start() < re.search(rf'(?<!\d){dt.month}(?!\d)', date_string).start():
warnings.warn(
PARSING_WARNING_MSG.format(
date_string=date_string,
format='DD/MM/YYYY'
),
stacklevel=4,
) would that be OK? As in, parse EDITThe above wouldn't work, as it's possible to specify a date as both So, I can't think (at the moment) how to do this - any suggestions? |
well the bigger problem here is that we really should warn anytime this is called I am happy to merge this and to do followups, but ideally want to warn if at all possible that something is amiss (even with false positives). |
|
||
notable_bug_fix1 | ||
^^^^^^^^^^^^^^^^ | ||
Inconsistent date string parsing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also add to the documentation itself, e.g. somewhere in timeseries.rst
is appropriate
pandas/core/tools/datetimes.py
Outdated
.. warning:: | ||
|
||
dayfirst=True is not strict, but will prefer to parse | ||
with day first (this is a known bug, based on dateutil behavior). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wouldn't say this is a known bug unless you can point to an authoritative reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't add this, it's from #7599 , I just put it into a warning block. Reckon it should be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah +1 to remove this line unless we can link the Github issue number.
yup, I've added the same suite of tests for that
I gave that a go but it's a lot more complicated and there's a ton more warnings to catch. I've pushed that work to a branch (MarcoGorelli:warn-on-du-parse), but for now I've kept this PR to just the delimited date case |
…etime-inconsistent-parsing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a small comment about removing the comment about the dateutil bug and then feel free to merge.
Thanks @MarcoGorelli. Happy to have a follow up with warnings around |
@MarcoGorelli did this also close #43164 , where the |
I just tried that and got the same output you got (that say, I haven't looked into |
* added warnings when parse inconsistent with dayfirst arg * improved error message * TST: added tests * removed trailing whitespaces * removed pytest.warns * wip * revert * set stacklevel, assert warning messages * okwarning in user guide * 🎨 * catch warnings * fixup * add to to_datetime docstring, add whatsnew note * wip * wip * wip * wip * fixup test * more fixups * fixup * revert to b4bb5b3 * document in timeseries.rst * add tests for read_csv * check expected_inconsistent in tests * fixup docs * remove note about dateutil bug Co-authored-by: arw2019 <[email protected]>
This reverts commit 36e4165.
carries on from #35428