PDEP0004: implementation #49024

Merged Dec 13, 2022 (54 commits)

Changes from 41 commits

Commits
ea79669
:wastebasket: deprecate infer_datetime_format, make strict
Oct 18, 2022
bb68cc3
:rotating_light: add warning about dayfirst
Oct 18, 2022
82266f4
:white_check_mark: add/update tests
Oct 18, 2022
4a6f198
:rotating_light: add warning if format cant be guessed
Oct 18, 2022
5568dca
:goal_net: catch warnings
Oct 18, 2022
bc910b0
:memo: update docs
Oct 18, 2022
7d03503
:memo: add example of reading csv file with mixed formats
Oct 19, 2022
ac825f5
:wastebasket: removed now outdated tests / clean inputs
Oct 19, 2022
2ffcef6
:memo: clarify whatsnew and user-guide
Oct 21, 2022
060835d
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Oct 21, 2022
1d9f274
Merge branch 'main' into implementation-pdep-4
MarcoGorelli Oct 28, 2022
b3e32ac
:art:
Oct 28, 2022
22417cf
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Oct 29, 2022
d3adfe5
guess %Y-%m format
Oct 29, 2022
affa7f3
Detect format from first non-na, but also exclude now and today
Oct 29, 2022
575b215
:white_check_mark: fixup tests based on now and today parsing
Oct 29, 2022
f0e83da
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Oct 29, 2022
a5ff448
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Nov 12, 2022
68a6ea2
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Nov 15, 2022
6661ae3
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Nov 17, 2022
1d255e0
fixup after merge
Nov 17, 2022
b3aa585
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Nov 17, 2022
285b1ff
fixup after merge
Nov 17, 2022
963b62b
fixup test
Nov 17, 2022
c90a8a5
remove outdated doctest
Nov 17, 2022
3c033ff
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Nov 19, 2022
cdfa355
xfail test based on issue 49767
Nov 19, 2022
434c6f0
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Dec 2, 2022
5755032
wip
Dec 2, 2022
96c0653
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Dec 3, 2022
9f1c18e
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Dec 3, 2022
0a86705
add back examples of formats which can be guessed
Dec 3, 2022
7b4d6be
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Dec 6, 2022
86e9bcf
start fixing up
Dec 6, 2022
f92a8cb
fixups from reviews
Dec 6, 2022
fd215df
lint
Dec 6, 2022
0a5c466
put tests back
Dec 6, 2022
772dd6c
shorten diff
Dec 6, 2022
b49b7cf
add example of string which cannot be guessed
Dec 6, 2022
17f5e74
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Dec 6, 2022
d17d819
add deprecated directive, construct expected explicitly, explicit Use…
Dec 6, 2022
f4520e9
remove redundant example
Dec 6, 2022
fcb515f
restore newline
Dec 6, 2022
78b4b9e
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Dec 9, 2022
2215652
double backticks around False, explicitly raise UserWarning
Dec 9, 2022
1ec70db
Merge branch 'main' into implementation-pdep-4
MarcoGorelli Dec 10, 2022
7b0eb99
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Dec 12, 2022
7d11f59
reword warning
Dec 12, 2022
30e6f39
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Dec 12, 2022
f0ac458
test both dayfirst True and False
Dec 12, 2022
92ef7e2
Merge remote-tracking branch 'upstream/main' into implementation-pdep-4
Dec 13, 2022
4a5dd1c
postmerge fixup
Dec 13, 2022
917b31b
unimportant typo to restart CI
Dec 13, 2022
135bbb5
Merge branch 'main' into implementation-pdep-4
MarcoGorelli Dec 13, 2022
2 changes: 2 additions & 0 deletions doc/source/user_guide/basics.rst
Expand Up @@ -2284,6 +2284,7 @@ useful if you are reading in data which is mostly of the desired dtype (e.g. num
non-conforming elements intermixed that you want to represent as missing:

.. ipython:: python
:okwarning:

import datetime

Expand All @@ -2300,6 +2301,7 @@ The ``errors`` parameter has a third option of ``errors='ignore'``, which will s
encounters any errors with the conversion to a desired data type:

.. ipython:: python
:okwarning:

import datetime

33 changes: 19 additions & 14 deletions doc/source/user_guide/io.rst
Expand Up @@ -968,17 +968,7 @@ To parse the mixed-timezone values as a datetime column, pass a partially-applie
Inferring datetime format
+++++++++++++++++++++++++

If you have ``parse_dates`` enabled for some or all of your columns, and your
datetime strings are all formatted the same way, you may get a large speed
up by setting ``infer_datetime_format=True``. If set, pandas will attempt
to guess the format of your datetime strings, and then use a faster means
of parsing the strings. 5-10x parsing speeds have been observed. pandas
will fallback to the usual parsing if either the format cannot be guessed
or the format that was guessed cannot properly parse the entire column
of strings. So in general, ``infer_datetime_format`` should not have any
negative consequences if enabled.

Here are some examples of datetime strings that can be guessed (All
Here are some examples of datetime strings that can be guessed (all
representing December 30th, 2011 at 00:00:00):

* "20111230"
Expand All @@ -988,21 +978,36 @@ representing December 30th, 2011 at 00:00:00):
* "30/Dec/2011 00:00:00"
* "30/December/2011 00:00:00"

Note that ``infer_datetime_format`` is sensitive to ``dayfirst``. With
Note that format inference is sensitive to ``dayfirst``. With
``dayfirst=True``, it will guess "01/12/2011" to be December 1st. With
``dayfirst=False`` (default) it will guess "01/12/2011" to be January 12th.

If you try to parse a column of date strings, pandas will attempt to guess the format
from the first non-NaN element, and will then parse the rest of the column with that
format. If pandas fails to guess the format (for example if your first string is
``'01 December US/Pacific 2000'``), then a warning will be raised and each
row will be parsed individually by ``dateutil.parser.parse``. The safest
way to parse dates is to explicitly set ``format=``.
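
The idea of guessing once and then parsing every row with that one format can be sketched with the standard library alone. This is only an illustration under stated assumptions, not pandas' implementation: ``guess_format``, ``parse_column`` and the ``CANDIDATE_FORMATS`` shortlist are hypothetical names, and where pandas warns and falls back to ``dateutil`` when no format can be guessed, this sketch simply raises.

```python
from datetime import datetime

# Hypothetical shortlist of formats; pandas' real guesser covers far more.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y")


def guess_format(value):
    """Return the first candidate format that parses `value`, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_column(values):
    """Guess a format from the first non-missing value and apply it to every row."""
    first = next((v for v in values if v is not None), None)
    if first is None:
        return list(values)  # nothing to guess from
    fmt = guess_format(first)
    if fmt is None:
        # Simplification: pandas would warn and parse row by row instead.
        raise ValueError(f"could not guess a format for {first!r}")
    # strptime raises ValueError for any row that does not match the guessed format.
    return [None if v is None else datetime.strptime(v, fmt) for v in values]


parsed = parse_column([None, "30/12/2011", "31/12/2011"])
```

A column that switches format part-way through (for example ``["2000-01-13", "12 Jan 2000"]``) fails in this sketch rather than being silently reinterpreted, which is the behaviour the explicit ``format=`` recommendation is meant to guarantee.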

.. ipython:: python

# Try to infer the format for the index column
df = pd.read_csv(
"foo.csv",
index_col=0,
parse_dates=True,
infer_datetime_format=True,
)
df

In the case that you have mixed datetime formats within the same column, you'll need to
first read it in as an object dtype and then apply :func:`to_datetime` to each element.

.. ipython:: python

data = io.StringIO("date\n12 Jan 2000\n2000-01-13\n")
df = pd.read_csv(data)
df['date'] = df['date'].apply(pd.to_datetime)
df

.. ipython:: python
:suppress:

14 changes: 7 additions & 7 deletions doc/source/user_guide/timeseries.rst
Expand Up @@ -132,6 +132,8 @@ time.

.. ipython:: python

import datetime

pd.Timestamp(datetime.datetime(2012, 5, 1))
pd.Timestamp("2012-05-01")
pd.Timestamp(2012, 5, 1)
Expand Down Expand Up @@ -196,26 +198,24 @@ is converted to a ``DatetimeIndex``:

.. ipython:: python

pd.to_datetime(pd.Series(["Jul 31, 2009", "2010-01-10", None]))
pd.to_datetime(pd.Series(["Jul 31, 2009", "Jan 10, 2010", None]))

pd.to_datetime(["2005/11/23", "2010.12.31"])
pd.to_datetime(["2005/11/23", "2010/12/31"])

If you use dates which start with the day first (i.e. European style),
you can pass the ``dayfirst`` flag:

.. ipython:: python
:okwarning:
:okwarning:

pd.to_datetime(["04-01-2012 10:00"], dayfirst=True)

pd.to_datetime(["14-01-2012", "01-14-2012"], dayfirst=True)
pd.to_datetime(["04-14-2012 10:00"], dayfirst=True)

Member:

is this line still necessary?

Member Author:

yeah it's an example demonstrating that dayfirst isn't strict

but it's good you've highlighted this, as the blank line I'd removed was preventing it from rendering properly. now it looks fine:

[image]

Member Author:

@jbrockmendel thanks for your review - any other objections? Sorry to tag, just hoping to move this forwards before more merge conflicts arise

Member:

all good here. ill take another look tomorrowish if this is still up, but if it is merged before then i wont complain

.. warning::

You see in the above example that ``dayfirst`` isn't strict. If a date
can't be parsed with the day being first it will be parsed as if
``dayfirst`` were False, and in the case of parsing delimited date strings
(e.g. ``31-12-2012``) then a warning will also be raised.
``dayfirst`` were False and a warning will also be raised.

If you pass a single string to ``to_datetime``, it returns a single ``Timestamp``.
``Timestamp`` can also accept string input, but it doesn't accept string parsing
34 changes: 33 additions & 1 deletion doc/source/whatsnew/v2.0.0.rst
Expand Up @@ -342,6 +342,38 @@ Optional libraries below the lowest tested version may still work, but are not c

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Datetimes are now parsed with a consistent format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the past, :func:`to_datetime` guessed the format for each element independently. This was appropriate for some cases where elements had mixed date formats - however, it would regularly cause problems when users expected a consistent format but the function would switch formats between elements. As of version 2.0.0, parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).

*Old behavior*:

.. code-block:: ipython

In [1]: ser = pd.Series(['13-01-2000', '12-01-2000'])
In [2]: pd.to_datetime(ser)
Out[2]:
0 2000-01-13
1 2000-12-01
dtype: datetime64[ns]

*New behavior*:

.. ipython:: python
:okwarning:

ser = pd.Series(['13-01-2000', '12-01-2000'])
pd.to_datetime(ser)

Note that this affects :func:`read_csv` as well.

If you still need to parse dates with inconsistent formats, you'll need to apply :func:`to_datetime`
to each element individually, e.g. ::

ser = pd.Series(['13-01-2000', '12 January 2000'])
ser.apply(pd.to_datetime)
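
Conversely, when a column does share a single format, passing that format explicitly avoids relying on the guess at all. A minimal sketch, assuming the day-first strings used earlier in this section:

```python
import pandas as pd

ser = pd.Series(["13-01-2000", "12-01-2000"])

# An explicit day-first format: every element is parsed the same way,
# with no format guessing involved.
parsed = pd.to_datetime(ser, format="%d-%m-%Y")
```

Here both elements are read as January dates (the 13th and the 12th), rather than the second one being interpreted month-first.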

.. _whatsnew_200.api_breaking.other:

Other API changes
Expand Down Expand Up @@ -378,7 +410,7 @@ Other API changes

Deprecations
~~~~~~~~~~~~
-
- Deprecated argument ``infer_datetime_format`` in :func:`to_datetime` and :func:`read_csv`, as a strict version of it is now the default (:issue:`48621`)

.. ---------------------------------------------------------------------------

21 changes: 21 additions & 0 deletions pandas/_libs/tslibs/parsing.pyx
Expand Up @@ -1052,6 +1052,7 @@ def guess_datetime_format(dt_str: str, bint dayfirst=False) -> str | None:
# rebuild string, capturing any inferred padding
dt_str = "".join(tokens)
if parsed_datetime.strftime(guessed_format) == dt_str:
_maybe_warn_about_dayfirst(guessed_format, dayfirst)
return guessed_format
else:
return None
Expand All @@ -1071,6 +1072,26 @@ cdef str _fill_token(token: str, padding: int):
token_filled = f"{seconds}.{nanoseconds}"
return token_filled

cdef void _maybe_warn_about_dayfirst(format: str, bint dayfirst):
"""Warn if guessed datetime format doesn't respect dayfirst argument."""
cdef:
int day_index = format.find("%d")
int month_index = format.find("%m")

if (day_index != -1) and (month_index != -1):
if (day_index > month_index) and dayfirst:
warnings.warn(
f"Parsing dates in {format} format when dayfirst=True was specified. "
"Pass `dayfirst=False` or specify a format to silence this warning.",
stacklevel=find_stack_level(),
)
if (day_index < month_index) and not dayfirst:
warnings.warn(
f"Parsing dates in {format} format when dayfirst=False was specified. "
"Pass `dayfirst=True` or specify a format to silence this warning.",
stacklevel=find_stack_level(),
)
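
For readers less familiar with Cython, the check above can be approximated in plain Python. This is a sketch rather than the code in the PR: ``maybe_warn_about_dayfirst`` is a stand-in name, the messages are shortened, and ``stacklevel=2`` replaces pandas' ``find_stack_level()`` helper.

```python
import warnings


def maybe_warn_about_dayfirst(fmt, dayfirst):
    """Warn if a guessed format orders day and month contrary to `dayfirst`."""
    day_index = fmt.find("%d")
    month_index = fmt.find("%m")
    if day_index == -1 or month_index == -1:
        # No numeric day or no numeric month (e.g. "%d %B %Y"): nothing to compare.
        return
    if day_index > month_index and dayfirst:
        warnings.warn(
            f"Parsing dates in {fmt} format when dayfirst=True was specified.",
            UserWarning,
            stacklevel=2,
        )
    elif day_index < month_index and not dayfirst:
        warnings.warn(
            f"Parsing dates in {fmt} format when dayfirst=False was specified.",
            UserWarning,
            stacklevel=2,
        )


# Collect the warnings emitted for a few guessed formats.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    maybe_warn_about_dayfirst("%d-%m-%Y", dayfirst=False)  # day first, flag says otherwise
    maybe_warn_about_dayfirst("%m/%d/%Y", dayfirst=True)   # month first, flag says otherwise
    maybe_warn_about_dayfirst("%Y-%m-%d", dayfirst=False)  # month before day: consistent
    maybe_warn_about_dayfirst("%d %B %Y", dayfirst=False)  # no "%m": check skipped
```

Because only ``%d`` and ``%m`` are looked up, formats that spell the month by name are never flagged, which matches the logic in the diff.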


@cython.wraparound(False)
@cython.boundscheck(False)