
Commit 2ea036f

ENH/WIP: resolution inference in pd.to_datetime, DatetimeIndex (#55901)
* ENH: read_stata return non-nano
* GH ref
* move whatsnew
* remove outdated whatsnew
* ENH: read_stata return non-nano
* avoid Series.view
* dont go through Series
* TST: dt64 units
* BUG: cut with non-nano
* BUG: round with non-nanosecond raising OverflowError
* woops
* BUG: cut with non-nano
* TST: parametrize tests over dt64 unit
* xfail non-nano
* revert
* BUG: mixed-type mixed-timezone/awareness
* commit so i can unstash something else i hope
* ENH: infer resolution in to_datetime, DatetimeIndex
* revert commented-out
* revert commented-out
* revert commented-out
* remove commented-out
* remove comment
* revert unnecessary
* revert unnecessary
* fix window tests
* Fix resample tests
* restore comment
* revert unnecessary
* remove no-longer necessary
* revert no-longer-necessary
* revert no-longer-necessary
* update tests
* revert no-longer-necessary
* update tests
* revert bits
* update tests
* cleanup
* revert
* revert
* parametrize over unit
* update tests
* update tests
* revert no-longer-needed
* revert no-longer-necessary
* revert no-longer-necessary
* revert no-longer-necessary
* revert no-longer-necessary
* Revert no-longer-necessary
* update test
* update test
* simplify
* update tests
* update tests
* update tests
* revert no-longer-necessary
* post-merge fixup
* revert no-longer-necessary
* update tests
* update test
* update tests
* update tests
* remove commented-out
* revert no-longer-necessary
* as_unit->astype
* cleanup
* merge fixup
* revert bit
* revert no-longer-necessary, xfail
* update multithread test
* update tests
* update doctest
* update tests
* update doctests
* update tests
* update db tests
* troubleshoot db tests
* update test
* troubleshoot sql tests
* update test
* update tests
* mypy fixup
* Update test
* kludge test
* update test
* update for min-version tests
* fix adbc check
* troubleshoot minimum version deps
* troubleshoot
* troubleshoot
* troubleshoot
* troubleshoot
* whatsnew
* update abdc-driver-postgresql minimum version
* update doctest
* fix doc example
* troubleshoot test_api_custom_dateparsing_error
* troubleshoot
* troubleshoot
* troubleshoot
* troubleshoot
* troubleshoot
* troubleshoot
* update exp instead of object cast
* revert accidental
* simplify test
1 parent a2a78d3 commit 2ea036f


77 files changed: +745, -457 lines changed

Diff for: doc/source/whatsnew/v3.0.0.rst

+63

@@ -124,6 +124,69 @@ notable_bug_fix2
 Backwards incompatible API changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+.. _whatsnew_300.api_breaking.datetime_resolution_inference:
+
+Datetime resolution inference
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Converting a sequence of strings, ``datetime`` objects, or ``np.datetime64`` objects to
+a ``datetime64`` dtype now performs inference on the appropriate resolution (AKA unit) for the output dtype. This affects :class:`Series`, :class:`DataFrame`, :class:`Index`, :class:`DatetimeIndex`, and :func:`to_datetime`.
+
+Previously, these would always give nanosecond resolution:
+
+.. code-block:: ipython
+
+    In [1]: dt = pd.Timestamp("2024-03-22 11:36").to_pydatetime()
+    In [2]: pd.to_datetime([dt]).dtype
+    Out[2]: dtype('<M8[ns]')
+    In [3]: pd.Index([dt]).dtype
+    Out[3]: dtype('<M8[ns]')
+    In [4]: pd.DatetimeIndex([dt]).dtype
+    Out[4]: dtype('<M8[ns]')
+    In [5]: pd.Series([dt]).dtype
+    Out[5]: dtype('<M8[ns]')
+
+This now infers the microsecond unit "us" from the pydatetime object, matching the scalar :class:`Timestamp` behavior.
+
+.. ipython:: python
+
+    In [1]: dt = pd.Timestamp("2024-03-22 11:36").to_pydatetime()
+    In [2]: pd.to_datetime([dt]).dtype
+    In [3]: pd.Index([dt]).dtype
+    In [4]: pd.DatetimeIndex([dt]).dtype
+    In [5]: pd.Series([dt]).dtype
+
+Similarly, when passed a sequence of ``np.datetime64`` objects, the resolution of the passed objects will be retained (or for lower-than-second resolution, second resolution will be used).
+
+When passing strings, the resolution will depend on the precision of the string, again matching the :class:`Timestamp` behavior. Previously:
+
+.. code-block:: ipython
+
+    In [2]: pd.to_datetime(["2024-03-22 11:43:01"]).dtype
+    Out[2]: dtype('<M8[ns]')
+    In [3]: pd.to_datetime(["2024-03-22 11:43:01.002"]).dtype
+    Out[3]: dtype('<M8[ns]')
+    In [4]: pd.to_datetime(["2024-03-22 11:43:01.002003"]).dtype
+    Out[4]: dtype('<M8[ns]')
+    In [5]: pd.to_datetime(["2024-03-22 11:43:01.002003004"]).dtype
+    Out[5]: dtype('<M8[ns]')
+
+The inferred resolution now matches that of the input strings:
+
+.. ipython:: python
+
+    In [2]: pd.to_datetime(["2024-03-22 11:43:01"]).dtype
+    In [3]: pd.to_datetime(["2024-03-22 11:43:01.002"]).dtype
+    In [4]: pd.to_datetime(["2024-03-22 11:43:01.002003"]).dtype
+    In [5]: pd.to_datetime(["2024-03-22 11:43:01.002003004"]).dtype
+
+In cases with mixed-resolution inputs, the highest resolution is used:
+
+.. code-block:: ipython
+
+    In [2]: pd.to_datetime([pd.Timestamp("2024-03-22 11:43:01"), "2024-03-22 11:43:01.002"]).dtype
+    Out[2]: dtype('<M8[ns]')
+
 .. _whatsnew_300.api_breaking.deps:
 
 Increased minimum versions for dependencies
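
For the ``np.datetime64`` case described in the entry above, a minimal sketch of the expected behavior (not part of the commit; the dtypes in the comments are assumptions based on the whatsnew text, not captured output):

    import numpy as np
    import pandas as pd

    # Millisecond-resolution input: the "ms" unit should be retained.
    vals_ms = [np.datetime64("2024-03-22T11:36:00.001", "ms")]
    print(pd.to_datetime(vals_ms).dtype)   # expected: datetime64[ms]

    # Day resolution is below second resolution, so "s" should be used instead.
    # Note the value is also outside the representable nanosecond range.
    vals_day = [np.datetime64("3000-01-01", "D")]
    print(pd.to_datetime(vals_day).dtype)  # expected: datetime64[s]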

Diff for: pandas/_libs/lib.pyx

+2 -8

@@ -96,16 +96,12 @@ from pandas._libs.missing cimport (
     is_null_datetime64,
     is_null_timedelta64,
 )
-from pandas._libs.tslibs.conversion cimport (
-    _TSObject,
-    convert_to_tsobject,
-)
+from pandas._libs.tslibs.conversion cimport convert_to_tsobject
 from pandas._libs.tslibs.nattype cimport (
     NPY_NAT,
     c_NaT as NaT,
     checknull_with_nat,
 )
-from pandas._libs.tslibs.np_datetime cimport NPY_FR_ns
 from pandas._libs.tslibs.offsets cimport is_offset_object
 from pandas._libs.tslibs.period cimport is_period_object
 from pandas._libs.tslibs.timedeltas cimport convert_to_timedelta64
@@ -2497,7 +2493,6 @@ def maybe_convert_objects(ndarray[object] objects,
         ndarray[uint8_t] mask
         Seen seen = Seen()
         object val
-        _TSObject tsobj
         float64_t fnan = NaN
 
     if dtype_if_all_nat is not None:
@@ -2604,8 +2599,7 @@ def maybe_convert_objects(ndarray[object] objects,
                 else:
                     seen.datetime_ = True
                     try:
-                        tsobj = convert_to_tsobject(val, None, None, 0, 0)
-                        tsobj.ensure_reso(NPY_FR_ns)
+                        convert_to_tsobject(val, None, None, 0, 0)
                     except OutOfBoundsDatetime:
                         # e.g. test_out_of_s_bounds_datetime64
                         seen.object_ = True

Diff for: pandas/_libs/tslib.pyx

+11 -6

@@ -63,7 +63,10 @@ from pandas._libs.tslibs.conversion cimport (
     get_datetime64_nanos,
     parse_pydatetime,
 )
-from pandas._libs.tslibs.dtypes cimport npy_unit_to_abbrev
+from pandas._libs.tslibs.dtypes cimport (
+    get_supported_reso,
+    npy_unit_to_abbrev,
+)
 from pandas._libs.tslibs.nattype cimport (
     NPY_NAT,
     c_nat_strings as nat_strings,
@@ -260,7 +263,7 @@ cpdef array_to_datetime(
     bint dayfirst=False,
     bint yearfirst=False,
     bint utc=False,
-    NPY_DATETIMEUNIT creso=NPY_FR_ns,
+    NPY_DATETIMEUNIT creso=NPY_DATETIMEUNIT.NPY_FR_GENERIC,
     str unit_for_numerics=None,
 ):
     """
@@ -288,8 +291,8 @@ cpdef array_to_datetime(
         yearfirst parsing behavior when encountering datetime strings
     utc : bool, default False
         indicator whether the dates should be UTC
-    creso : NPY_DATETIMEUNIT, default NPY_FR_ns
-        Set to NPY_FR_GENERIC to infer a resolution.
+    creso : NPY_DATETIMEUNIT, default NPY_FR_GENERIC
+        If NPY_FR_GENERIC, conduct inference.
     unit_for_numerics : str, default "ns"
 
     Returns
@@ -389,7 +392,7 @@ cpdef array_to_datetime(
             # GH#32264 np.str_ object
             val = str(val)
 
-            if parse_today_now(val, &iresult[i], utc, creso):
+            if parse_today_now(val, &iresult[i], utc, creso, infer_reso=infer_reso):
                 # We can't _quite_ dispatch this to convert_str_to_tsobject
                 # bc there isn't a nice way to pass "utc"
                 item_reso = NPY_DATETIMEUNIT.NPY_FR_us
@@ -533,7 +536,9 @@ def array_to_datetime_with_tz(
     if state.creso_ever_changed:
         # We encountered mismatched resolutions, need to re-parse with
        # the correct one.
-        return array_to_datetime_with_tz(values, tz=tz, creso=creso)
+        return array_to_datetime_with_tz(
+            values, tz=tz, dayfirst=dayfirst, yearfirst=yearfirst, creso=creso
+        )
     elif creso == NPY_DATETIMEUNIT.NPY_FR_GENERIC:
         # i.e. we never encountered anything non-NaT, default to "s". This
         # ensures that insert and concat-like operations with NaT
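
A user-level sketch of the two ``array_to_datetime_with_tz`` behaviors touched above, with expected dtypes noted as assumptions rather than captured output:

    import pandas as pd

    # Mixed second/millisecond precision with an explicit tz: after the re-parse
    # at the matching resolution, the higher ("ms") resolution should win.
    idx = pd.DatetimeIndex(["2024-03-22 11:43:01", "2024-03-22 11:43:01.002"], tz="UTC")
    print(idx.dtype)  # expected: datetime64[ms, UTC]

    # All-NaT input gives nothing to infer from, so the code above falls back to "s".
    print(pd.DatetimeIndex([pd.NaT, pd.NaT], tz="UTC").dtype)  # expected: datetime64[s, UTC]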

Diff for: pandas/_libs/tslibs/strptime.pyx

+3 -3

@@ -354,7 +354,7 @@ def array_strptime(
     bint exact=True,
     errors="raise",
     bint utc=False,
-    NPY_DATETIMEUNIT creso=NPY_FR_ns,
+    NPY_DATETIMEUNIT creso=NPY_DATETIMEUNIT.NPY_FR_GENERIC,
 ):
     """
     Calculates the datetime structs represented by the passed array of strings
@@ -365,7 +365,7 @@ def array_strptime(
     fmt : string-like regex
     exact : matches must be exact if True, search if False
     errors : string specifying error handling, {'raise', 'coerce'}
-    creso : NPY_DATETIMEUNIT, default NPY_FR_ns
+    creso : NPY_DATETIMEUNIT, default NPY_FR_GENERIC
         Set to NPY_FR_GENERIC to infer a resolution.
     """
 
@@ -712,7 +712,7 @@ cdef tzinfo _parse_with_format(
         elif len(s) <= 6:
             item_reso[0] = NPY_DATETIMEUNIT.NPY_FR_us
         else:
-            item_reso[0] = NPY_DATETIMEUNIT.NPY_FR_ns
+            item_reso[0] = NPY_FR_ns
         # Pad to always return nanoseconds
         s += "0" * (9 - len(s))
         us = int(s)
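
How the new ``array_strptime`` default surfaces through ``pd.to_datetime`` with an explicit ``format``; the expected dtypes follow the fractional-second logic in ``_parse_with_format`` and are assumptions, not output captured from this revision:

    import pandas as pd

    fmt = "%Y-%m-%d %H:%M:%S.%f"
    # Three fractional digits -> millisecond resolution.
    print(pd.to_datetime(["2024-03-22 11:43:01.002"], format=fmt).dtype)        # expected: datetime64[ms]
    # Four to six fractional digits -> microsecond resolution.
    print(pd.to_datetime(["2024-03-22 11:43:01.002003"], format=fmt).dtype)     # expected: datetime64[us]
    # More than six digits -> nanosecond resolution.
    print(pd.to_datetime(["2024-03-22 11:43:01.002003004"], format=fmt).dtype)  # expected: datetime64[ns]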

Diff for: pandas/core/algorithms.py

+5 -3

@@ -346,14 +346,15 @@ def unique(values):
     array([2, 1])
 
     >>> pd.unique(pd.Series([pd.Timestamp("20160101"), pd.Timestamp("20160101")]))
-    array(['2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
+    array(['2016-01-01T00:00:00'], dtype='datetime64[s]')
 
     >>> pd.unique(
     ...     pd.Series(
     ...         [
     ...             pd.Timestamp("20160101", tz="US/Eastern"),
     ...             pd.Timestamp("20160101", tz="US/Eastern"),
-    ...         ]
+    ...         ],
+    ...         dtype="M8[ns, US/Eastern]",
     ...     )
     ... )
     <DatetimeArray>
@@ -365,7 +366,8 @@ def unique(values):
     ...         [
     ...             pd.Timestamp("20160101", tz="US/Eastern"),
     ...             pd.Timestamp("20160101", tz="US/Eastern"),
-    ...         ]
+    ...         ],
+    ...         dtype="M8[ns, US/Eastern]",
     ...     )
     ... )
     DatetimeIndex(['2016-01-01 00:00:00-05:00'],

Diff for: pandas/core/arrays/datetimelike.py

+6 -6

@@ -1849,11 +1849,11 @@ def strftime(self, date_format: str) -> npt.NDArray[np.object_]:
 
     >>> rng_tz.floor("2h", ambiguous=False)
     DatetimeIndex(['2021-10-31 02:00:00+01:00'],
-                  dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
+                  dtype='datetime64[s, Europe/Amsterdam]', freq=None)
 
     >>> rng_tz.floor("2h", ambiguous=True)
     DatetimeIndex(['2021-10-31 02:00:00+02:00'],
-                  dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
+                  dtype='datetime64[s, Europe/Amsterdam]', freq=None)
     """
 
     _floor_example = """>>> rng.floor('h')
@@ -1876,11 +1876,11 @@ def strftime(self, date_format: str) -> npt.NDArray[np.object_]:
 
     >>> rng_tz.floor("2h", ambiguous=False)
     DatetimeIndex(['2021-10-31 02:00:00+01:00'],
-                  dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
+                  dtype='datetime64[s, Europe/Amsterdam]', freq=None)
 
     >>> rng_tz.floor("2h", ambiguous=True)
     DatetimeIndex(['2021-10-31 02:00:00+02:00'],
-                  dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
+                  dtype='datetime64[s, Europe/Amsterdam]', freq=None)
     """
 
     _ceil_example = """>>> rng.ceil('h')
@@ -1903,11 +1903,11 @@ def strftime(self, date_format: str) -> npt.NDArray[np.object_]:
 
     >>> rng_tz.ceil("h", ambiguous=False)
     DatetimeIndex(['2021-10-31 02:00:00+01:00'],
-                  dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
+                  dtype='datetime64[s, Europe/Amsterdam]', freq=None)
 
     >>> rng_tz.ceil("h", ambiguous=True)
     DatetimeIndex(['2021-10-31 02:00:00+02:00'],
-                  dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
+                  dtype='datetime64[s, Europe/Amsterdam]', freq=None)
     """
 
 

Diff for: pandas/core/arrays/datetimes.py

+18 -17

@@ -218,7 +218,7 @@ class DatetimeArray(dtl.TimelikeOps, dtl.DatelikeOps): # type: ignore[misc]
     ... )
     <DatetimeArray>
     ['2023-01-01 00:00:00', '2023-01-02 00:00:00']
-    Length: 2, dtype: datetime64[ns]
+    Length: 2, dtype: datetime64[s]
     """
 
     _typ = "datetimearray"
@@ -613,7 +613,7 @@ def tz(self) -> tzinfo | None:
         >>> s
         0   2020-01-01 10:00:00+00:00
         1   2020-02-01 11:00:00+00:00
-        dtype: datetime64[ns, UTC]
+        dtype: datetime64[s, UTC]
         >>> s.dt.tz
         datetime.timezone.utc
 
@@ -1047,7 +1047,7 @@ def tz_localize(
         4   2018-10-28 02:30:00+01:00
         5   2018-10-28 03:00:00+01:00
         6   2018-10-28 03:30:00+01:00
-        dtype: datetime64[ns, CET]
+        dtype: datetime64[s, CET]
 
         In some cases, inferring the DST is impossible. In such cases, you can
         pass an ndarray to the ambiguous parameter to set the DST explicitly
@@ -1059,14 +1059,14 @@ def tz_localize(
         0   2018-10-28 01:20:00+02:00
         1   2018-10-28 02:36:00+02:00
         2   2018-10-28 03:46:00+01:00
-        dtype: datetime64[ns, CET]
+        dtype: datetime64[s, CET]
 
         If the DST transition causes nonexistent times, you can shift these
        dates forward or backwards with a timedelta object or `'shift_forward'`
        or `'shift_backwards'`.
 
         >>> s = pd.to_datetime(pd.Series(['2015-03-29 02:30:00',
-        ...                               '2015-03-29 03:30:00']))
+        ...                               '2015-03-29 03:30:00'], dtype="M8[ns]"))
         >>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
         0   2015-03-29 03:00:00+02:00
         1   2015-03-29 03:30:00+02:00
@@ -1427,7 +1427,7 @@ def time(self) -> npt.NDArray[np.object_]:
         >>> s
         0   2020-01-01 10:00:00+00:00
         1   2020-02-01 11:00:00+00:00
-        dtype: datetime64[ns, UTC]
+        dtype: datetime64[s, UTC]
         >>> s.dt.time
         0    10:00:00
         1    11:00:00
@@ -1470,7 +1470,7 @@ def timetz(self) -> npt.NDArray[np.object_]:
         >>> s
         0   2020-01-01 10:00:00+00:00
         1   2020-02-01 11:00:00+00:00
-        dtype: datetime64[ns, UTC]
+        dtype: datetime64[s, UTC]
         >>> s.dt.timetz
         0    10:00:00+00:00
         1    11:00:00+00:00
@@ -1512,7 +1512,7 @@ def date(self) -> npt.NDArray[np.object_]:
         >>> s
         0   2020-01-01 10:00:00+00:00
         1   2020-02-01 11:00:00+00:00
-        dtype: datetime64[ns, UTC]
+        dtype: datetime64[s, UTC]
         >>> s.dt.date
         0    2020-01-01
         1    2020-02-01
@@ -1861,7 +1861,7 @@ def isocalendar(self) -> DataFrame:
         >>> s
         0   2020-01-01 10:00:00+00:00
         1   2020-02-01 11:00:00+00:00
-        dtype: datetime64[ns, UTC]
+        dtype: datetime64[s, UTC]
         >>> s.dt.dayofyear
         0    1
         1    32
@@ -1897,7 +1897,7 @@ def isocalendar(self) -> DataFrame:
         >>> s
         0   2020-01-01 10:00:00+00:00
         1   2020-04-01 11:00:00+00:00
-        dtype: datetime64[ns, UTC]
+        dtype: datetime64[s, UTC]
         >>> s.dt.quarter
         0    1
         1    2
@@ -1933,7 +1933,7 @@ def isocalendar(self) -> DataFrame:
         >>> s
         0   2020-01-01 10:00:00+00:00
         1   2020-02-01 11:00:00+00:00
-        dtype: datetime64[ns, UTC]
+        dtype: datetime64[s, UTC]
         >>> s.dt.daysinmonth
         0    31
         1    29
@@ -2372,9 +2372,9 @@ def _sequence_to_dt64(
     data, copy = maybe_convert_dtype(data, copy, tz=tz)
     data_dtype = getattr(data, "dtype", None)
 
-    if out_unit is None:
-        out_unit = "ns"
-    out_dtype = np.dtype(f"M8[{out_unit}]")
+    out_dtype = DT64NS_DTYPE
+    if out_unit is not None:
+        out_dtype = np.dtype(f"M8[{out_unit}]")
 
     if data_dtype == object or is_string_dtype(data_dtype):
         # TODO: We do not have tests specific to string-dtypes,
@@ -2400,7 +2400,7 @@ def _sequence_to_dt64(
             dayfirst=dayfirst,
             yearfirst=yearfirst,
             allow_object=False,
-            out_unit=out_unit or "ns",
+            out_unit=out_unit,
         )
         copy = False
         if tz and inferred_tz:
@@ -2508,7 +2508,7 @@ def objects_to_datetime64(
     utc: bool = False,
     errors: DateTimeErrorChoices = "raise",
     allow_object: bool = False,
-    out_unit: str = "ns",
+    out_unit: str | None = None,
 ) -> tuple[np.ndarray, tzinfo | None]:
     """
     Convert data to array of timestamps.
@@ -2524,7 +2524,8 @@ def objects_to_datetime64(
     allow_object : bool
         Whether to return an object-dtype ndarray instead of raising if the
         data contains more than one timezone.
-    out_unit : str, default "ns"
+    out_unit : str or None, default None
+        None indicates we should do resolution inference.
 
     Returns
     -------
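
Finally, a short sketch of the ``out_unit=None`` contract documented above, seen from the public API: with no unit specified, resolution is inferred, while an explicit dtype pins the unit and bypasses inference (expected dtypes are assumptions):

    import datetime
    import pandas as pd

    dt = datetime.datetime(2024, 3, 22, 11, 36)

    # No unit requested: inference applies; pydatetime objects map to "us".
    print(pd.Series([dt]).dtype)                           # expected: datetime64[us]

    # An explicit dtype corresponds to passing a concrete out_unit, so no inference.
    print(pd.Series([dt], dtype="datetime64[ns]").dtype)   # datetime64[ns]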
