BUG refactor datetime parsing and fix 8 bugs #50242

Merged

Conversation

Member

@MarcoGorelli MarcoGorelli commented Dec 13, 2022

This would solve a number of issues.

Work in progress.


Performance: this maintains the fast path performance for ISO formats:

format = '%Y-%d-%m %H:%M:%S%z'
dates = pd.date_range('1900', '2000').tz_localize('+01:00').strftime(format).tolist()

upstream/main:

In [2]: %%timeit
   ...: pd.to_datetime(dates, format=format)
   ...: 
   ...: 
241 ms ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

here:

In [2]: %%timeit
   ...: pd.to_datetime(dates, format=format)
   ...: 
   ...: 
221 ms ± 5.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Demo of how this addresses #17410

In [8]: s = pd.Series(['20120101']*1000000)

In [9]: %timeit pd.to_datetime(s, cache=False)  # no format
72.7 ms ± 929 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: %timeit pd.to_datetime(s, cache=False, format='%Y%m%d')  # slightly faster, as it doesn't need to guess the format
72.2 ms ± 665 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: %timeit pd.to_datetime(s, cache=False, format='%Y%d%m')  # by comparison, non-ISO is much slower
1.12 s ± 52.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

No real difference for non-ISO formats:

1.5.2:

In [16]: format = "%m-%d-%Y"

In [17]: dates = pd.date_range('1900', '2000').tz_localize('+01:00').strftime(format).tolist()

In [18]: %%timeit
    ...: pd.to_datetime(dates, format=format)
    ...:
    ...:
43.5 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

here:

In [2]: format = "%m-%d-%Y"

In [3]: dates = pd.date_range('1900', '2000').tz_localize('+01:00').strftime(format).tolist()

In [4]: %%timeit
   ...: pd.to_datetime(dates, format=format)
   ...: 
   ...: 
42.4 ms ± 405 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note

Going to try to get #50361 in first, so marking as draft for now.

@MarcoGorelli MarcoGorelli added the Datetime Datetime data dtype label Dec 13, 2022
@MarcoGorelli MarcoGorelli force-pushed the share-datetime-parsing-format-paths branch from adff421 to 73a909d Compare December 14, 2022 08:24
@MarcoGorelli MarcoGorelli force-pushed the share-datetime-parsing-format-paths branch from 37d8b15 to 7617774 Compare December 14, 2022 08:59
Comment on lines 207 to 286
if (iso_format and not (fmt == "%Y%m%d" and len(val) != 8)):
    # There is a fast-path for ISO8601-formatted strings.
    # BUT for %Y%m%d, it only works if the string is 8-digits long.
    string_to_dts_failed = string_to_dts(
        val, &dts, &out_bestunit, &out_local,
        &out_tzoffset, False, fmt, exact
    )
    if string_to_dts_failed:
        # An error at this point is a _parsing_ error
        # specifically _not_ OutOfBoundsDatetime
        if is_coerce:
            iresult[i] = NPY_NAT
            continue
        raise ValueError(
            f"time data \"{val}\" at position {i} doesn't "
            f"match format \"{fmt}\""
        )
    # No error reported by string_to_dts, pick back up
    # where we left off
    value = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
    if out_local == 1:
        # Store the out_tzoffset in seconds
        # since we store the total_seconds of
        # dateutil.tz.tzoffset objects
        # out_tzoffset_vals.add(out_tzoffset * 60.)
        tz = timezone(timedelta(minutes=out_tzoffset))
        result_timezone[i] = tz
        # value = tz_localize_to_utc_single(value, tz)
        out_local = 0
        out_tzoffset = 0
    iresult[i] = value
    try:
        check_dts_bounds(&dts)
    except ValueError:
        if is_coerce:
            iresult[i] = NPY_NAT
            continue
        raise
    continue
Member Author

This pretty much matches:

string_to_dts_failed = string_to_dts(
    val, &dts, &out_bestunit, &out_local,
    &out_tzoffset, False, format, exact
)
if string_to_dts_failed:
    # An error at this point is a _parsing_ error
    # specifically _not_ OutOfBoundsDatetime
    if _parse_today_now(val, &iresult[i], utc):
        continue
    elif require_iso8601:
        # if requiring iso8601 strings, skip trying
        # other formats
        if is_coerce:
            iresult[i] = NPY_NAT
            continue
        elif is_raise:
            raise ValueError(
                f"time data \"{val}\" at position {i} doesn't "
                f"match format \"{format}\""
            )
        return values, tz_out
    try:
        py_dt = parse_datetime_string(val,
                                      dayfirst=dayfirst,
                                      yearfirst=yearfirst)
        # If the dateutil parser returned tzinfo, capture it
        # to check if all arguments have the same tzinfo
        tz = py_dt.utcoffset()
    except (ValueError, OverflowError):
        if is_coerce:
            iresult[i] = NPY_NAT
            continue
        raise TypeError(
            f"invalid string coercion to datetime for \"{val}\" "
            f"at position {i}"
        )
    if tz is not None:
        seen_datetime_offset = True
        # dateutil timezone objects cannot be hashed, so
        # store the UTC offsets in seconds instead
        out_tzoffset_vals.add(tz.total_seconds())
    else:
        # Add a marker for naive string, to track if we are
        # parsing mixed naive and aware strings
        out_tzoffset_vals.add("naive")
    _ts = convert_datetime_to_tsobject(py_dt, None)
    iresult[i] = _ts.value
if not string_to_dts_failed:
    # No error reported by string_to_dts, pick back up
    # where we left off
    value = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
    if out_local == 1:
        seen_datetime_offset = True
        # Store the out_tzoffset in seconds
        # since we store the total_seconds of
        # dateutil.tz.tzoffset objects
        out_tzoffset_vals.add(out_tzoffset * 60.)
        tz = timezone(timedelta(minutes=out_tzoffset))
        value = tz_localize_to_utc_single(value, tz)
        out_local = 0
        out_tzoffset = 0
    else:
        # Add a marker for naive string, to track if we are
        # parsing mixed naive and aware strings
        out_tzoffset_vals.add("naive")
    iresult[i] = value
    check_dts_bounds(&dts)

but it's simpler, as we don't need to try parse_datetime_string. That's because if we got here, we know we're expecting some specific ISO8601 format, so if string_to_dts can't parse it, then we need to coerce/raise/ignore - there's no need to try other formats.
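
To make the coerce/raise/ignore contract concrete, here's a small user-level sketch (illustrative only, not code from this PR) of what each errors= mode does when a value doesn't match the given format:

import pandas as pd

vals = ["2020-01-01", "not-a-date"]

# errors="coerce": the unparseable element becomes NaT
pd.to_datetime(vals, format="%Y-%m-%d", errors="coerce")

# errors="ignore": the input is returned unchanged
pd.to_datetime(vals, format="%Y-%m-%d", errors="ignore")

# errors="raise" (the default): ValueError naming the offending value, its position and the format
pd.to_datetime(vals, format="%Y-%m-%d")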

Comment on lines -960 to +930
datetime.datetime(1300, 1, 1, 0, 0)
'13000101'

Comment on lines -999 to +968
DatetimeIndex(['2020-01-01 01:00:00-01:00', '2020-01-01 02:00:00-01:00'],
dtype='datetime64[ns, UTC-01:00]', freq=None)
Index([2020-01-01 01:00:00-01:00, 2020-01-01 03:00:00], dtype='object')

Comment on lines -157 to +161
# The +9 format for offsets is supported by dateutil,
# but don't round-trip, see https://github.com/pandas-dev/pandas/issues/48921
("2011-12-30T00:00:00+9", None),
("2011-12-30T00:00:00+09", None),
("2011-12-30T00:00:00+9", "%Y-%m-%dT%H:%M:%S%z"),
("2011-12-30T00:00:00+09", "%Y-%m-%dT%H:%M:%S%z"),
Member Author

This is nice! In

try:
    array_strptime(np.asarray([dt_str], dtype=object), guessed_format)
except ValueError:
    # Doesn't parse, so this can't be the correct format.
    return None

we check that array_strptime can parse the first non-null element with the guessed format. Now that array_strptime can parse both ISO and non-ISO formats, we're expanding on the list of formats which can be guessed!
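
As a concrete (hedged) example of the expanded guessing, using the private guess_datetime_format helper - assuming that's the helper the updated test above exercises:

from pandas._libs.tslibs.parsing import guess_datetime_format

guess_datetime_format("2011-12-30T00:00:00+09")
# per the updated test expectation: '%Y-%m-%dT%H:%M:%S%z'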

@MarcoGorelli MarcoGorelli force-pushed the share-datetime-parsing-format-paths branch from f72d3d6 to 3283b81 Compare December 18, 2022 19:16
@MarcoGorelli MarcoGorelli mentioned this pull request Dec 18, 2022
@MarcoGorelli MarcoGorelli force-pushed the share-datetime-parsing-format-paths branch from a177975 to d37e743 Compare December 20, 2022 10:20
@MarcoGorelli MarcoGorelli marked this pull request as ready for review December 20, 2022 10:21
@MarcoGorelli MarcoGorelli changed the title WIP Share datetime parsing format paths BUG Share datetime parsing format paths and fix 7 bugs Dec 20, 2022
@MarcoGorelli MarcoGorelli changed the title BUG Share datetime parsing format paths and fix 7 bugs BUG refactor datetime parsing and fix 7 bugs Dec 20, 2022
@MarcoGorelli MarcoGorelli force-pushed the share-datetime-parsing-format-paths branch from d37e743 to 0c95207 Compare December 20, 2022 14:26
@MarcoGorelli MarcoGorelli marked this pull request as draft December 20, 2022 14:26
@MarcoGorelli MarcoGorelli force-pushed the share-datetime-parsing-format-paths branch from 0c95207 to 3257a31 Compare December 20, 2022 15:11
@MarcoGorelli MarcoGorelli changed the title BUG refactor datetime parsing and fix 7 bugs BUG refactor datetime parsing and fix 8 bugs Dec 20, 2022
@MarcoGorelli MarcoGorelli marked this pull request as ready for review December 20, 2022 15:12
@MarcoGorelli MarcoGorelli force-pushed the share-datetime-parsing-format-paths branch from 3257a31 to 2f8fade Compare December 20, 2022 16:11
Member

@WillAyd WillAyd left a comment

lgtm. Have comments on a couple of things that I think can happen as follow-ups.

"""
excluded_formats = ["%Y%m"]

for date_sep in [" ", "/", "\\", "-", ".", ""]:
Member

Instead of a loop, can you express this as a regular expression? Seems like it would help performance that way as well.
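
A rough sketch of what the regex approach might look like (hypothetical, not part of this PR; guess_date_sep is an invented helper name):

import re

# one compiled pattern covering the same candidate separators,
# with "" handled by the optional group
_DATE_SEP_RE = re.compile(r"^\d{2,4}([ /\\\-.]?)\d{1,2}")

def guess_date_sep(dt_str):
    m = _DATE_SEP_RE.match(dt_str)
    return m.group(1) if m else None

guess_date_sep("2011-12-30")  # '-'
guess_date_sep("20111230")    # ''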

Member

Ah, never mind - I see this is the way it is currently written. Something to consider for another PR though; my guess is it can only help.

@@ -550,25 +536,13 @@ cpdef array_to_datetime(

string_to_dts_failed = string_to_dts(
Member

The error messaging here is a bit confusing to me - it looks like string_to_dts is already declared except? -1. Is there a reason why Cython doesn't propagate an error before your check of if string_to_dts_failed?

Member Author

@MarcoGorelli MarcoGorelli Dec 27, 2022

It's because want_exc is False here:

parse_error:
    if (want_exc) {
        PyErr_Format(PyExc_ValueError,
                     "Error parsing datetime string \"%s\" at position %d", str,
                     (int)(substr - str));
    }
    return -1;

The only place where it's True is

object[::1] res_flat = result.ravel()  # should NOT be a copy
cnp.flatiter it = cnp.PyArray_IterNew(values)

if na_rep is None:
    na_rep = "NaT"

if tz is None:
    # if we don't have a format nor tz, then choose
    # a format based on precision
    basic_format = format is None
    if basic_format:
        reso_obj = get_resolution(values, tz=tz, reso=reso)
        show_ns = reso_obj == Resolution.RESO_NS
        show_us = reso_obj == Resolution.RESO_US
        show_ms = reso_obj == Resolution.RESO_MS
elif format == "%Y-%m-%d %H:%M:%S":
    # Same format as default, but with hardcoded precision (s)

which is only a testing function. So perhaps the except? -1 can just be removed, along with the testing function (I think it would be better to test to_datetime directly).

I'd keep that to a separate PR anyway, but thanks for catching this!
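
For anyone following along, a minimal generic Cython sketch (not pandas code) of why except? -1 behaves this way:

cdef int parse_something(str s) except? -1:
    # "except? -1" means a return value of -1 only counts as an error if a
    # Python exception has also been set (Cython then checks PyErr_Occurred()).
    if s is None:
        raise ValueError("no input")  # exception set -> propagates to the caller
    return -1                         # plain -1 with no exception set -> treated as a normal value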

Member

Cool, thanks for the review. Yeah, I'd be OK with your suggestion in a separate PR. Always good to clean this up - not sure we've handled it consistently in the past.

Member

@mroeschke mroeschke left a comment

Nice!

@MarcoGorelli
Member Author

Nice!

Thanks!

Can I ask that we get #50366 in first though? That'll reduce the diff in this one

@MarcoGorelli MarcoGorelli force-pushed the share-datetime-parsing-format-paths branch from 9c5c378 to 392d239 Compare December 28, 2022 11:40
@MarcoGorelli
Member Author

Can I ask that we get #50366 in first though? That'll reduce the diff in this one

Cool, that's in, and I've rebased.

Thanks for your reviews and approvals - @jbrockmendel any further thoughts?

Member

@jbrockmendel jbrockmendel left a comment

Nice, thanks for being persistent on this

@MarcoGorelli
Member Author

Nice, thanks for being persistent on this

Thanks!

@WillAyd @mroeschke any further comments, or good-to-merge?

@MarcoGorelli MarcoGorelli added this to the 2.0 milestone Dec 29, 2022
@WillAyd WillAyd merged commit 502919e into pandas-dev:main Dec 29, 2022
@WillAyd
Member

WillAyd commented Dec 29, 2022

Thanks @MarcoGorelli

# Store the out_tzoffset in seconds
# since we store the total_seconds of
# dateutil.tz.tzoffset objects
tz = timezone(timedelta(minutes=out_tzoffset))
Member

In the analogous block in tslib we then adjust value using tz_localize_to_utc. Do we need to do that here?

Member Author

It happens a few levels up, here:

tz_results = np.empty(len(result), dtype=object)
for zone in unique(timezones):
    mask = timezones == zone
    dta = DatetimeArray(result[mask]).tz_localize(zone)
    if utc:
        if dta.tzinfo is None:
            dta = dta.tz_localize("utc")
        else:
            dta = dta.tz_convert("utc")
    tz_results[mask] = dta
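
At the user level, the effect of that branch looks roughly like this (a hedged illustration, not code from this PR):

import pandas as pd

vals = ["2020-01-01 00:00 +01:00", "2020-01-01 00:00 +02:00"]

# mixed offsets without utc=True: an object-dtype Index, one fixed-offset tz per value
pd.to_datetime(vals, format="%Y-%m-%d %H:%M %z")

# utc=True: everything is converted to a single UTC-aware DatetimeIndex
pd.to_datetime(vals, format="%Y-%m-%d %H:%M %z", utc=True)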

Member

Makes sense, thanks. Would it be viable to use the same pattern so we can share more code?

Member Author

That would indeed be good - I'll see what I can do.
