-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: parsing mixed-offset Timestamps with errors='ignore' and no format raises #50586
Conversation
b6f590c
to
79ce542
Compare
791c728
to
2381291
Compare
2381291
to
9d29eea
Compare
else: | ||
if is_coerce: | ||
iresult[i] = NPY_NAT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taking this outside the loop
except (ValueError, OverflowError): | ||
if is_coerce: | ||
iresult[i] = NPY_NAT | ||
continue | ||
raise TypeError( | ||
f"invalid string coercion to datetime " | ||
f"for \"{val}\", at position {i}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taking this outside the loop (also, why was a ValueError
being re-raised as TypeError
? Doesn't look right to me)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I would think this should have just raised the original ValueError
if not string_to_dts_failed: | ||
# No error reported by string_to_dts, pick back up | ||
# where we left off | ||
value = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts) | ||
if out_local == 1: | ||
seen_datetime_offset = True | ||
# Store the out_tzoffset in seconds | ||
# since we store the total_seconds of | ||
# dateutil.tz.tzoffset objects | ||
out_tzoffset_vals.add(out_tzoffset * 60.) | ||
tz = timezone(timedelta(minutes=out_tzoffset)) | ||
value = tz_localize_to_utc_single(value, tz) | ||
out_local = 0 | ||
out_tzoffset = 0 | ||
else: | ||
# Add a marker for naive string, to track if we are | ||
# parsing mixed naive and aware strings | ||
out_tzoffset_vals.add("naive") | ||
iresult[i] = value | ||
check_dts_bounds(&dts) | ||
if len(val) == 0 or val in nat_strings: | ||
iresult[i] = NPY_NAT | ||
continue | ||
|
||
else: | ||
if is_coerce: | ||
iresult[i] = NPY_NAT | ||
string_to_dts_failed = string_to_dts( | ||
val, &dts, &out_bestunit, &out_local, | ||
&out_tzoffset, False, None, False | ||
) | ||
if string_to_dts_failed: | ||
# An error at this point is a _parsing_ error | ||
# specifically _not_ OutOfBoundsDatetime | ||
if parse_today_now(val, &iresult[i], utc): | ||
continue | ||
|
||
py_dt = parse_datetime_string(val, | ||
dayfirst=dayfirst, | ||
yearfirst=yearfirst) | ||
# If the dateutil parser returned tzinfo, capture it | ||
# to check if all arguments have the same tzinfo | ||
tz = py_dt.utcoffset() | ||
|
||
if tz is not None: | ||
seen_datetime_offset = True | ||
# dateutil timezone objects cannot be hashed, so | ||
# store the UTC offsets in seconds instead | ||
out_tzoffset_vals.add(tz.total_seconds()) | ||
else: | ||
raise TypeError(f"{type(val)} is not convertible to datetime") | ||
# Add a marker for naive string, to track if we are | ||
# parsing mixed naive and aware strings | ||
out_tzoffset_vals.add("naive") | ||
|
||
except OutOfBoundsDatetime as ex: | ||
ex.args = (f"{ex}, at position {i}",) | ||
if is_coerce: | ||
iresult[i] = NPY_NAT | ||
continue | ||
raise | ||
_ts = convert_datetime_to_tsobject(py_dt, None) | ||
iresult[i] = _ts.value | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have here:
if string_to_dts_failed:
...
if not_string_to_dts_failed:
....
Might as well change the second one to else
|
||
except TypeError: | ||
return _array_to_datetime_object(values, errors, dayfirst, yearfirst) | ||
except (TypeError, OverflowError, ValueError) as ex: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currenly, OutOfBoundsDatetime
is in a separate block. That's because of #50587
If it's OK, I'll address that in a separate PR, to not make the diff too big here
@@ -105,7 +105,6 @@ Other enhancements | |||
- :meth:`DataFrame.plot.hist` now recognizes ``xlabel`` and ``ylabel`` arguments (:issue:`49793`) | |||
- Improved error message in :func:`to_datetime` for non-ISO8601 formats, informing users about the position of the first error (:issue:`50361`) | |||
- Improved error message when trying to align :class:`DataFrame` objects (for example, in :func:`DataFrame.compare`) to clarify that "identically labelled" refers to both index and columns (:issue:`50083`) | |||
- Performance improvement in :func:`to_datetime` when format is given or can be inferred (:issue:`50465`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one was in the wrong place to begin with - my bad, I missed this when reviewing
ts_strings = ["200622-12-31", "111111-24-11"] | ||
with pytest.raises( | ||
ValueError, | ||
match=r"^hour must be in 0\.\.23: 111111-24-11, at position 1$", | ||
match=( | ||
r"^offset must be a timedelta strictly between " | ||
r"-timedelta\(hours=24\) and timedelta\(hours=24\)., at position 0$" | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a fun one. on 1.5.x, if you only parse the first input, you get the "offset must be a timedelta strictly..." message, but not if you include the second one
In [6]: to_datetime(["200622-12-31"])
ValueError: offset must be a timedelta strictly between -timedelta(hours=24) and timedelta(hours=24).
In [7]: to_datetime(["200622-12-31", "111111-24-11"])
ParserError: hour must be in 0..23: 111111-24-11
Now, you'll get the same message in both cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fairly okay. Might be open to changing that TypeError
to a ValueError
mentioned in https://github.com/pandas-dev/pandas/pull/50586/files?diff=unified&w=0#r1062815771
sure - it's actually changed already, because I've removed the try-except (within the try-except, within the try-except...who'd have thought all these nested try-excepts would cause issues 😄 ) which re-reraised ValueError as TypeError - now the original error is raised (see one of the fixed tests) The performance gain on the snippet from #35296 is huge here BTW: 1.5.2: In [4]: %%time
...: df = pd.read_csv(csv_file, usecols=['ts_col'],
...: date_parser=lambda x: pd.to_datetime(x, format=timestamp_format),
...: parse_dates=['ts_col']
...: )
...:
...:
<timed exec>:1: FutureWarning:
Use pd.to_datetime instead.
CPU times: user 21.7 s, sys: 9.87 ms, total: 21.7 s
Wall time: 21.7 s Here: In [2]: %%time
...:
...: df = pd.read_csv(csv_file, usecols=['ts_col'],
...: date_parser=lambda x: pd.to_datetime(x, format=timestamp_format),
...: parse_dates=['ts_col']
...: )
...:
...:
CPU times: user 306 ms, sys: 22.9 ms, total: 329 ms
Wall time: 331 ms Thanks @sm-Fifteen for your report, this was really useful! |
Thanks @MarcoGorelli |
This is a refactor I'd been wanting to do for a while. Fortunately, it closes #50585, and also addresses the performance issue reported in #35296 (comment)
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.I'd highly recommend setting "ignore whitespace" when reviewing this