BUG: parsing mixed-offset Timestamps with errors='ignore' and no format raises #50586

MarcoGorelli · 2023-01-05T16:09:35Z

This is a refactor I'd been wanting to do for a while. Fortunately, it closes #50585, and also addresses the performance issue reported in #35296 (comment)

closes BUG: parsing mixed-offset Timestamps with errors='ignore' and no format raises #50585 (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

I'd highly recommend setting "ignore whitespace" when reviewing this

MarcoGorelli · 2023-01-05T19:20:24Z

pandas/_libs/tslib.pyx

-                else:
-                    if is_coerce:
-                        iresult[i] = NPY_NAT


taking this outside the loop

MarcoGorelli · 2023-01-05T19:20:57Z

pandas/_libs/tslib.pyx

-                        except (ValueError, OverflowError):
-                            if is_coerce:
-                                iresult[i] = NPY_NAT
-                                continue
-                            raise TypeError(
-                                f"invalid string coercion to datetime "
-                                f"for \"{val}\", at position {i}"
-                            )


taking this outside the loop (also, why was a ValueError being re-raised as TypeError? Doesn't look right to me)

Yeah I would think this should have just raised the original ValueError

MarcoGorelli · 2023-01-05T19:21:48Z

pandas/_libs/tslib.pyx

-                    if not string_to_dts_failed:
-                        # No error reported by string_to_dts, pick back up
-                        # where we left off
-                        value = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
-                        if out_local == 1:
-                            seen_datetime_offset = True
-                            # Store the out_tzoffset in seconds
-                            # since we store the total_seconds of
-                            # dateutil.tz.tzoffset objects
-                            out_tzoffset_vals.add(out_tzoffset * 60.)
-                            tz = timezone(timedelta(minutes=out_tzoffset))
-                            value = tz_localize_to_utc_single(value, tz)
-                            out_local = 0
-                            out_tzoffset = 0
-                        else:
-                            # Add a marker for naive string, to track if we are
-                            # parsing mixed naive and aware strings
-                            out_tzoffset_vals.add("naive")
-                        iresult[i] = value
-                        check_dts_bounds(&dts)
+                if len(val) == 0 or val in nat_strings:
+                    iresult[i] = NPY_NAT
+                    continue

-                else:
-                    if is_coerce:
-                        iresult[i] = NPY_NAT
+                string_to_dts_failed = string_to_dts(
+                    val, &dts, &out_bestunit, &out_local,
+                    &out_tzoffset, False, None, False
+                )
+                if string_to_dts_failed:
+                    # An error at this point is a _parsing_ error
+                    # specifically _not_ OutOfBoundsDatetime
+                    if parse_today_now(val, &iresult[i], utc):
+                        continue
+
+                    py_dt = parse_datetime_string(val,
+                                                  dayfirst=dayfirst,
+                                                  yearfirst=yearfirst)
+                    # If the dateutil parser returned tzinfo, capture it
+                    # to check if all arguments have the same tzinfo
+                    tz = py_dt.utcoffset()
+
+                    if tz is not None:
+                        seen_datetime_offset = True
+                        # dateutil timezone objects cannot be hashed, so
+                        # store the UTC offsets in seconds instead
+                        out_tzoffset_vals.add(tz.total_seconds())
                    else:
-                        raise TypeError(f"{type(val)} is not convertible to datetime")
+                        # Add a marker for naive string, to track if we are
+                        # parsing mixed naive and aware strings
+                        out_tzoffset_vals.add("naive")

-            except OutOfBoundsDatetime as ex:
-                ex.args = (f"{ex}, at position {i}",)
-                if is_coerce:
-                    iresult[i] = NPY_NAT
-                    continue
-                raise
+                    _ts = convert_datetime_to_tsobject(py_dt, None)
+                    iresult[i] = _ts.value
+                else:


we have here:

if string_to_dts_failed: ... if not_string_to_dts_failed: ....

Might as well change the second one to else

MarcoGorelli · 2023-01-05T19:24:31Z

pandas/_libs/tslib.pyx


-    except TypeError:
-        return _array_to_datetime_object(values, errors, dayfirst, yearfirst)
+        except (TypeError, OverflowError, ValueError) as ex:


currenly, OutOfBoundsDatetime is in a separate block. That's because of #50587

If it's OK, I'll address that in a separate PR, to not make the diff too big here

MarcoGorelli · 2023-01-05T19:27:32Z

doc/source/whatsnew/v2.0.0.rst

@@ -105,7 +105,6 @@ Other enhancements
 - :meth:`DataFrame.plot.hist` now recognizes ``xlabel`` and ``ylabel`` arguments (:issue:`49793`)
 - Improved error message in :func:`to_datetime` for non-ISO8601 formats, informing users about the position of the first error (:issue:`50361`)
 - Improved error message when trying to align :class:`DataFrame` objects (for example, in :func:`DataFrame.compare`) to clarify that "identically labelled" refers to both index and columns (:issue:`50083`)
- Performance improvement in :func:`to_datetime` when format is given or can be inferred (:issue:`50465`)


this one was in the wrong place to begin with - my bad, I missed this when reviewing

MarcoGorelli · 2023-01-05T19:35:57Z

pandas/tests/tools/test_to_datetime.py

        ts_strings = ["200622-12-31", "111111-24-11"]
        with pytest.raises(
            ValueError,
-            match=r"^hour must be in 0\.\.23: 111111-24-11, at position 1$",
+            match=(
+                r"^offset must be a timedelta strictly between "
+                r"-timedelta\(hours=24\) and timedelta\(hours=24\)., at position 0$"
+            ),


this is a fun one. on 1.5.x, if you only parse the first input, you get the "offset must be a timedelta strictly..." message, but not if you include the second one

In [6]: to_datetime(["200622-12-31"]) ValueError: offset must be a timedelta strictly between -timedelta(hours=24) and timedelta(hours=24). In [7]: to_datetime(["200622-12-31", "111111-24-11"]) ParserError: hour must be in 0..23: 111111-24-11

Now, you'll get the same message in both cases

mroeschke

Looks fairly okay. Might be open to changing that TypeError to a ValueError mentioned in https://github.com/pandas-dev/pandas/pull/50586/files?diff=unified&w=0#r1062815771

MarcoGorelli · 2023-01-06T11:02:43Z

sure - it's actually changed already, because I've removed the try-except (within the try-except, within the try-except...who'd have thought all these nested try-excepts would cause issues 😄 ) which re-reraised ValueError as TypeError - now the original error is raised (see one of the fixed tests)

The performance gain on the snippet from #35296 is huge here BTW:

1.5.2:

In [4]: %%time
   ...: df = pd.read_csv(csv_file, usecols=['ts_col'],
   ...:     date_parser=lambda x: pd.to_datetime(x, format=timestamp_format),
   ...:     parse_dates=['ts_col']
   ...: )
   ...:
   ...:
<timed exec>:1: FutureWarning:
        Use pd.to_datetime instead.

CPU times: user 21.7 s, sys: 9.87 ms, total: 21.7 s
Wall time: 21.7 s

Here:

In [2]: %%time
   ...: 
   ...: df = pd.read_csv(csv_file, usecols=['ts_col'],
   ...:     date_parser=lambda x: pd.to_datetime(x, format=timestamp_format),
   ...:     parse_dates=['ts_col']
   ...: )
   ...: 
   ...: 
CPU times: user 306 ms, sys: 22.9 ms, total: 329 ms
Wall time: 331 ms

Thanks @sm-Fifteen for your report, this was really useful!

mroeschke · 2023-01-06T19:53:41Z

Thanks @MarcoGorelli

MarcoGorelli force-pushed the ref-tslib branch 3 times, most recently from b6f590c to 79ce542 Compare January 5, 2023 16:49

MarcoGorelli marked this pull request as ready for review January 5, 2023 16:49

MarcoGorelli added Bug Datetime Datetime data dtype labels Jan 5, 2023

MarcoGorelli force-pushed the ref-tslib branch 3 times, most recently from 791c728 to 2381291 Compare January 5, 2023 17:40

MarcoGorelli added 3 commits January 5, 2023 18:01

refactor and fix bug

9d29eea

simplify

a334a84

whatsnew notes

1a7d888

MarcoGorelli marked this pull request as draft January 5, 2023 18:17

MarcoGorelli mentioned this pull request Jan 5, 2023

DOC: Reccomended use of read_csv's date_parser parameter is very slow #35296

Closed

MarcoGorelli force-pushed the ref-tslib branch from 2381291 to 9d29eea Compare January 5, 2023 19:16

MarcoGorelli commented Jan 5, 2023

View reviewed changes

MarcoGorelli marked this pull request as ready for review January 5, 2023 19:27

MarcoGorelli commented Jan 5, 2023

View reviewed changes

MarcoGorelli requested review from jbrockmendel and mroeschke January 5, 2023 21:58

mroeschke reviewed Jan 5, 2023

View reviewed changes

Merge branch 'main' into ref-tslib

79e5cc9

MarcoGorelli mentioned this pull request Jan 6, 2023

API deprecate date_parser, add date_format #50601

Closed

MarcoGorelli requested a review from mroeschke January 6, 2023 13:59

mroeschke approved these changes Jan 6, 2023

View reviewed changes

mroeschke added this to the 2.0 milestone Jan 6, 2023

mroeschke merged commit d0e6933 into pandas-dev:main Jan 6, 2023

MarcoGorelli mentioned this pull request Jan 10, 2023

BUG: parsing boolean raises TypeError without format, ValueError with format #50584

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: parsing mixed-offset Timestamps with errors='ignore' and no format raises #50586

BUG: parsing mixed-offset Timestamps with errors='ignore' and no format raises #50586

MarcoGorelli commented Jan 5, 2023 •

edited

Loading

MarcoGorelli Jan 5, 2023

MarcoGorelli Jan 5, 2023

mroeschke Jan 5, 2023

MarcoGorelli Jan 5, 2023

MarcoGorelli Jan 5, 2023

MarcoGorelli Jan 5, 2023

MarcoGorelli Jan 5, 2023

mroeschke left a comment

MarcoGorelli commented Jan 6, 2023 •

edited

Loading

mroeschke commented Jan 6, 2023

BUG: parsing mixed-offset Timestamps with errors='ignore' and no format raises #50586

BUG: parsing mixed-offset Timestamps with errors='ignore' and no format raises #50586

Conversation

MarcoGorelli commented Jan 5, 2023 • edited Loading

MarcoGorelli Jan 5, 2023

Choose a reason for hiding this comment

MarcoGorelli Jan 5, 2023

Choose a reason for hiding this comment

mroeschke Jan 5, 2023

Choose a reason for hiding this comment

MarcoGorelli Jan 5, 2023

Choose a reason for hiding this comment

MarcoGorelli Jan 5, 2023

Choose a reason for hiding this comment

MarcoGorelli Jan 5, 2023

Choose a reason for hiding this comment

MarcoGorelli Jan 5, 2023

Choose a reason for hiding this comment

mroeschke left a comment

Choose a reason for hiding this comment

MarcoGorelli commented Jan 6, 2023 • edited Loading

mroeschke commented Jan 6, 2023

MarcoGorelli commented Jan 5, 2023 •

edited

Loading

MarcoGorelli commented Jan 6, 2023 •

edited

Loading