BUG: dt.timezone pydatetime parsed differently for ISO vs non-ISO dates #50025

MarcoGorelli · 2022-12-02T19:27:09Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

ts = dt.datetime(2020, 1, 1, 17, tzinfo=dt.timezone(-dt.timedelta(hours=1)))
ts_str = '2020-01-01 00:00:00-01:00'
print(to_datetime([ts, ts_str], format='%Y-%m-%d %H:%M:%S%z'))
print(to_datetime([ts, ts_str], format='%Y-%d-%m %H:%M:%S%z'))

Issue Description

This gives

DatetimeIndex(['2020-01-01 17:00:00-01:00', '2020-01-01 00:00:00-01:00'], dtype='datetime64[ns, pytz.FixedOffset(-60)]', freq=None)
Index([2020-01-01 17:00:00-01:00, 2020-01-01 00:00:00-01:00], dtype='object')

Expected Behavior

Not sure they should be different. I think ideally from a user's perspective, it shouldn't matter whether a format is ISO8601 or not

Installed Versions

INSTALLED VERSIONS

commit : 3377a6d
python : 3.8.15.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.0.0.dev0+817.g3377a6d33d
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : 6.59.0
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.7.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : 2022.11.0
fsspec : 2021.11.0
gcsfs : 2021.11.0
matplotlib : 3.6.2
numba : 0.56.4
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : 1.2.0
pyxlsb : 1.0.10
s3fs : 2021.11.0
scipy : 1.9.3
snappy :
sqlalchemy : 1.4.44
tables : 3.7.0
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : None
qtpy : None
pyqt5 : None
None

The text was updated successfully, but these errors were encountered:

mroeschke · 2022-12-02T22:19:19Z

FWIW I think we want any fixed offset timezone to always parse to datetime.timezone objects instead of pytz.FixedOffset eventually #49677

MarcoGorelli · 2022-12-03T10:01:15Z

FWIW I think we want any fixed offset timezone to always parse to datetime.timezone objects instead of pytz.FixedOffset eventually #49677

thanks! looks like that'll be for pandas3.0 at least

for the next one though, do you agree that Index([2020-01-01 17:00:00-01:00, 2020-01-01 00:00:00-01:00], dtype='object') isn't right?

mroeschke · 2022-12-05T19:53:37Z

do you agree that Index([2020-01-01 17:00:00-01:00, 2020-01-01 00:00:00-01:00], dtype='object') isn't right?

I'd actually say it's technically correct. The ts timezone is a dt.timezone object while ts_str timezone will be converted to a pytz.FixedOffset object even though they represent the same offset. I remember fixing some related bugs in the past with the philosophy of "don't coerce different timezone library objects into the same object unless an explicit operation like tz_convert("UTC") is called"

Similar reason why I think this is the correct behavior too

In [15]: pd.Index([pd.Timestamp("2022", tz="dateutil/US/Pacific"), pd.Timestamp("2022", tz="US/Pacific")])
Out[15]: Index([2022-01-01 00:00:00-08:00, 2022-01-01 00:00:00-08:00], dtype='object')

In [16]: pd.Index([pd.Timestamp("2022", tz="US/Pacific"), pd.Timestamp("2022", tz="US/Pacific")])
Out[16]: DatetimeIndex(['2022-01-01 00:00:00-08:00', '2022-01-01 00:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)

MarcoGorelli · 2022-12-05T20:04:14Z

Thanks - yeah I can see that argument

OK so it's the ISO path that would need fixing

MarcoGorelli · 2022-12-05T22:16:29Z

@jbrockmendel do you have thoughts on this?

I think the solution will need to be to combine the ISO and non-ISO paths (the only difference being how they handle string inputs), I'm just trying to figure out which parts to keep from which

Between this and #50071, it looks like we need to aim for a mixture of them

jbrockmendel · 2022-12-05T23:10:46Z

BTW @MarcoGorelli im getting a lot of pings and trying to triage. pls let me know if any are time-sensitive or blockers etc.

I think the solution will need to be to combine the ISO and non-ISO paths

Do you mean moving the array_strptime logic into array_to_datetime? if so, yikes.

looks like that'll be for pandas3.0 at least

id be very happy to revive that. #49677 only changed fixed-offsets, so didnt require zoneinfo. i.e. doesnt need to wait until we drop py38

I lean towards agreeing with Marco that the behavior should not hinge on what implementations of FixedOffset the user or our defaults use.

mroeschke · 2022-12-05T23:16:42Z

id be very happy to revive that. #49677 only changed fixed-offsets, so didnt require zoneinfo. i.e. doesnt need to wait until we drop py38

True, I think this would be an okay API change and signal that pandas is moving toward more stdlib tz implementations anyways

MarcoGorelli · 2022-12-06T09:45:00Z

Here's another inconsistency: when there's a timezone-aware string and a timezone-naive datetime.

In the ISO case, the naive one is converted, whilst in the non-ISO case, we get an object Index.

In [2]: pd.to_datetime(["2020-01-01 01:00:00-01:00", datetime(2020, 1, 1, 3, 0)], format='%Y-%d-%m %H:%M:%S%z')
Out[2]: Index([2020-01-01 01:00:00-01:00, 2020-01-01 03:00:00], dtype='object')

In [3]: pd.to_datetime(["2020-01-01 01:00:00-01:00", datetime(2020, 1, 1, 3, 0)], format='%Y-%m-%d %H:%M:%S%z')
Out[3]: DatetimeIndex(['2020-01-01 01:00:00-01:00', '2020-01-01 02:00:00-01:00'], dtype='datetime64[ns, pytz.FixedOffset(-60)]', freq=None)

I think I prefer the first one

BTW @MarcoGorelli im getting a lot of pings and trying to triage. pls let me know if any are time-sensitive or blockers etc.

Thanks @jbrockmendel , and sorry for the all the pings. I'd really like to get PDEP4 in for pandas2.0, so if we could agree here on which path (ISO or non-ISO) is correct on what, then that'd help unblock it

Do you mean moving the array_strptime logic into array_to_datetime? if so, yikes.

I was thinking the opposite (calling string_to_dts from within array_strptime)

jbrockmendel · 2022-12-07T23:04:50Z

Unless the user passed utc=True, then seeing a tznaive pydatetime and a tzaware-when-parsed string should raise in array_to_datetime, just like it would if the tzaware object were a pydatetime. So this should go down the relevant fallback path. i.e. i think the behavior in the first case looks more-correct.

MarcoGorelli · 2023-01-06T14:13:42Z

closed in #50242

MarcoGorelli added Bug Needs Triage Issue that has not been reviewed by a pandas team member Datetime Datetime data dtype and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 2, 2022

mroeschke added the Timezones Timezone data dtype label Dec 2, 2022

MarcoGorelli mentioned this issue Dec 6, 2022

PDEP0004: implementation #49024

Merged

6 tasks

MarcoGorelli mentioned this issue Dec 14, 2022

BUG refactor datetime parsing and fix 8 bugs #50242

Merged

12 tasks

MarcoGorelli closed this as completed Jan 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: dt.timezone pydatetime parsed differently for ISO vs non-ISO dates #50025

BUG: dt.timezone pydatetime parsed differently for ISO vs non-ISO dates #50025

MarcoGorelli commented Dec 2, 2022

INSTALLED VERSIONS

mroeschke commented Dec 2, 2022 •

edited

Loading

MarcoGorelli commented Dec 3, 2022

mroeschke commented Dec 5, 2022

MarcoGorelli commented Dec 5, 2022

MarcoGorelli commented Dec 5, 2022

jbrockmendel commented Dec 5, 2022

mroeschke commented Dec 5, 2022

MarcoGorelli commented Dec 6, 2022

jbrockmendel commented Dec 7, 2022

MarcoGorelli commented Jan 6, 2023

BUG: dt.timezone pydatetime parsed differently for ISO vs non-ISO dates #50025

BUG: dt.timezone pydatetime parsed differently for ISO vs non-ISO dates #50025

Comments

MarcoGorelli commented Dec 2, 2022

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

mroeschke commented Dec 2, 2022 • edited Loading

MarcoGorelli commented Dec 3, 2022

mroeschke commented Dec 5, 2022

MarcoGorelli commented Dec 5, 2022

MarcoGorelli commented Dec 5, 2022

jbrockmendel commented Dec 5, 2022

mroeschke commented Dec 5, 2022

MarcoGorelli commented Dec 6, 2022

jbrockmendel commented Dec 7, 2022

MarcoGorelli commented Jan 6, 2023

mroeschke commented Dec 2, 2022 •

edited

Loading