Skip to content

BUG: dt.timezone pydatetime parsed differently for ISO vs non-ISO dates #50025

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
MarcoGorelli opened this issue Dec 2, 2022 · 10 comments
Closed
3 tasks done
Labels
Bug Datetime Datetime data dtype Timezones Timezone data dtype

Comments

@MarcoGorelli
Copy link
Member

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

ts = dt.datetime(2020, 1, 1, 17, tzinfo=dt.timezone(-dt.timedelta(hours=1)))
ts_str = '2020-01-01 00:00:00-01:00'
print(to_datetime([ts, ts_str], format='%Y-%m-%d %H:%M:%S%z'))
print(to_datetime([ts, ts_str], format='%Y-%d-%m %H:%M:%S%z'))

Issue Description

This gives

DatetimeIndex(['2020-01-01 17:00:00-01:00', '2020-01-01 00:00:00-01:00'], dtype='datetime64[ns, pytz.FixedOffset(-60)]', freq=None)
Index([2020-01-01 17:00:00-01:00, 2020-01-01 00:00:00-01:00], dtype='object')

Expected Behavior

Not sure they should be different. I think ideally from a user's perspective, it shouldn't matter whether a format is ISO8601 or not

Installed Versions

INSTALLED VERSIONS

commit : 3377a6d
python : 3.8.15.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.0.0.dev0+817.g3377a6d33d
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : 6.59.0
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.7.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : 2022.11.0
fsspec : 2021.11.0
gcsfs : 2021.11.0
matplotlib : 3.6.2
numba : 0.56.4
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : 1.2.0
pyxlsb : 1.0.10
s3fs : 2021.11.0
scipy : 1.9.3
snappy :
sqlalchemy : 1.4.44
tables : 3.7.0
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : None
qtpy : None
pyqt5 : None
None

@MarcoGorelli MarcoGorelli added Bug Needs Triage Issue that has not been reviewed by a pandas team member Datetime Datetime data dtype and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 2, 2022
@mroeschke mroeschke added the Timezones Timezone data dtype label Dec 2, 2022
@mroeschke
Copy link
Member

mroeschke commented Dec 2, 2022

FWIW I think we want any fixed offset timezone to always parse to datetime.timezone objects instead of pytz.FixedOffset eventually #49677

@MarcoGorelli
Copy link
Member Author

FWIW I think we want any fixed offset timezone to always parse to datetime.timezone objects instead of pytz.FixedOffset eventually #49677

thanks! looks like that'll be for pandas3.0 at least

for the next one though, do you agree that Index([2020-01-01 17:00:00-01:00, 2020-01-01 00:00:00-01:00], dtype='object') isn't right?

@mroeschke
Copy link
Member

do you agree that Index([2020-01-01 17:00:00-01:00, 2020-01-01 00:00:00-01:00], dtype='object') isn't right?

I'd actually say it's technically correct. The ts timezone is a dt.timezone object while ts_str timezone will be converted to a pytz.FixedOffset object even though they represent the same offset. I remember fixing some related bugs in the past with the philosophy of "don't coerce different timezone library objects into the same object unless an explicit operation like tz_convert("UTC") is called"

Similar reason why I think this is the correct behavior too

In [15]: pd.Index([pd.Timestamp("2022", tz="dateutil/US/Pacific"), pd.Timestamp("2022", tz="US/Pacific")])
Out[15]: Index([2022-01-01 00:00:00-08:00, 2022-01-01 00:00:00-08:00], dtype='object')

In [16]: pd.Index([pd.Timestamp("2022", tz="US/Pacific"), pd.Timestamp("2022", tz="US/Pacific")])
Out[16]: DatetimeIndex(['2022-01-01 00:00:00-08:00', '2022-01-01 00:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)

@MarcoGorelli
Copy link
Member Author

Thanks - yeah I can see that argument

OK so it's the ISO path that would need fixing

@MarcoGorelli
Copy link
Member Author

@jbrockmendel do you have thoughts on this?

I think the solution will need to be to combine the ISO and non-ISO paths (the only difference being how they handle string inputs), I'm just trying to figure out which parts to keep from which

Between this and #50071, it looks like we need to aim for a mixture of them

@jbrockmendel
Copy link
Member

BTW @MarcoGorelli im getting a lot of pings and trying to triage. pls let me know if any are time-sensitive or blockers etc.

I think the solution will need to be to combine the ISO and non-ISO paths

Do you mean moving the array_strptime logic into array_to_datetime? if so, yikes.

looks like that'll be for pandas3.0 at least

id be very happy to revive that. #49677 only changed fixed-offsets, so didnt require zoneinfo. i.e. doesnt need to wait until we drop py38

I lean towards agreeing with Marco that the behavior should not hinge on what implementations of FixedOffset the user or our defaults use.

@mroeschke
Copy link
Member

id be very happy to revive that. #49677 only changed fixed-offsets, so didnt require zoneinfo. i.e. doesnt need to wait until we drop py38

True, I think this would be an okay API change and signal that pandas is moving toward more stdlib tz implementations anyways

@MarcoGorelli
Copy link
Member Author

Here's another inconsistency: when there's a timezone-aware string and a timezone-naive datetime.

In the ISO case, the naive one is converted, whilst in the non-ISO case, we get an object Index.

In [2]: pd.to_datetime(["2020-01-01 01:00:00-01:00", datetime(2020, 1, 1, 3, 0)], format='%Y-%d-%m %H:%M:%S%z')
Out[2]: Index([2020-01-01 01:00:00-01:00, 2020-01-01 03:00:00], dtype='object')

In [3]: pd.to_datetime(["2020-01-01 01:00:00-01:00", datetime(2020, 1, 1, 3, 0)], format='%Y-%m-%d %H:%M:%S%z')
Out[3]: DatetimeIndex(['2020-01-01 01:00:00-01:00', '2020-01-01 02:00:00-01:00'], dtype='datetime64[ns, pytz.FixedOffset(-60)]', freq=None)

I think I prefer the first one

BTW @MarcoGorelli im getting a lot of pings and trying to triage. pls let me know if any are time-sensitive or blockers etc.

Thanks @jbrockmendel , and sorry for the all the pings. I'd really like to get PDEP4 in for pandas2.0, so if we could agree here on which path (ISO or non-ISO) is correct on what, then that'd help unblock it

Do you mean moving the array_strptime logic into array_to_datetime? if so, yikes.

I was thinking the opposite (calling string_to_dts from within array_strptime)

@jbrockmendel
Copy link
Member

Unless the user passed utc=True, then seeing a tznaive pydatetime and a tzaware-when-parsed string should raise in array_to_datetime, just like it would if the tzaware object were a pydatetime. So this should go down the relevant fallback path. i.e. i think the behavior in the first case looks more-correct.

@MarcoGorelli
Copy link
Member Author

closed in #50242

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Timezones Timezone data dtype
Projects
None yet
Development

No branches or pull requests

3 participants