-
Notifications
You must be signed in to change notification settings - Fork 228
clib.conversion._to_numpy: Add tests for PyArrow's timestamp type #3621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 4 commits
ce2b830
214f060
bf16e97
e0c2e7b
95ac538
7cbfeb8
c622438
7bc1f6f
5c2eb46
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -18,6 +18,18 @@ | ||||||||||||||||
|
|||||||||||||||||
_HAS_PYARROW = True | |||||||||||||||||
except ImportError: | |||||||||||||||||
|
|||||||||||||||||
class pa: # noqa: N801 | |||||||||||||||||
""" | |||||||||||||||||
A dummy class to mimic pyarrow. | |||||||||||||||||
""" | |||||||||||||||||
|
|||||||||||||||||
@staticmethod | |||||||||||||||||
def timestamp(*args, **kwargs): | |||||||||||||||||
seisman marked this conversation as resolved.
Show resolved
Hide resolved
|
|||||||||||||||||
""" | |||||||||||||||||
A dummy function to mimic pyarrow.timestamp. | |||||||||||||||||
""" | |||||||||||||||||
|
|||||||||||||||||
_HAS_PYARROW = False | |||||||||||||||||
|
|||||||||||||||||
|
|||||||||||||||||
|
@@ -244,6 +256,7 @@ def test_to_numpy_pandas_series_pyarrow_dtypes_date(dtype, expected_dtype): | ||||||||||||||||
# - Date types: | |||||||||||||||||
# - date32[day] | |||||||||||||||||
# - date64[ms] | |||||||||||||||||
# - Datetime types: timestamp | |||||||||||||||||
seisman marked this conversation as resolved.
Show resolved
Hide resolved
|
|||||||||||||||||
# | |||||||||||||||||
# In PyArrow, array types can be specified in two ways: | |||||||||||||||||
# | |||||||||||||||||
|
@@ -366,3 +379,43 @@ def test_to_numpy_pyarrow_array_pyarrow_dtypes_date(dtype, expected_dtype): | ||||||||||||||||
result, | |||||||||||||||||
np.array(["2024-01-01", "2024-01-02", "2024-01-03"], dtype=expected_dtype), | |||||||||||||||||
) | |||||||||||||||||
|
|||||||||||||||||
|
|||||||||||||||||
@pytest.mark.skipif(not _HAS_PYARROW, reason="pyarrow is not installed") | |||||||||||||||||
@pytest.mark.parametrize( | |||||||||||||||||
("dtype", "expected_dtype"), | |||||||||||||||||
[ | |||||||||||||||||
pytest.param(None, "datetime64[us]", id="None"), | |||||||||||||||||
pytest.param("timestamp[s]", "datetime64[s]", id="timestamp[s]"), | |||||||||||||||||
pytest.param("timestamp[ms]", "datetime64[ms]", id="timestamp[ms]"), | |||||||||||||||||
pytest.param("timestamp[us]", "datetime64[us]", id="timestamp[us]"), | |||||||||||||||||
pytest.param("timestamp[ns]", "datetime64[ns]", id="timestamp[ns]"), | |||||||||||||||||
pytest.param( | |||||||||||||||||
pa.timestamp("s", tz="UTC"), "datetime64[s]", id="timestamp[s, tz=UTC]" | |||||||||||||||||
), # pa.timestamp with tz has no string alias. | |||||||||||||||||
pytest.param( | |||||||||||||||||
pa.timestamp("s", tz="America/New_York"), | |||||||||||||||||
"datetime64[s]", | |||||||||||||||||
id="timestamp[s, tz=America/New_York]", | |||||||||||||||||
), | |||||||||||||||||
pytest.param( | |||||||||||||||||
pa.timestamp("s", tz="+07:30"), | |||||||||||||||||
"datetime64[s]", | |||||||||||||||||
id="timestamp[s, tz=+07:30]", | |||||||||||||||||
), | |||||||||||||||||
], | |||||||||||||||||
) | |||||||||||||||||
def test_to_numpy_pyarrow_timestamp(dtype, expected_dtype): | |||||||||||||||||
""" | |||||||||||||||||
Test the _to_numpy function with PyArrow arrays of PyArrow datetime types. | |||||||||||||||||
seisman marked this conversation as resolved.
Show resolved
Hide resolved
|
|||||||||||||||||
|
|||||||||||||||||
pyarrow.timestamp(unit, tz=None) can accept units "s", "ms", "us", and "ns". | |||||||||||||||||
|
|||||||||||||||||
Reference: https://arrow.apache.org/docs/python/generated/pyarrow.timestamp.html | |||||||||||||||||
""" | |||||||||||||||||
data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)] | |||||||||||||||||
array = pa.array(data, type=dtype) | |||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is the timezone part of the data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
array = pa.array(data, type=pa.timestamp("s", tz="America/New_York"))
print(array)
# [
# 2024-01-02 03:04:05Z,
# 2024-01-02 03:04:06Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T03:04:05' '2024-01-02T03:04:06'] If using
For the first one, the timezone offset for New York (UTC-5) doesn't seem to be applied? Whereas for the second one, it is converting from New York timezone (UTC-5) to UTC by adding 5 hours. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems the timezone information is applied but not shown when representing the array. In [1]: from datetime import datetime
In [2]: import pyarrow as pa
# For pa.scalar, it's shown with timezone offset
In [3]: pa.scalar(datetime(2024, 1, 2, 3, 4, 5), type=pa.timestamp("s", tz="America/New_York"))
Out[3]: <pyarrow.TimestampScalar: '2024-01-01T22:04:05-0500'>
In [4]: array = pa.array([datetime(2024, 1, 2, 3, 4, 5)], type=pa.timestamp("s", tz="America/New_York"))
# For pa.array, it's shown without timezone offset
In [5]: array
Out[5]:
<pyarrow.lib.TimestampArray object at 0x7f52feace740>
[
2024-01-02 03:04:05Z
]
# When converted to pandas, it's shown with timezone offset
In [6]: array.to_pandas()
Out[6]:
0 2024-01-01 22:04:05-05:00
dtype: datetime64[s, America/New_York]
# When converted to numpy, timezone information is lost, since numpy doesn's have tz support.
In [7]: array.to_numpy()
Out[7]: array(['2024-01-02T03:04:05'], dtype='datetime64[s]') There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Likely related to apache/arrow#40122. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Also need to note the difference between So, in your first example:
In your 2nd example:
So, I think there is no inconsistency. The pyarrow.array has timezone stored internally, but its repr always shows UTC time. And when converting to numpy array (using either its own There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. https://stackoverflow.com/a/73276431
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, I see now that there is indeed consistency, thanks for clearing this up, the repr on pyarrow.array is indeed confusing 😅 . So in summary:
Since we're converting to NumPy, the timezone information will always be lost (everything is converted to UTC). What does this mean for PyGMT/GMT? Does it mean that users cannot plot data in a specific timezone, because it will always convert to UTC? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As shown below, GMT itself doesn't have timezone support, similar to numpy.datetime64. The big difference is that GMT simply ignore the TZ infor, while the to_numpy conversion always converts to UTC.
I think yes, unless we try hard to drop the TZ information before converting to numpy.datetime64. Below is an example, but if we want to support TZ, we need to find a more general way to do it. In [1]: import pyarrow as pa
In [2]: from datetime import datetime
In [3]: data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
In [4]: array = pa.array(data, type=pa.timestamp("s", tz="America/New_York"))
In [5]: array
Out[5]:
<pyarrow.lib.TimestampArray object at 0x7f78c07bafe0>
[
2024-01-02 03:04:05Z,
2024-01-02 03:04:06Z
]
In [6]: array.to_pandas().dt.tz_localize(None)
Out[6]:
0 2024-01-01 22:04:05
1 2024-01-01 22:04:06
dtype: datetime64[s]
In [7]: import numpy as np
In [8]: np.ascontiguousarray(array.to_pandas().dt.tz_localize(None))
Out[8]:
array(['2024-01-01T22:04:05', '2024-01-01T22:04:06'],
dtype='datetime64[s]') There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we need to open an issue to discuss this - whether to:
If going with 2, we should at least raise a warning if a non-UTC timezone is used, that a conversion is taking place. If going with 1, we would need to special-case datetime types, that might mean extra logic in the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm also debating whether to following option 1 or 2, so a separate issue sounds good. Since the current behavior is option 2 (as done in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, I'll open an issue to discuss (edit: see #3656). And sure, we can go with option 2 for now. |
|||||||||||||||||
result = _to_numpy(array) | |||||||||||||||||
_check_result(result, np.datetime64) | |||||||||||||||||
assert result.dtype == expected_dtype | |||||||||||||||||
npt.assert_array_equal(result, array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pa.timestamp()
withtz
doesn't have string aliases, so we can't use strings liketimestamp[s, UTC]
. So we have to define a dummy function here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we avoid this dummy class by starting from a
pandas.Timestamp
instead of Pythondatetime
? The timezone info can be carried over:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can, but I feel it will make the pytest parametrize more complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's keep this dummy class then.