Skip to content

clib.conversion._to_numpy: Add tests for PyArrow's timestamp type #3621

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Nov 28, 2024
53 changes: 53 additions & 0 deletions pygmt/tests/test_clib_to_numpy.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,18 @@

_HAS_PYARROW = True
except ImportError:

class pa: # noqa: N801
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pa.timestamp() with tz doesn't have string aliases, so we can't use strings like timestamp[s, UTC]. So we have to define a dummy function here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid this dummy class by starting from a pandas.Timestamp instead of Python datetime? The timezone info can be carried over:

import pyarrow as pa
import pandas as pd
from zoneinfo import ZoneInfo

array = pa.array(
    [
        pd.Timestamp(2024, 1, 2, 3, 4, 5, tzinfo=ZoneInfo("America/New_York")),
        pd.Timestamp(2024, 1, 2, 3, 4, 6, tzinfo=ZoneInfo("America/New_York")),
    ]
)
print(array.type)
# timestamp[us, tz=America/New_York]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, but I feel it will make the pytest parametrize more complicated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's keep this dummy class then.

"""
A dummy class to mimic pyarrow.
"""

@staticmethod
def timestamp(*args, **kwargs):
"""
A dummy function to mimic pyarrow.timestamp.
"""

_HAS_PYARROW = False


Expand Down Expand Up @@ -244,6 +256,7 @@ def test_to_numpy_pandas_series_pyarrow_dtypes_date(dtype, expected_dtype):
# - Date types:
# - date32[day]
# - date64[ms]
# - Datetime types: timestamp
#
# In PyArrow, array types can be specified in two ways:
#
Expand Down Expand Up @@ -366,3 +379,43 @@ def test_to_numpy_pyarrow_array_pyarrow_dtypes_date(dtype, expected_dtype):
result,
np.array(["2024-01-01", "2024-01-02", "2024-01-03"], dtype=expected_dtype),
)


@pytest.mark.skipif(not _HAS_PYARROW, reason="pyarrow is not installed")
@pytest.mark.parametrize(
("dtype", "expected_dtype"),
[
pytest.param(None, "datetime64[us]", id="None"),
pytest.param("timestamp[s]", "datetime64[s]", id="timestamp[s]"),
pytest.param("timestamp[ms]", "datetime64[ms]", id="timestamp[ms]"),
pytest.param("timestamp[us]", "datetime64[us]", id="timestamp[us]"),
pytest.param("timestamp[ns]", "datetime64[ns]", id="timestamp[ns]"),
pytest.param(
pa.timestamp("s", tz="UTC"), "datetime64[s]", id="timestamp[s, tz=UTC]"
), # pa.timestamp with tz has no string alias.
pytest.param(
pa.timestamp("s", tz="America/New_York"),
"datetime64[s]",
id="timestamp[s, tz=America/New_York]",
),
pytest.param(
pa.timestamp("s", tz="+07:30"),
"datetime64[s]",
id="timestamp[s, tz=+07:30]",
),
],
)
def test_to_numpy_pyarrow_timestamp(dtype, expected_dtype):
"""
Test the _to_numpy function with PyArrow arrays of PyArrow datetime types.

pyarrow.timestamp(unit, tz=None) can accept units "s", "ms", "us", and "ns".

Reference: https://arrow.apache.org/docs/python/generated/pyarrow.timestamp.html
"""
data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
array = pa.array(data, type=dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the timezone part of the pa.timestamp type actually working?

data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
array = pa.array(data, type=pa.timestamp("s", tz="America/New_York"))
print(array)
# [
#   2024-01-02 03:04:05Z,
#   2024-01-02 03:04:06Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T03:04:05' '2024-01-02T03:04:06']

If using pd.Timestamp instead

import pandas as pd
import pyarrow as pa
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
from pygmt.clib.conversion import _to_numpy

array = pa.array(
    [
        pd.Timestamp(2024, 1, 2, 3, 4, 5, tzinfo=ZoneInfo("America/New_York")),
        pd.Timestamp(2024, 1, 2, 3, 4, 6, tzinfo=ZoneInfo("America/New_York")),
    ]
)
print(array)
# [
#   2024-01-02 08:04:05.000000Z,
#   2024-01-02 08:04:06.000000Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T08:04:05.000000' '2024-01-02T08:04:06.000000']

For the first one, the timezone offset for New York (UTC-5) doesn't seem to be applied? Whereas for the second one, it is converting from New York timezone (UTC-5) to UTC by adding 5 hours.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the timezone information is applied but not shown when representing the array.

In [1]: from datetime import datetime

In [2]: import pyarrow as pa

# For pa.scalar, it's shown with timezone offset
In [3]: pa.scalar(datetime(2024, 1, 2, 3, 4, 5), type=pa.timestamp("s", tz="America/New_York"))
Out[3]: <pyarrow.TimestampScalar: '2024-01-01T22:04:05-0500'>

In [4]: array = pa.array([datetime(2024, 1, 2, 3, 4, 5)], type=pa.timestamp("s", tz="America/New_York"))
# For pa.array, it's shown without timezone offset
In [5]: array
Out[5]:
<pyarrow.lib.TimestampArray object at 0x7f52feace740>
[
  2024-01-02 03:04:05Z
]

# When converted to pandas, it's shown with timezone offset
In [6]: array.to_pandas()
Out[6]:
0   2024-01-01 22:04:05-05:00
dtype: datetime64[s, America/New_York]

# When converted to numpy, timezone information is lost, since numpy doesn's have tz support.
In [7]: array.to_numpy()
Out[7]: array(['2024-01-02T03:04:05'], dtype='datetime64[s]')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely related to apache/arrow#40122.

Copy link
Member Author

@seisman seisman Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the timezone part of the pa.timestamp type actually working?

data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
array = pa.array(data, type=pa.timestamp("s", tz="America/New_York"))
print(array)
# [
#   2024-01-02 03:04:05Z,
#   2024-01-02 03:04:06Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T03:04:05' '2024-01-02T03:04:06']

If using pd.Timestamp instead

import pandas as pd
import pyarrow as pa
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
from pygmt.clib.conversion import _to_numpy

array = pa.array(
    [
        pd.Timestamp(2024, 1, 2, 3, 4, 5, tzinfo=ZoneInfo("America/New_York")),
        pd.Timestamp(2024, 1, 2, 3, 4, 6, tzinfo=ZoneInfo("America/New_York")),
    ]
)
print(array)
# [
#   2024-01-02 08:04:05.000000Z,
#   2024-01-02 08:04:06.000000Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T08:04:05.000000' '2024-01-02T08:04:06.000000']

For the first one, the timezone offset for New York (UTC-5) doesn't seem to be applied? Whereas for the second one, it is converting from New York timezone (UTC-5) to UTC by adding 5 hours.

Also need to note the difference between datetime and pd.Timestamp. datetime is timezone-unware, while pd.Timestamp has timezone support.

So, in your first example:

  • actual datetime: 2024-01-02T03:04:05+00:00
  • In pyarrow.array, the actual datetime is 2024-01-01T22:04:05-05:00, but UTC time (2024-01-02T03:04:05) is shown
  • to_numpy: 2024-01-02T03:04:05 (UTC time)

In your 2nd example:

  • the actual datetime: 2024-01-02T03:04:05-05:00
  • In pyarrow.array: the UTC time (2024-01-02 08:04:05) is shown
  • to_numpy: 2024-01-02T08:04:05

So, I think there is no inconsistency. The pyarrow.array has timezone stored internally, but its repr always shows UTC time. And when converting to numpy array (using either its own to_numpy() method or np.ascontinuousarray()), the datetime are always converted to UTC since numpy.datetime64 doesn't have timezone support.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://stackoverflow.com/a/73276431

Arrow internally stores datetime as UTC + timezone info and will print it as such

Copy link
Member

@weiji14 weiji14 Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see now that there is indeed consistency, thanks for clearing this up, the repr on pyarrow.array is indeed confusing 😅 . So in summary:

class/type has timezone support Link
Python datetime https://docs.python.org/3/library/datetime.html#datetime.datetime.tzinfo
pandas.Timestamp https://pandas.pydata.org/pandas-docs/version/2.2/reference/api/pandas.Timestamp.html
pyarrow.timestamp (type) https://arrow.apache.org/docs/17.0/python/generated/pyarrow.timestamp.html
numpy.datetime64 https://numpy.org/doc/2.1/reference/arrays.scalars.html#numpy.datetime64

Since we're converting to NumPy, the timezone information will always be lost (everything is converted to UTC). What does this mean for PyGMT/GMT? Does it mean that users cannot plot data in a specific timezone, because it will always convert to UTC?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As shown below, GMT itself doesn't have timezone support, similar to numpy.datetime64. The big difference is that GMT simply ignore the TZ infor, while the to_numpy conversion always converts to UTC.

gmt begin map
gmt basemap -R2024-01-01T00:00:00+08:00/2024-01-02T00:00:00+08:00/1/10 -JX10c/4c -Baf 
echo 2024-01-01T08:00:00+08:00 5 | gmt plot -Sc1c -Gblack
gmt end show

map

Does it mean that users cannot plot data in a specific timezone, because it will always convert to UTC?

I think yes, unless we try hard to drop the TZ information before converting to numpy.datetime64. Below is an example, but if we want to support TZ, we need to find a more general way to do it.

In [1]: import pyarrow as pa

In [2]: from datetime import datetime

In [3]: data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]

In [4]: array = pa.array(data, type=pa.timestamp("s", tz="America/New_York"))

In [5]: array
Out[5]:
<pyarrow.lib.TimestampArray object at 0x7f78c07bafe0>
[
  2024-01-02 03:04:05Z,
  2024-01-02 03:04:06Z
]

In [6]: array.to_pandas().dt.tz_localize(None)
Out[6]:
0   2024-01-01 22:04:05
1   2024-01-01 22:04:06
dtype: datetime64[s]

In [7]: import numpy as np

In [8]: np.ascontiguousarray(array.to_pandas().dt.tz_localize(None))
Out[8]:
array(['2024-01-01T22:04:05', '2024-01-01T22:04:06'],
      dtype='datetime64[s]')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to open an issue to discuss this - whether to:

  1. Follow GMT (allow users to plot data at a non-UTC timezone, by ignoring the timezone offset)
  2. Follow NumPy, whereby the data will from any non-UTC timezone will be converted to UTC always.

If going with 2, we should at least raise a warning if a non-UTC timezone is used, that a conversion is taking place. If going with 1, we would need to special-case datetime types, that might mean extra logic in the _to_numpy function, or having to keep array_to_datetime (so don't merge #3507 yet).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also debating whether to following option 1 or 2, so a separate issue sounds good.

Since the current behavior is option 2 (as done in array_to_datetime), I think we should go with option 2 first, then revisit the timezone support later, which will be a breaking change anyway.

Copy link
Member

@weiji14 weiji14 Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll open an issue to discuss (edit: see #3656). And sure, we can go with option 2 for now.

result = _to_numpy(array)
_check_result(result, np.datetime64)
assert result.dtype == expected_dtype
npt.assert_array_equal(result, array)