Is v = np.array(v.dt.to_pydatetime()) still necessary? #4836


Closed
MarcoGorelli opened this issue Oct 27, 2024 · 4 comments
Labels: feature (something new), P2 (considered for next cycle)

@MarcoGorelli
Contributor

As far as I can tell, the lines

elif v.dtype.kind == "M":
    # Convert datetime Series/Index to numpy array of datetimes
    if isinstance(v, pd.Series):
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", FutureWarning)
            # Series.dt.to_pydatetime will return Index[object]
            # https://github.com/pandas-dev/pandas/pull/52459
            v = np.array(v.dt.to_pydatetime())

were introduced in #1163 to work around issues with displaying numpy datetime64 arrays.

However, have the numpy datetime64 issues since been fixed? Having built plotly from source, here's what I see on the master branch:

(screenshot: the chart renders correctly)

Looks like it displays fine

If I apply the diff

--- a/packages/python/plotly/_plotly_utils/basevalidators.py
+++ b/packages/python/plotly/_plotly_utils/basevalidators.py
@@ -95,20 +95,7 @@ def copy_to_readonly_numpy_array(v, kind=None, force_numeric=False):
 
     # Handle pandas Series and Index objects
     if pd and isinstance(v, (pd.Series, pd.Index)):
-        if v.dtype.kind in numeric_kinds:
-            # Get the numeric numpy array so we use fast path below
-            v = v.values
-        elif v.dtype.kind == "M":
-            # Convert datetime Series/Index to numpy array of datetimes
-            if isinstance(v, pd.Series):
-                with warnings.catch_warnings():
-                    warnings.simplefilter("ignore", FutureWarning)
-                    # Series.dt.to_pydatetime will return Index[object]
-                    # https://github.com/pandas-dev/pandas/pull/52459
-                    v = np.array(v.dt.to_pydatetime())
-            else:
-                # DatetimeIndex
-                v = v.to_pydatetime()
+        v = v.values

then it looks like pandas datetime Series still display fine

(screenshot: the chart renders correctly)


Asking in the context of #4790, as copy_to_readonly_numpy_array would need to handle other kinds of inputs (not just pandas series / index)

A plain conversion to numpy would be a lot faster than going via stdlib datetime objects

In [23]: %time np.array(s.dt.to_pydatetime())
CPU times: user 325 ms, sys: 8.34 ms, total: 333 ms
Wall time: 360 ms
Out[23]:
array([datetime.datetime(2000, 1, 1, 0, 0),
       datetime.datetime(2000, 1, 1, 1, 0),
       datetime.datetime(2000, 1, 1, 2, 0), ...,
       datetime.datetime(2114, 1, 29, 13, 0),
       datetime.datetime(2114, 1, 29, 14, 0),
       datetime.datetime(2114, 1, 29, 15, 0)], dtype=object)

In [24]: %time s.to_numpy()
CPU times: user 46 μs, sys: 0 ns, total: 46 μs
Wall time: 51.5 μs
Out[24]:
array(['2000-01-01T00:00:00.000000000', '2000-01-01T01:00:00.000000000',
       '2000-01-01T02:00:00.000000000', ...,
       '2114-01-29T13:00:00.000000000', '2114-01-29T14:00:00.000000000',
       '2114-01-29T15:00:00.000000000'], dtype='datetime64[ns]')
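To illustrate why the gap is so large (a minimal check, not code from plotly itself): `to_numpy()` on a tz-naive datetime Series hands back the underlying `datetime64[ns]` buffer, so no per-element Python objects are ever created:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.date_range("2000", periods=5, freq="h"))

# Stays in the native datetime64[ns] dtype; no stdlib datetime
# objects are materialized.
arr = s.to_numpy()
print(arr.dtype)  # datetime64[ns]
```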
On Oct 28, 2024, @gvwilson added the labels: feature (something new), P2 (considered for next cycle).
@emilykl
Contributor

emilykl commented Nov 12, 2024

I did a bit more digging on this issue to gather more background — @MarcoGorelli @FBruzzesi Hopefully this lines up with what you've found.

As described in #1160, the reported bug in plotly==3.2.0 was that pandas datetime series passed into figures became transformed into integers — for example:

df = pd.DataFrame({
    'date': pd.to_datetime(["2024-01-01 01:00", "2024-01-01 02:00"]),
    'value': [100, 200],
})
fig = go.Figure(
    data=[go.Scatter(x=df['date'], y=df['value'])],
)
print(fig.data[0].x.__repr__())

produced the output

array([1704070800000000000, 1704074400000000000], dtype=object)

and resulted in the following chart:

(screenshot: the resulting chart, plotted with integer x-values)


This bug was a side effect of the changes in #1149 (specifically it seems the addition of v = v.values caused the lower-down line new_v = np.array(v, dtype='object') to convert an array of pandas datetimes to an array with dtype=object whose individual items are integers).
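That failure mode is reproducible without plotly at all (a sketch of the mechanism, not plotly code): `datetime.datetime` only has microsecond resolution, so numpy's object cast of a nanosecond-unit array falls back to raw integer nanoseconds since the epoch:

```python
import numpy as np
import pandas as pd

v = pd.to_datetime(["2024-01-01 01:00", "2024-01-01 02:00"]).values
print(v.dtype)  # datetime64[ns]

# Casting a datetime64[ns] array to dtype=object yields plain
# integers rather than datetime objects, matching the bug report.
as_obj = np.array(v, dtype="object")
print(as_obj)
```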

So, #1163 revised the logic to call to_pydatetime (which returns an ndarray) on datetime series, rather than assigning v = v.values. After that change (first released in plotly==3.2.1), pandas datetime series were converted to ndarrays of python datetimes in fig.data:

array([datetime.datetime(2024, 1, 1, 1, 0), datetime.datetime(2024, 1, 1, 2, 0)], dtype=object)

and the chart was plotted correctly:

(screenshot: the correctly plotted chart)

(Incidentally, the behavior of to_pydatetime() returning an ndarray is deprecated and it returns a pandas Series as of pandas==2.2, so the behavior has already changed from what was intended.)
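For reference, the quoted pattern still normalizes both return types: wrapping the result in `np.array` turns either the old ndarray or the newer Series return value into an object array of stdlib datetimes (a version-agnostic sketch, assuming only that the deprecation may or may not have taken effect):

```python
import warnings
from datetime import datetime

import numpy as np
import pandas as pd

s = pd.Series(pd.to_datetime(["2024-01-01 01:00", "2024-01-01 02:00"]))

with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    # Older pandas returns an ndarray here; newer pandas may return a
    # Series. np.array() flattens either into dtype=object.
    out = np.array(s.dt.to_pydatetime())

print(out.dtype)  # object
print(isinstance(out[0], datetime))  # True
```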

A test was also added in #1163 to verify that when passing a pandas datetime series as input, the corresponding fig.data key consists of python datetime objects.


The latter behavior holds to this day on master for px figures — pandas datetime series passed into a px chart become ndarrays of python datetime objects in fig.data.

Interestingly, it seems to me that pandas datetime series passed into a graph_objects trace on master become ndarrays of np.datetime64 objects:

array(['2024-01-01T01:00:00.000000000', '2024-01-01T02:00:00.000000000'], dtype='datetime64[ns]')

The current implementation of the narwhals adoption follows the latter behavior, but extends it to px as well — pandas datetime series passed into either px or go become ndarrays of np.datetime64 objects in fig.data.

array(['2024-01-01T01:00:00.000000000', '2024-01-01T02:00:00.000000000'], dtype='datetime64[ns]')

The test added in #1163 was also modified to check that the fig.data key consists of np.datetime64 rather than python datetimes.

So the real question is — is that a problem? It seems to me that as long as they serialize to the same JSON as python datetimes, and the resulting charts are identical, it shouldn't be an issue — though of course I may be missing something. And since as far as I can tell, graph_objects figures were already converting to np.datetime64 anyway, it doesn't seem like a huge risk. The JSON serialization we end up with for python datetimes vs. np.datetime64 is... almost... identical:

on master (px):

"x":["2024-01-01T01:00:00","2024-01-01T02:00:00"]

on Narwhals branch (graph_objects and px):

"x":["2024-01-01T01:00:00.000000000","2024-01-01T02:00:00.000000000"]

The only difference I can see is the precision, I am not sure we need nanosecond precision but it also seems silly to remove it if it comes by default. The generated charts are identical.
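The precision difference is visible without plotly at all; it is just how each type stringifies (illustrative only):

```python
from datetime import datetime

import numpy as np

py_dt = datetime(2024, 1, 1, 1, 0)
np_dt = np.datetime64("2024-01-01T01:00:00", "ns")

# A python datetime serializes without fractional seconds when they
# are zero...
print(py_dt.isoformat())  # 2024-01-01T01:00:00
# ...while a datetime64[ns] keeps the nanosecond padding.
print(str(np_dt))  # 2024-01-01T01:00:00.000000000
```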

@MarcoGorelli
Contributor Author

Thanks for taking a careful look 🙏 !

Yup, exactly - some tests which check the type of the returned object needed changing, but the user-facing behaviour seems to be unchanged

I think any benchmark involving time series should show a noticeable performance boost, because it's a lot more performant to go from a pandas Series to a datetime64 numpy array than it is to go to an object array of stdlib datetime objects (e.g. with 100_000 elements, it's 13 thousand times faster):

In [30]: s = pd.Series(pd.date_range('2000', periods=100_000, freq='h'))

In [31]: results = %timeit -o np.array(s.dt.to_pydatetime())
<magic-timeit>:1: FutureWarning: The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result
36 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [32]: results.best
Out[32]: 0.033226083499903324

In [33]: results = %timeit -o s.to_numpy()
2.6 μs ± 133 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [34]: results.best
Out[34]: 2.5290634100019817e-06

In [35]: 0.033226083499903324 / 2.5290634100019817e-06
Out[35]: 13137.702822515346

@alexcjohnson
Collaborator

Seems fine to me. Nanosecond precision is a bit more than we can use - IIRC we do a bit better than native JS Date objects, which stop at milliseconds, but only by about one extra digit (we represent dates internally as floats, so close to 1970 you can probably get more digits, but conversely a few hundred years into the future or past you'll get fewer). But it's probably not worth removing the extra digits; the time saved on network transfer and deserializing will likely be balanced by the overhead during serialization.

Would be cool if we could get this to serialize and deserialize as a typedarray, but that’s a whole other can of worms 😬

@MarcoGorelli
Contributor Author

thanks all - closing as addressed by #4790
