Is v = np.array(v.dt.to_pydatetime()) still necessary? #4836


Closed
MarcoGorelli opened this issue Oct 27, 2024 · 4 comments
Labels: feature (something new), P2 (considered for next cycle)

@MarcoGorelli
Contributor

As far as I can tell, the lines

elif v.dtype.kind == "M":
    # Convert datetime Series/Index to numpy array of datetimes
    if isinstance(v, pd.Series):
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", FutureWarning)
            # Series.dt.to_pydatetime will return Index[object]
            # https://github.com/pandas-dev/pandas/pull/52459
            v = np.array(v.dt.to_pydatetime())

were introduced in #1163 to work around issues with displaying numpy datetime64 arrays.

However, have the numpy datetime64 issues since been fixed? Having built plotly from source, here's what I see on the master branch:

(screenshot: the chart renders correctly)

Looks like it displays fine

If I apply the diff

--- a/packages/python/plotly/_plotly_utils/basevalidators.py
+++ b/packages/python/plotly/_plotly_utils/basevalidators.py
@@ -95,20 +95,7 @@ def copy_to_readonly_numpy_array(v, kind=None, force_numeric=False):
 
     # Handle pandas Series and Index objects
     if pd and isinstance(v, (pd.Series, pd.Index)):
-        if v.dtype.kind in numeric_kinds:
-            # Get the numeric numpy array so we use fast path below
-            v = v.values
-        elif v.dtype.kind == "M":
-            # Convert datetime Series/Index to numpy array of datetimes
-            if isinstance(v, pd.Series):
-                with warnings.catch_warnings():
-                    warnings.simplefilter("ignore", FutureWarning)
-                    # Series.dt.to_pydatetime will return Index[object]
-                    # https://github.com/pandas-dev/pandas/pull/52459
-                    v = np.array(v.dt.to_pydatetime())
-            else:
-                # DatetimeIndex
-                v = v.to_pydatetime()
+        v = v.values

then it looks like pandas datetime Series still display fine

(screenshot: the chart renders correctly)


Asking in the context of #4790, as copy_to_readonly_numpy_array would need to handle other kinds of inputs (not just pandas series / index)

A plain conversion to numpy would be a lot faster than going via stdlib datetime objects

In [23]: %time np.array(s.dt.to_pydatetime())
CPU times: user 325 ms, sys: 8.34 ms, total: 333 ms
Wall time: 360 ms
Out[23]:
array([datetime.datetime(2000, 1, 1, 0, 0),
       datetime.datetime(2000, 1, 1, 1, 0),
       datetime.datetime(2000, 1, 1, 2, 0), ...,
       datetime.datetime(2114, 1, 29, 13, 0),
       datetime.datetime(2114, 1, 29, 14, 0),
       datetime.datetime(2114, 1, 29, 15, 0)], dtype=object)

In [24]: %time s.to_numpy()
CPU times: user 46 μs, sys: 0 ns, total: 46 μs
Wall time: 51.5 μs
Out[24]:
array(['2000-01-01T00:00:00.000000000', '2000-01-01T01:00:00.000000000',
       '2000-01-01T02:00:00.000000000', ...,
       '2114-01-29T13:00:00.000000000', '2114-01-29T14:00:00.000000000',
       '2114-01-29T15:00:00.000000000'], dtype='datetime64[ns]')
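To illustrate why the gap is so large (a minimal check, not code from plotly itself): `to_numpy()` on a tz-naive datetime Series hands back the underlying `datetime64[ns]` buffer, so no per-element Python objects are ever created:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.date_range("2000", periods=5, freq="h"))

# Stays in the native datetime64[ns] dtype; no stdlib datetime
# objects are materialized.
arr = s.to_numpy()
print(arr.dtype)  # datetime64[ns]
```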
On Oct 28, 2024, @gvwilson added the labels: feature (something new), P2 (considered for next cycle).
@emilykl
Contributor

emilykl commented Nov 12, 2024

I did a bit more digging on this issue to gather more background — @MarcoGorelli @FBruzzesi Hopefully this lines up with what you've found.

As described in #1160, the reported bug in plotly==3.2.0 was that pandas datetime series passed into figures became transformed into integers — for example:

df = pd.DataFrame({
    'date': pd.to_datetime(["2024-01-01 01:00", "2024-01-01 02:00"]),
    'value': [100, 200],
})
fig = go.Figure(
    data=[go.Scatter(x=df['date'], y=df['value'])],
)
print(fig.data[0].x.__repr__())

produced the output

array([1704070800000000000, 1704074400000000000], dtype=object)

and resulted in the following chart:

(screenshot: the resulting chart, plotted with integer x-values)


This bug was a side effect of the changes in #1149 (specifically it seems the addition of v = v.values caused the lower-down line new_v = np.array(v, dtype='object') to convert an array of pandas datetimes to an array with dtype=object whose individual items are integers).
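That failure mode is reproducible without plotly at all (a sketch of the mechanism, not plotly code): `datetime.datetime` only has microsecond resolution, so numpy's object cast of a nanosecond-unit array falls back to raw integer nanoseconds since the epoch:

```python
import numpy as np
import pandas as pd

v = pd.to_datetime(["2024-01-01 01:00", "2024-01-01 02:00"]).values
print(v.dtype)  # datetime64[ns]

# Casting a datetime64[ns] array to dtype=object yields plain
# integers rather than datetime objects, matching the bug report.
as_obj = np.array(v, dtype="object")
print(as_obj)
```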

So, #1163 revised the logic to call to_pydatetime (which returns an ndarray) on datetime series, rather than assigning v = v.values. After that change (first released in plotly==3.2.1), pandas datetime series were converted to ndarrays of python datetimes in fig.data:

array([datetime.datetime(2024, 1, 1, 1, 0), datetime.datetime(2024, 1, 1, 2, 0)], dtype=object)

and the chart was plotted correctly:

(screenshot: the correctly plotted chart)

(Incidentally, the behavior of to_pydatetime() returning an ndarray is deprecated and it returns a pandas Series as of pandas==2.2, so the behavior has already changed from what was intended.)
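For reference, the quoted pattern still normalizes both return types: wrapping the result in `np.array` turns either the old ndarray or the newer Series return value into an object array of stdlib datetimes (a version-agnostic sketch, assuming only that the deprecation may or may not have taken effect):

```python
import warnings
from datetime import datetime

import numpy as np
import pandas as pd

s = pd.Series(pd.to_datetime(["2024-01-01 01:00", "2024-01-01 02:00"]))

with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    # Older pandas returns an ndarray here; newer pandas may return a
    # Series. np.array() flattens either into dtype=object.
    out = np.array(s.dt.to_pydatetime())

print(out.dtype)  # object
print(isinstance(out[0], datetime))  # True
```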

A test was also added in #1163 to verify that when passing a pandas datetime series as input, the corresponding fig.data key consists of python datetime objects.


The latter behavior holds to this day on master for px figures — pandas datetime series passed into a px chart become ndarrays of python datetime objects in fig.data.

Interestingly, it seems to me that pandas datetime series passed into a graph_objects trace on master become ndarrays of np.datetime64 objects:

array(['2024-01-01T01:00:00.000000000', '2024-01-01T02:00:00.000000000'], dtype='datetime64[ns]')

The current implementation of the narwhals adoption follows the latter behavior, but extends it to px as well — pandas datetime series passed into either px or go become ndarrays of np.datetime64 objects in fig.data.

array(['2024-01-01T01:00:00.000000000', '2024-01-01T02:00:00.000000000'], dtype='datetime64[ns]')

The test added in #1163 was also modified to check that the fig.data key consists of np.datetime64 rather than python datetimes.

So the real question is — is that a problem? It seems to me that as long as they serialize to the same JSON as python datetimes, and the resulting charts are identical, it shouldn't be an issue — though of course I may be missing something. And since as far as I can tell, graph_objects figures were already converting to np.datetime64 anyway, it doesn't seem like a huge risk. The JSON serialization we end up with for python datetimes vs. np.datetime64 is... almost... identical:

on master (px):

"x":["2024-01-01T01:00:00","2024-01-01T02:00:00"]

on Narwhals branch (graph_objects and px):

"x":["2024-01-01T01:00:00.000000000","2024-01-01T02:00:00.000000000"]

The only difference I can see is the precision, I am not sure we need nanosecond precision but it also seems silly to remove it if it comes by default. The generated charts are identical.
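The precision difference is visible without plotly at all; it is just how each type stringifies (illustrative only):

```python
from datetime import datetime

import numpy as np

py_dt = datetime(2024, 1, 1, 1, 0)
np_dt = np.datetime64("2024-01-01T01:00:00", "ns")

# A python datetime serializes without fractional seconds when they
# are zero...
print(py_dt.isoformat())  # 2024-01-01T01:00:00
# ...while a datetime64[ns] keeps the nanosecond padding.
print(str(np_dt))  # 2024-01-01T01:00:00.000000000
```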

@MarcoGorelli
Contributor Author

Thanks for taking a careful look 🙏 !

Yup, exactly - some tests which check the type of the returned object needed changing, but the user-facing behaviour seems to be unchanged

I think any benchmark involving time series should show a noticeable performance boost, because it's a lot more performant to go from a pandas Series to a datetime64 numpy array than it is to go to an object array of stdlib datetime objects (e.g. with 100_000 elements, it's 13 thousand times faster):

In [30]: s = pd.Series(pd.date_range('2000', periods=100_000, freq='h'))

In [31]: results = %timeit -o np.array(s.dt.to_pydatetime())
<magic-timeit>:1: FutureWarning: The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result
36 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [32]: results.best
Out[32]: 0.033226083499903324

In [33]: results = %timeit -o s.to_numpy()
2.6 μs ± 133 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [34]: results.best
Out[34]: 2.5290634100019817e-06

In [35]: 0.033226083499903324 / 2.5290634100019817e-06
Out[35]: 13137.702822515346

@alexcjohnson
Collaborator

Seems fine to me. Nanosecond precision is a bit more than we can use - IIRC we do a bit better than native JS Date objects, which stop at milliseconds, but only by about one extra digit (we represent dates internally as floats, so close to 1970 you can probably get more digits, but conversely a few hundred years into the future or past you'll get fewer). But it's probably not worth removing the extra digits; the time saved on network transfer and deserializing will likely be balanced by the overhead during serialization.

Would be cool if we could get this to serialize and deserialize as a typedarray, but that’s a whole other can of worms 😬

@MarcoGorelli
Contributor Author

thanks all - closing as addressed by #4790
