Is `v = np.array(v.dt.to_pydatetime())` still necessary? #4836
Comments
I did a bit more digging on this issue to gather more background — @MarcoGorelli @FBruzzesi hopefully this lines up with what you've found.

As described in #1160, the reported bug was that

```python
df = pd.DataFrame({
    'date': pd.to_datetime(["2024-01-01 01:00", "2024-01-01 02:00"]),
    'value': [100, 200],
})
fig = go.Figure(
    data=[go.Scatter(x=df['date'], y=df['value'])],
)
print(fig.data[0].x.__repr__())
```

produced the output

```
array([1704070800000000000, 1704074400000000000], dtype=object)
```

and the resulting chart was plotted incorrectly. This bug was a side effect of the changes in #1149. So, #1163 revised the logic so that the same snippet produced

```
array([datetime.datetime(2024, 1, 1, 1, 0), datetime.datetime(2024, 1, 1, 2, 0)], dtype=object)
```

and the chart was plotted correctly. A test was also added in #1163 to verify the conversion applied when passing a pandas datetime series as input, and that behavior holds to this day on master.

Interestingly, it seems to me that a pandas datetime series can already come out the other side as

```
array(['2024-01-01T01:00:00.000000000', '2024-01-01T02:00:00.000000000'], dtype='datetime64[ns]')
```

The current implementation of the narwhals adoption follows this latter behavior, and extends it beyond pandas inputs, so the same input now produces

```
array(['2024-01-01T01:00:00.000000000', '2024-01-01T02:00:00.000000000'], dtype='datetime64[ns]')
```

The test added in #1163 was also modified accordingly.

So the real question is — is that a problem? It seems to me that as long as they serialize to the same JSON as python datetimes, and the resulting charts are identical, it shouldn't be an issue — though of course I may be missing something. And as far as I can tell, the serialized JSON on master is

```
"x":["2024-01-01T01:00:00","2024-01-01T02:00:00"]
```

while on the narwhals branch it is

```
"x":["2024-01-01T01:00:00.000000000","2024-01-01T02:00:00.000000000"]
```

The only difference I can see is the precision. I am not sure we need nanosecond precision, but it also seems silly to remove it if it comes by default. The generated charts are identical.
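The serialization equivalence described above can be spot-checked outside plotly with a small sketch (the `trim` helper below is purely for the comparison and is not part of plotly):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.to_datetime(["2024-01-01 01:00", "2024-01-01 02:00"]))

# Old path: an object array of stdlib datetime objects.
as_objects = np.array(s.dt.to_pydatetime())

# New path: the native datetime64[ns] array.
as_datetime64 = s.to_numpy()

iso_old = [d.isoformat() for d in as_objects]        # '2024-01-01T01:00:00', ...
iso_new = list(np.datetime_as_string(as_datetime64)) # '...T01:00:00.000000000', ...

def trim(ts: str) -> str:
    # Drop trailing fractional zeros ('...T01:00:00.000000000' -> '...T01:00:00').
    return ts.rstrip("0").rstrip(".") if "." in ts else ts

# Up to trailing-zero precision, both paths spell the same instants.
assert [trim(t) for t in iso_new] == iso_old
```

So the two representations differ only in the printed nanosecond padding, matching what the JSON diff above shows.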
Thanks for taking a careful look 🙏 ! Yup exactly - some tests which needed to check the type of the return object needed changing, but the user-facing behaviour seems to be unchanged.

I think any benchmark involving time series should show a noticeable performance boost, because it's a lot more performant to go from a pandas Series to a datetime64 numpy array than it is to go to an object array of Timestamp objects (e.g. with 100_000 elements, it's about 13 thousand times faster):

```
In [30]: s = pd.Series(pd.date_range('2000', periods=100_000, freq='h'))

In [31]: results = %timeit -o np.array(s.dt.to_pydatetime())
<magic-timeit>:1: FutureWarning: The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result
36 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [32]: results.best
Out[32]: 0.033226083499903324

In [33]: results = %timeit -o s.to_numpy()
2.6 μs ± 133 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [34]: results.best
Out[34]: 2.5290634100019817e-06

In [35]: 0.033226083499903324 / 2.5290634100019817e-06
Out[35]: 13137.702822515346
```
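The gap comes from where the work happens: `to_numpy()` hands back the Series' underlying `datetime64[ns]` data, while the old path materializes one stdlib datetime object per element. A minimal sketch of the two paths (no timing, just the dtypes involved):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.date_range("2000", periods=1000, freq="h"))

# Fast path: the underlying datetime64[ns] buffer, no per-element work.
fast = s.to_numpy()

# Slow path: one stdlib datetime object built per element.
slow = np.array(s.dt.to_pydatetime())

assert fast.dtype == np.dtype("datetime64[ns]")
assert slow.dtype == object
# Both spell the same instants.
assert (np.array(list(slow), dtype="datetime64[ns]") == fast).all()
```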
Seems fine to me. Nanosecond precision is a bit more than we can use - IIRC we do a bit better than native JS Date objects, which stop at milliseconds, but only by about one extra digit (we represent dates internally as floats, so close to 1970 you can probably get more digits, but conversely a few hundred years into the future or past you'll get fewer). But it's probably not worth removing the extra digits; the time saved on network transfer and deserializing would likely be balanced by the overhead during serialization. Would be cool if we could get this to serialize and deserialize as a typedarray, but that's a whole other can of worms 😬
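The float-representation point above can be illustrated with `math.ulp`: for epoch timestamps stored in milliseconds as float64, the spacing between adjacent representable values grows with distance from 1970 (the dates and thresholds below are illustrative, not plotly internals):

```python
import datetime as dt
import math

def ulp_ns(when: dt.datetime) -> float:
    """Spacing of adjacent float64 values, in nanoseconds, for an epoch-ms timestamp."""
    ms = when.replace(tzinfo=dt.timezone.utc).timestamp() * 1000
    return math.ulp(ms) * 1e6  # 1 ms = 1e6 ns

# Resolution is finest near the epoch and coarsens further out.
near_epoch = ulp_ns(dt.datetime(1971, 1, 1))   # a few nanoseconds
today_ish  = ulp_ns(dt.datetime(2024, 1, 1))   # a few hundred nanoseconds
far_future = ulp_ns(dt.datetime(2500, 1, 1))   # microsecond scale
assert near_epoch < today_ish < far_future
```

Near the present this works out to sub-microsecond spacing, so the trailing nanosecond digits in the serialized strings mostly exceed what a float-based representation can distinguish anyway.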
thanks all - closing as addressed by #4790
As far as I can tell, the lines
plotly.py/packages/python/plotly/_plotly_utils/basevalidators.py
Lines 101 to 108 in 960adb9
were introduced in #1163 to fix issues with displaying numpy datetime64 arrays.
However, have the numpy datetime64 issues since been fixed? Having built Polars from source, here's what I see on the master branch:
Looks like it displays fine
If I apply the diff
then it looks like pandas datetime Series still display fine
Asking in the context of #4790, as `copy_to_readonly_numpy_array` would need to handle other kinds of inputs (not just pandas series / index). A plain conversion to numpy would be a lot faster than going via stdlib datetime objects.
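A hypothetical shape for such a conversion (the name `to_numpy_datetime` is illustrative, not plotly's actual `copy_to_readonly_numpy_array`): dispatch on input type and use the library's plain numpy conversion, falling back to `np.asarray` for everything else:

```python
import numpy as np
import pandas as pd

def to_numpy_datetime(values) -> np.ndarray:
    """Hypothetical sketch: prefer the library's direct numpy conversion
    over a round trip through stdlib datetime objects."""
    if isinstance(values, (pd.Series, pd.Index)):
        return values.to_numpy()  # keeps datetime64[ns], no per-element objects
    return np.asarray(values)

s = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))
out = to_numpy_datetime(s)
assert out.dtype == np.dtype("datetime64[ns]")
```

Other dataframe libraries expose the same plain-conversion entry point (e.g. via narwhals), which is what makes this shape generalize beyond pandas.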