-
-
Notifications
You must be signed in to change notification settings - Fork 74
🐛 parse object arrays for hf_x #116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #116 +/- ##
==========================================
+ Coverage 97.46% 97.49% +0.03%
==========================================
Files 11 11
Lines 867 878 +11
==========================================
+ Hits 845 856 +11
Misses 22 22
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Have you found any time to look into my remarks @jvdd? |
LGTM! Also tested the basic example notebook and the |
) | ||
# Check and update timezones of the hf_x data when there are multiple | ||
# timezones in the data | ||
if hf_x.dtype == "object": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this fix! I have a couple of suggestions that I will outline below. Don't hesitate to ignore them if you don't feel like it's the right thing!
- So considering this first condition
dtype== 'object'
that you test, this will be the case, for example, when there are mixed TZs in the input data. But "object" could correspond to other types as well (anything pandas do not recognize), so I would replace it instead with
dtype == 'object' and isinstance(hf_x[0], (pd.Timestamp, datetime))
,
which for me defines specifically the case of Timestamps with mixed TZs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are completely right! Actually the kernel crash you reported in #115 is because we missed the datetime.datetime
data
UserWarning, | ||
) | ||
hf_x = np.asarray( | ||
list(map(lambda x: x.replace(tzinfo=None), hf_x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- When the TZs are mixed instead of dropping them altogether, I suggest you convert them to UTC (there is no daylight-saving time in UTC, so it always works). Also, in pandas
map(replace,..)
is slow because each individual datetime is parsed and then pandas reparse them. Consider using the.dt
accessor instead (it should be optimized internally). For this particular line/operation, you would writepd.to_datetime(hf_x, utc=True)
. If those were indeed all various datetimes, it will convert them correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe converting hf_x
to UTC changes the visualization. As plotly can cope with visualizing multiple time-zones in the same x-index, I would rather support the same functionality.. What do you think @Alexander-Serov? Perhaps @jonasvdd can also weigh in on this :)
Example of effect of converting to UTC:
The code:
y = np.arange(20)
index1 = pd.date_range('2018-01-01', periods=10, freq='H', tz="US/Eastern")
index2 = pd.date_range('2018-01-02', periods=10, freq='H', tz="Asia/Dubai")
index = index1.append(index2)
fig = go.Figure(make_subplots(rows=2, cols=1, shared_xaxes=True))
fig.add_trace(go.Scattergl(x=index, y=y, name="Vanilla plotly", mode="markers+lines"), row=1, col=1)
fig.add_trace(go.Scattergl(x=pd.to_datetime(index, utc=True), y=y, name="UTC convert", mode="markers+lines"), row=2, col=1)
fig.show()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I tend to agree with Jeroen that plotly-resampler's behavior should replicate plotly's current behavior as much as possible. This PR's current implementation (not optimized, but is not 100% severe, as we only perform this within the add_trace
function) complies with the plotly behavior (+ outputting a warning). So I would say that the current way it is implemented is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
plotly_resampler/figure_resampler/figure_resampler_interface.py
Outdated
Show resolved
Hide resolved
@@ -686,6 +688,38 @@ def _parse_get_trace_props( | |||
if isinstance(hf_hovertext, np.ndarray): | |||
hf_hovertext = hf_hovertext[not_nan_mask] | |||
|
|||
# Try to parse the hf_x data if it non datetime-like values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: ... if it non datetime...
--> if it consists of non datetime-like
items and has object
/str
as global dtype
except ValueError: | ||
try: | ||
# Try to parse to datetime | ||
hf_x = np.asarray([pd.Timestamp(x) for x in hf_x]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we keep this implementation, we should add a note why we perform this inline for loop with pd.Timestamp
) | ||
# Check and update timezones of the hf_x data when there are multiple | ||
# timezones in the data | ||
if len(hf_x) and hf_x.dtype == "object" and isinstance(hf_x[0], (pd.Timestamp, datetime.datetime)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is hit when the if-statement code above (see snippet ⬇️) is hit with for example timestamp-strings of multiple timezone offsets
# Try to parse to datetime
hf_x = np.asarray([pd.Timestamp(x) for x in hf_x])
UserWarning, | ||
) | ||
|
||
hf_x = [x.replace(tzinfo=None) for x in hf_x] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this indeed replicates the default plotly-behavior, very nice!
list(map(lambda x: x.replace(tzinfo=None), hf_x)) | ||
) | ||
|
||
hf_x = [x.replace(tzinfo=None) for x in hf_x] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if this cannot be performed with a pandas (optimized) function. Something like x.tz_replace(None)
, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.tz_localize.html
plotly_resampler/figure_resampler/figure_resampler_interface.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Squash merged this into main! Do we create a new release? |
This PR does the following;
hf_x
to numeric or datetime ifhf_x
is an array of dtype object.-> object arrays were not handled as before + we've now add plotly-like support for multiple timezones in same x-arrray
register_plotly_resampler
) + datetime index working really slow #120-> issue stemmed from object array of datetimes being used, parsing this with
pd.to_datetime
solves this❗ When there are multiple time-zones in the same
x
array, a ValueError will be thrown.This is NOT in line with plotly.py its behavior (as plotly.py allows to create plots with multiple time-zones in the same x-array) - but we believe that this behavior makes sense for plotly-resampler as