Skip to content

🐛 parse object arrays for hf_x #116

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Dec 2, 2022
Merged

🐛 parse object arrays for hf_x #116

merged 16 commits into from
Dec 2, 2022

Conversation

jvdd
Copy link
Member

@jvdd jvdd commented Sep 7, 2022

This PR does the following;


❗ When there are multiple time-zones in the same x array, a ValueError will be thrown.
This is NOT in line with plotly.py its behavior (as plotly.py allows to create plots with multiple time-zones in the same x-array) - but we believe that this behavior makes sense for plotly-resampler as

  • in most of the times this use-case occurs because there is no timezone in the data and there are 2 fixed_offsets (due to DST) -> the error notifies the user to convert the data to a single timezone data array
  • if it is really the 1% of the cases where the user wants to plot multiple time zones in the same x-array, it is still possible to plot (with a legend_group) for each distinct time zone
  • parsing multiple time-zones in the same x-array is really slow 🐌 + more code to maintain 🥲

@codecov-commenter
Copy link

codecov-commenter commented Sep 7, 2022

Codecov Report

Merging #116 (5efd2e3) into main (71b4efe) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #116      +/-   ##
==========================================
+ Coverage   97.46%   97.49%   +0.03%     
==========================================
  Files          11       11              
  Lines         867      878      +11     
==========================================
+ Hits          845      856      +11     
  Misses         22       22              
Impacted Files Coverage Δ
plotly_resampler/aggregation/__init__.py 100.00% <100.00%> (ø)
...ler/figure_resampler/figure_resampler_interface.py 99.73% <100.00%> (+<0.01%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@jvdd jvdd requested a review from jonasvdd September 9, 2022 09:05
@jonasvdd
Copy link
Member

Have you found any time to look into my remarks @jvdd?

@jonasvdd
Copy link
Member

LGTM!

Also tested the basic example notebook and the pandas plotting backend datetime issue is not yet resolved 😢

)
# Check and update timezones of the hf_x data when there are multiple
# timezones in the data
if hf_x.dtype == "object":

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix! I have a couple of suggestions that I will outline below. Don't hesitate to ignore them if you don't feel like it's the right thing!

  1. So considering this first condition dtype== 'object' that you test, this will be the case, for example, when there are mixed TZs in the input data. But "object" could correspond to other types as well (anything pandas do not recognize), so I would replace it instead with
    dtype == 'object' and isinstance(hf_x[0], (pd.Timestamp, datetime)),
    which for me defines specifically the case of Timestamps with mixed TZs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are completely right! Actually the kernel crash you reported in #115 is because we missed the datetime.datetime data

UserWarning,
)
hf_x = np.asarray(
list(map(lambda x: x.replace(tzinfo=None), hf_x))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. When the TZs are mixed instead of dropping them altogether, I suggest you convert them to UTC (there is no daylight-saving time in UTC, so it always works). Also, in pandas map(replace,..) is slow because each individual datetime is parsed and then pandas reparse them. Consider using the .dt accessor instead (it should be optimized internally). For this particular line/operation, you would write pd.to_datetime(hf_x, utc=True). If those were indeed all various datetimes, it will convert them correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe converting hf_x to UTC changes the visualization. As plotly can cope with visualizing multiple time-zones in the same x-index, I would rather support the same functionality.. What do you think @Alexander-Serov? Perhaps @jonasvdd can also weigh in on this :)

Example of effect of converting to UTC:
image

The code:

y = np.arange(20)
index1 = pd.date_range('2018-01-01', periods=10, freq='H', tz="US/Eastern")
index2 = pd.date_range('2018-01-02', periods=10, freq='H', tz="Asia/Dubai")
index = index1.append(index2)

fig = go.Figure(make_subplots(rows=2, cols=1, shared_xaxes=True))
fig.add_trace(go.Scattergl(x=index, y=y, name="Vanilla plotly", mode="markers+lines"), row=1, col=1)
fig.add_trace(go.Scattergl(x=pd.to_datetime(index, utc=True), y=y, name="UTC convert", mode="markers+lines"), row=2, col=1)
fig.show()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I tend to agree with Jeroen that plotly-resampler's behavior should replicate plotly's current behavior as much as possible. This PR's current implementation (not optimized, but is not 100% severe, as we only perform this within the add_trace function) complies with the plotly behavior (+ outputting a warning). So I would say that the current way it is implemented is fine.

Copy link
Member

@jonasvdd jonasvdd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -686,6 +688,38 @@ def _parse_get_trace_props(
if isinstance(hf_hovertext, np.ndarray):
hf_hovertext = hf_hovertext[not_nan_mask]

# Try to parse the hf_x data if it non datetime-like values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: ... if it non datetime...

--> if it consists of non datetime-like items and has object/str as global dtype

except ValueError:
try:
# Try to parse to datetime
hf_x = np.asarray([pd.Timestamp(x) for x in hf_x])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we keep this implementation, we should add a note why we perform this inline for loop with pd.Timestamp

)
# Check and update timezones of the hf_x data when there are multiple
# timezones in the data
if len(hf_x) and hf_x.dtype == "object" and isinstance(hf_x[0], (pd.Timestamp, datetime.datetime)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is hit when the if-statement code above (see snippet ⬇️) is hit with for example timestamp-strings of multiple timezone offsets

# Try to parse to datetime
hf_x = np.asarray([pd.Timestamp(x) for x in hf_x])

UserWarning,
)

hf_x = [x.replace(tzinfo=None) for x in hf_x]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this indeed replicates the default plotly-behavior, very nice!

list(map(lambda x: x.replace(tzinfo=None), hf_x))
)

hf_x = [x.replace(tzinfo=None) for x in hf_x]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if this cannot be performed with a pandas (optimized) function. Something like x.tz_replace(None), see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.tz_localize.html

Copy link
Member

@jonasvdd jonasvdd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jvdd jvdd merged commit 1b8ee36 into main Dec 2, 2022
@jvdd
Copy link
Member Author

jvdd commented Dec 2, 2022

Squash merged this into main! Do we create a new release?

@jvdd jvdd deleted the hf_x_object branch December 12, 2022 07:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants