🐛 parse object arrays for hf_x #116

Merged

jvdd merged 16 commits into main from hf_x_object

Dec 2, 2022

Member

jvdd commented Sep 7, 2022 •


This PR does the following:

Parse hf_x to numeric or datetime if hf_x is an array of dtype object.
- should fix "After upgrading from 0.3 to 0.8.1, one of my notebook cells with resampler runs indefinitely" #115
  -> object arrays were not handled as before + we have now added plotly-like support for multiple time zones in the same x-array
- should fix "pandas plotting back-end = "plotly" (and register_plotly_resampler) + datetime index working really slow" #120
  -> the issue stemmed from an object array of datetimes being used; parsing this with pd.to_datetime solves it
Update poetry install in CI-CD (see 🙏 #117)
Extend tests & validate whether this is in line with plotly behavior
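
The parsing order described above (numeric first, datetime as a fallback) can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `parse_object_array` is a hypothetical helper name, and only `pd.to_numeric` and `pd.to_datetime` are real pandas calls.

```python
import datetime

import numpy as np
import pandas as pd


def parse_object_array(hf_x: np.ndarray) -> np.ndarray:
    """Hypothetical helper: parse an object array to numeric, else to datetime."""
    if hf_x.dtype != object:
        # Already a concrete dtype (int, float, datetime64, ...): nothing to parse
        return hf_x
    try:
        # First attempt: numbers stored as Python objects or numeric strings
        return pd.to_numeric(hf_x, errors="raise")
    except (ValueError, TypeError):
        # Fallback: datetime.datetime / pd.Timestamp objects or timestamp strings
        return pd.to_datetime(hf_x).to_numpy()


numbers = parse_object_array(np.array([1, 2.5, 3], dtype=object))
dates = parse_object_array(
    np.array(
        [datetime.datetime(2022, 9, 7), datetime.datetime(2022, 9, 8)],
        dtype=object,
    )
)
print(numbers.dtype, dates.dtype)
```

The sketch assumes all datetimes share one time zone (or have none); mixed time zones need separate handling.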

❗ When there are multiple time zones in the same x-array, a ValueError is raised.
This is NOT in line with plotly.py's behavior (plotly.py allows creating plots with multiple time zones in the same x-array), but we believe this behavior makes sense for plotly-resampler because:

- most of the time this use case occurs because the data has no time zone but contains 2 fixed offsets (due to DST) -> the error tells the user to convert the data to a single-time-zone array
- in the ~1% of cases where the user really wants to plot multiple time zones in the same x-array, it is still possible to plot each distinct time zone separately (with a legend_group)
- parsing multiple time zones in the same x-array is really slow 🐌 + more code to maintain 🥲
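
To make the DST case above concrete, here is a small sketch of how two fixed offsets in one x-array could be detected and rejected. `assert_single_timezone` is a hypothetical name, and keying on `str(tzinfo)` is an assumption chosen for illustration; it is not the check implemented in this PR.

```python
import datetime


def assert_single_timezone(hf_x) -> None:
    """Hypothetical check: raise ValueError if hf_x mixes time zones / UTC offsets."""
    # str(tzinfo) serves as a simple comparison key; naive datetimes map to "None"
    tz_keys = {str(x.tzinfo) for x in hf_x}
    if len(tz_keys) > 1:
        raise ValueError(
            "The x-data contains multiple time zones: " + ", ".join(sorted(tz_keys))
        )


# Two fixed offsets, as typically produced around a DST switch (+01:00 -> +02:00)
winter = datetime.timezone(datetime.timedelta(hours=1))
summer = datetime.timezone(datetime.timedelta(hours=2))
mixed = [
    datetime.datetime(2022, 3, 27, 1, 30, tzinfo=winter),
    datetime.datetime(2022, 3, 27, 3, 30, tzinfo=summer),
]

try:
    assert_single_timezone(mixed)
except ValueError as err:
    print(err)

# Converting everything to one time zone (here UTC) passes the check
assert_single_timezone([x.astimezone(datetime.timezone.utc) for x in mixed])
```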

jvdd added 2 commits

September 7, 2022 11:57


          🐛 parse object arrays for datetime or numeric dtypes, fixes #115

ca96471


          🧹

62f15f4

jvdd mentioned this pull request

After upgrading from 0.3 to 0.8.1, one of my notebook cells with resampler runs indefinitely #115

Closed


          🙏 update poetry install in ci-cd testing (#117)

836f5e8

* 🙏

* 🙈

* 🙏

codecov-commenter commented Sep 7, 2022 •


Codecov Report

Merging #116 (5efd2e3) into main (71b4efe) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #116      +/-   ##
==========================================
+ Coverage   97.46%   97.49%   +0.03%     
==========================================
  Files          11       11              
  Lines         867      878      +11     
==========================================
+ Hits          845      856      +11     
  Misses         22       22

Impacted Files	Coverage Δ
plotly_resampler/aggregation/__init__.py	`100.00% <100.00%> (ø)`
...ler/figure_resampler/figure_resampler_interface.py	`99.73% <100.00%> (+<0.01%)`	⬆️


jvdd added 3 commits

September 8, 2022 15:59


          ♻️ improve object array parsing for multiple tzs

2a65fbd


          ♻️ could be optimized

bfa54a6


          🧹 optimized + extend testing

c0cd89e

jvdd requested a review from jonasvdd

September 9, 2022 09:05


          📝 make exceptions more explicit

448e093

Member

jonasvdd commented Sep 12, 2022

Have you found any time to look into my remarks @jvdd?

jonasvdd reviewed


tests/test_figure_resampler.py Outdated

tests/test_figure_resampler.py Outdated

tests/test_figure_resampler.py Outdated

tests/test_figurewidget_resampler.py


          🧹 cleanup tests

56dbe62

Member

jonasvdd commented Sep 20, 2022

LGTM!

Also tested the basic example notebook and the pandas plotting backend datetime issue is not yet resolved 😢

Alexander-Serov reviewed


plotly_resampler/figure_resampler/figure_resampler_interface.py Outdated

+                                          )
+                          # Check and update timezones of the hf_x data when there are multiple
+                          # timezones in the data
+                          if hf_x.dtype == "object":

Alexander-Serov Sep 26, 2022

Thanks for this fix! I have a couple of suggestions that I will outline below. Don't hesitate to ignore them if you don't feel like it's the right thing!

So considering the first condition dtype == 'object' that you test: this will be the case, for example, when there are mixed TZs in the input data. But "object" could correspond to other types as well (anything pandas does not recognize), so I would replace it with
dtype == 'object' and isinstance(hf_x[0], (pd.Timestamp, datetime)),
which for me specifically defines the case of Timestamps with mixed TZs.
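
A minimal sketch of the suggested condition, wrapped in a hypothetical helper (`is_object_datetime_array` is not a plotly-resampler function). It also guards against empty arrays, since `hf_x[0]` would otherwise raise an IndexError.

```python
import datetime

import numpy as np
import pandas as pd


def is_object_datetime_array(hf_x: np.ndarray) -> bool:
    """Hypothetical helper: True for non-empty object arrays holding datetimes."""
    return bool(
        len(hf_x) > 0
        and hf_x.dtype == object
        # pd.Timestamp subclasses datetime.datetime; both are listed for clarity
        and isinstance(hf_x[0], (pd.Timestamp, datetime.datetime))
    )


mixed_tz = np.array(
    [
        pd.Timestamp("2022-01-01T00:00:00+01:00"),
        pd.Timestamp("2022-06-01T00:00:00+02:00"),
    ],
    dtype=object,
)
print(is_object_datetime_array(mixed_tz))
print(is_object_datetime_array(np.array(["a", "b"], dtype=object)))
```

Note that only the first element is inspected, so an array that starts with a datetime but contains other objects further on would still pass.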

Member Author

jvdd Sep 27, 2022

You are completely right! Actually the kernel crash you reported in #115 is because we missed the datetime.datetime data

plotly_resampler/figure_resampler/figure_resampler_interface.py Outdated

+                                          UserWarning,
+                                      )
+                                      hf_x = np.asarray(
+                                          list(map(lambda x: x.replace(tzinfo=None), hf_x))

Alexander-Serov Sep 26, 2022

When the TZs are mixed, instead of dropping them altogether, I suggest you convert them to UTC (there is no daylight-saving time in UTC, so it always works). Also, in pandas map(replace, ...) is slow because each individual datetime is parsed and then pandas reparses them. Consider using the .dt accessor instead (it should be optimized internally). For this particular line/operation, you would write pd.to_datetime(hf_x, utc=True). If those are indeed all datetimes, it will convert them correctly.
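
The two options discussed here give different x-values, which the following sketch tries to illustrate on a tiny array with two fixed offsets (the sample data is made up). Dropping `tzinfo` keeps the local wall-clock times, whereas `pd.to_datetime(..., utc=True)` shifts every value to UTC.

```python
import datetime

import numpy as np
import pandas as pd

plus_1h = datetime.timezone(datetime.timedelta(hours=1))
plus_2h = datetime.timezone(datetime.timedelta(hours=2))
hf_x = np.array(
    [
        datetime.datetime(2022, 3, 27, 1, 30, tzinfo=plus_1h),
        datetime.datetime(2022, 3, 27, 3, 30, tzinfo=plus_2h),
    ],
    dtype=object,
)

# Option 1: drop the time zone element by element (keeps wall-clock times)
dropped = pd.to_datetime([x.replace(tzinfo=None) for x in hf_x])

# Option 2: vectorized conversion to UTC (changes the displayed clock times)
as_utc = pd.to_datetime(hf_x, utc=True)

print(dropped[1] - dropped[0])  # gap between the wall-clock times
print(as_utc[1] - as_utc[0])    # gap between the actual instants
```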

Member Author

jvdd Sep 27, 2022

I believe converting hf_x to UTC changes the visualization. As plotly can cope with visualizing multiple time zones in the same x-index, I would rather support the same functionality. What do you think @Alexander-Serov? Perhaps @jonasvdd can also weigh in on this :)

Example of effect of converting to UTC:

The code:

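import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

y = np.arange(20)
# Two hourly ranges in different time zones, combined into a single x-index
index1 = pd.date_range('2018-01-01', periods=10, freq='H', tz="US/Eastern")
index2 = pd.date_range('2018-01-02', periods=10, freq='H', tz="Asia/Dubai")
index = index1.append(index2)

# Row 1: x passed as-is (vanilla plotly); row 2: x converted to UTC first
fig = go.Figure(make_subplots(rows=2, cols=1, shared_xaxes=True))
fig.add_trace(go.Scattergl(x=index, y=y, name="Vanilla plotly", mode="markers+lines"), row=1, col=1)
fig.add_trace(go.Scattergl(x=pd.to_datetime(index, utc=True), y=y, name="UTC convert", mode="markers+lines"), row=2, col=1)
fig.show()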

Member

jonasvdd Sep 28, 2022

Hi, I tend to agree with Jeroen that plotly-resampler's behavior should replicate plotly's current behavior as much as possible. This PR's current implementation complies with the plotly behavior (+ outputs a warning). It is not optimized, but that is not severe, as we only perform this within the add_trace function. So I would say that the current implementation is fine.

plotly_resampler/figure_resampler/figure_resampler_interface.py

tests/test_figure_resampler.py Outdated

jvdd added 2 commits

September 27, 2022 16:09


          🙈 also parse datetime.datetime

74c31d4


          🐌 fix slow pandas backend bug + 🧹

0b22ed4

jonasvdd reviewed


Member

jonasvdd left a comment

LGTM

plotly_resampler/figure_resampler/figure_resampler_interface.py Outdated

plotly_resampler/figure_resampler/figure_resampler_interface.py Outdated

@@ -686,6 +688,38 @@ def _parse_get_trace_props(
                               if isinstance(hf_hovertext, np.ndarray):
                                   hf_hovertext = hf_hovertext[not_nan_mask]
+                          # Try to parse the hf_x data if it non datetime-like values

Member

jonasvdd Sep 28, 2022

typo: ... if it non datetime...

--> if it consists of non datetime-like items and has object/str as global dtype

plotly_resampler/figure_resampler/figure_resampler_interface.py Outdated

+                                  except ValueError:
+                                      try:
+                                          # Try to parse to datetime
+                                          hf_x = np.asarray([pd.Timestamp(x) for x in hf_x])

Member

jonasvdd Sep 28, 2022

if we keep this implementation, we should add a note why we perform this inline for loop with pd.Timestamp

plotly_resampler/figure_resampler/figure_resampler_interface.py Outdated

+                                          )
+                          # Check and update timezones of the hf_x data when there are multiple
+                          # timezones in the data
+                          if len(hf_x) and hf_x.dtype == "object" and isinstance(hf_x[0], (pd.Timestamp, datetime.datetime)):

Member

jonasvdd Sep 28, 2022

I assume this is hit when the if-statement code above (see snippet ⬇️) is hit with for example timestamp-strings of multiple timezone offsets

# Try to parse to datetime
hf_x = np.asarray([pd.Timestamp(x) for x in hf_x])

plotly_resampler/figure_resampler/figure_resampler_interface.py Outdated

+                                      UserWarning,
+                                  )
+                                  hf_x = [x.replace(tzinfo=None) for x in hf_x]

Member

jonasvdd Sep 28, 2022

this indeed replicates the default plotly-behavior, very nice!

Alexander-Serov reviewed


plotly_resampler/figure_resampler/figure_resampler_interface.py Outdated

-                                      list(map(lambda x: x.replace(tzinfo=None), hf_x))
-                                  )
+                                  hf_x = [x.replace(tzinfo=None) for x in hf_x]

Alexander-Serov Sep 29, 2022

I was wondering whether this could be performed with an (optimized) pandas function. Something like x.tz_localize(None), see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.tz_localize.html
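
For reference, a short sketch of what the linked pandas method does on a tz-aware `DatetimeIndex` (the sample index is made up). It assumes the data already has a single time zone; an object array with mixed time zones would first have to be parsed into such an index.

```python
import pandas as pd

idx = pd.date_range(
    "2018-01-01", periods=3, freq=pd.Timedelta(hours=1), tz="America/New_York"
)

# tz_localize(None) drops the time zone but keeps the local wall-clock times,
# i.e. the vectorized counterpart of [x.replace(tzinfo=None) for x in idx]
wall_clock = idx.tz_localize(None)

# tz_convert(None) converts to UTC first and then drops the time zone
utc_naive = idx.tz_convert(None)

print(wall_clock[0], utc_naive[0])
```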

jvdd and others added 3 commits

December 1, 2022 15:16


          ♻️ raise ValueError when multiple time zones in hf_x

a4ad256


          Merge branch 'main' into hf_x_object

ccc788b


          🧹

dc2393f

jvdd commented


plotly_resampler/figure_resampler/figure_resampler_interface.py Outdated


          🖊️ review

8e5ddf5

jonasvdd approved these changes


Member

jonasvdd left a comment

LGTM

jonasvdd and others added 2 commits

December 2, 2022 11:00


          🖊️ review

31ad52e


          Merge branch 'main' into hf_x_object

5efd2e3

jvdd merged commit 1b8ee36 into main

Member Author

jvdd commented Dec 2, 2022

Squash merged this into main! Do we create a new release?

jvdd deleted the hf_x_object branch

December 12, 2022 07:57

jonasvdd mentioned this pull request

"Segmentation fault (core dumped)" when using string timestamp for x-axis #153

Closed

jvdd mentioned this pull request

pandas plotting back-end = "plotly" (and register_plotly_resampler) + datetime index working really slow #120

Closed

jvdd mentioned this pull request

Windows: Could not import lttbc #180

Closed
