-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix/time series interpolation is wrong 21351 #56515
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 36 commits
ff6d12f
1593af0
db68c2d
dd8b8d3
a04a3a2
efbba10
0294464
537f8bf
901701c
4f78c75
a5bcd45
7d4b4ce
dbae717
05be840
0912249
49a7c4c
a5a7299
6a6fa88
4ebed74
0ee5b8d
b2bc373
6109102
d6af64a
2a86a27
9c90e23
4b2f3dc
d11c162
789c511
4f6d102
c0547b5
d6382f8
4e9a616
c655bf1
e916da9
eaa7e07
649bfa2
4cfbbf1
76794e3
6ad9b26
6555141
48850cc
8eea71c
7f957cf
51e95e0
12bdd90
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -80,6 +80,7 @@ | |
TimedeltaIndex, | ||
timedelta_range, | ||
) | ||
from pandas.core.reshape.concat import concat | ||
|
||
from pandas.tseries.frequencies import ( | ||
is_subperiod, | ||
|
@@ -890,30 +891,59 @@ def interpolate( | |
Freq: 500ms, dtype: float64 | ||
|
||
Internal reindexing with ``asfreq()`` prior to interpolation leads to | ||
an interpolated timeseries on the basis the reindexed timestamps (anchors). | ||
Since not all datapoints from original series become anchors, | ||
it can lead to misleading interpolation results as in the following example: | ||
an interpolated timeseries on the basis of the reindexed timestamps | ||
(anchors). It is assured that all available datapoints from original | ||
series become anchors, so it also works for resampling-cases that lead | ||
to non-aligned timestamps, as in the following example: | ||
|
||
>>> series.resample("400ms").interpolate("linear") | ||
2023-03-01 07:00:00.000 1.0 | ||
2023-03-01 07:00:00.400 1.2 | ||
2023-03-01 07:00:00.800 1.4 | ||
2023-03-01 07:00:01.200 1.6 | ||
2023-03-01 07:00:01.600 1.8 | ||
2023-03-01 07:00:00.400 0.2 | ||
2023-03-01 07:00:00.800 -0.6 | ||
2023-03-01 07:00:01.200 -0.4 | ||
2023-03-01 07:00:01.600 0.8 | ||
2023-03-01 07:00:02.000 2.0 | ||
2023-03-01 07:00:02.400 2.2 | ||
2023-03-01 07:00:02.800 2.4 | ||
2023-03-01 07:00:03.200 2.6 | ||
2023-03-01 07:00:03.600 2.8 | ||
2023-03-01 07:00:02.400 1.6 | ||
2023-03-01 07:00:02.800 1.2 | ||
2023-03-01 07:00:03.200 1.4 | ||
2023-03-01 07:00:03.600 2.2 | ||
2023-03-01 07:00:04.000 3.0 | ||
Freq: 400ms, dtype: float64 | ||
|
||
Note that the series erroneously increases between two anchors | ||
Note that the series correctly decreases between two anchors | ||
``07:00:00`` and ``07:00:02``. | ||
""" | ||
assert downcast is lib.no_default # just checking coverage | ||
result = self._upsample("asfreq") | ||
return result.interpolate( | ||
|
||
# If the original data has timestamps which are not aligned with the | ||
# target timestamps, we need to add those points back to the data frame | ||
# that is supposed to be interpolated. This does not work with | ||
# PeriodIndex, so we skip this case. GH#21351 | ||
obj = self._selected_obj | ||
is_period_index = isinstance(obj.index, PeriodIndex) | ||
|
||
# Skip this step for PeriodIndex | ||
if not is_period_index: | ||
final_index = result.index | ||
if isinstance(final_index, MultiIndex): | ||
raise NotImplementedError( | ||
"Direct interpolation of MultiIndex data frames is not " | ||
"supported. If you tried to resample and interpolate on a " | ||
"grouped data frame, please use:\n" | ||
"`df.groupby(...).apply(lambda x: x.resample(...)." | ||
"interpolate(...), include_groups=False)`" | ||
"\ninstead, as resampling and interpolation has to be " | ||
"performed for each group independently." | ||
) | ||
|
||
missing_data_points_index = obj.index.difference(final_index) | ||
if len(missing_data_points_index) > 0: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. in the opposite case where the difference is empty, i think a the sort_index and There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jbrockmendel Can you please clarify? "in the opposite case where the difference is empty" --> then |
||
result = concat( | ||
[result, obj.loc[missing_data_points_index]] | ||
).sort_index() | ||
|
||
result_interpolated = result.interpolate( | ||
method=method, | ||
axis=axis, | ||
limit=limit, | ||
|
@@ -924,6 +954,18 @@ def interpolate( | |
**kwargs, | ||
) | ||
|
||
# No further steps if the original data has a PeriodIndex | ||
if is_period_index: | ||
return result_interpolated | ||
|
||
# Make sure that original data points which do not align with the | ||
# resampled index are removed | ||
result_interpolated = result_interpolated.loc[final_index] | ||
|
||
# Make sure frequency indexes are preserved | ||
result_interpolated.index = final_index | ||
return result_interpolated | ||
|
||
@final | ||
def asfreq(self, fill_value=None): | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this change was incorrect. Reverting this behavior change, it appears to me the rest of this PR can remain.