Add time-length windowing capability to moving statistics #936

Closed
wesm opened this issue Mar 18, 2012 · 14 comments
Labels
Datetime Datetime data dtype Enhancement
Milestone

Comments

@wesm
Member

wesm commented Mar 18, 2012

No description provided.

@BrenBarn

If I understand this right, it's similar to what I'm asking here: http://stackoverflow.com/questions/14300768/pandas-rolling-computation-with-window-based-on-values-instead-of-counts . I think the facility should not be time-specific. You should be able to use windows of any value range on any sort of values, not just time values.
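A minimal sketch of what a value-based (rather than count-based) window could look like, assuming a numeric index sorted in ascending order. `rolling_by_value` is a hypothetical helper written for illustration, not a pandas function; it uses a trailing window `(x - width, x]` located with `numpy.searchsorted`.

```python
import numpy as np
import pandas as pd

def rolling_by_value(s, width, func=np.mean):
    """Apply func over a trailing window (x - width, x] of index values.

    Hypothetical helper, not part of pandas. Assumes s.index is numeric
    and sorted in ascending order.
    """
    x = s.index.to_numpy()
    vals = s.to_numpy()
    # First position whose index value is strictly greater than x - width.
    starts = np.searchsorted(x, x - width, side="right")
    # One past the last position whose index value is <= x.
    ends = np.searchsorted(x, x, side="right")
    out = [func(vals[a:b]) for a, b in zip(starts, ends)]
    return pd.Series(out, index=s.index)

# Irregularly spaced index values: two close together, a gap, two more.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=[0.0, 0.5, 3.0, 3.2])
result = rolling_by_value(s, width=1.0)
print(result)
```

Nothing here is specific to time: the same idea applies to any sortable index where subtraction is defined.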

@invisibleroads
Contributor

Thanks for sharing!

@ghost

ghost commented Mar 22, 2013

@jreback - cookbook.

@jreback
Contributor

jreback commented Sep 21, 2013

in cookbook...closing

@jreback jreback closed this as completed Sep 21, 2013
@SmartLayer

Since rolling_mean lives in stats.moments, it is natural to assume that it weights by time; otherwise it would belong in stats instead. However, I found that it is not time-weighted.

>>> ts = Series(randn(15), index=date_range('1/1/2000', periods=15))
>>> ts
2000-01-01   -0.195255
2000-01-02    0.920142
2000-01-03    1.498506
2000-01-04   -0.923250
2000-01-05   -0.775110
2000-01-06    1.533274
2000-01-07    1.455366
2000-01-08   -1.738300
2000-01-09    0.102575
2000-01-10   -1.767898
2000-01-11    1.890013
2000-01-12   -1.106158
2000-01-13    0.457826
2000-01-14   -0.951881
2000-01-15   -1.738844
Freq: D, dtype: float64
>>> ts2 = Series(ts, index=date_range('1/1/2000', periods=10).union(date_range('1/20/2000', periods=5)))
>>> ts2
2000-01-01   -0.195255
2000-01-02    0.920142
2000-01-03    1.498506
2000-01-04   -0.923250
2000-01-05   -0.775110
2000-01-06    1.533274
2000-01-07    1.455366
2000-01-08   -1.738300
2000-01-09    0.102575
2000-01-10   -1.767898
2000-01-20         NaN
2000-01-21         NaN
2000-01-22         NaN
2000-01-23         NaN
2000-01-24         NaN
dtype: float64
>>> ts2[-5:] = [1.890013,-1.106158,0.457826,-0.951881,-1.738844]
>>> # ts and ts2 are now identical in value but different in index
>>> ts2
2000-01-01   -0.195255
2000-01-02    0.920142
2000-01-03    1.498506
2000-01-04   -0.923250
2000-01-05   -0.775110
2000-01-06    1.533274
2000-01-07    1.455366
2000-01-08   -1.738300
2000-01-09    0.102575
2000-01-10   -1.767898
2000-01-20    1.890013
2000-01-21   -1.106158
2000-01-22    0.457826
2000-01-23   -0.951881
2000-01-24   -1.738844
dtype: float64
>>> TS = ts.cumsum()
>>> TS2 = ts2.cumsum()
>>> # you are about to find that rolling_mean(TS, 1) and rolling_mean(TS2, 1) produce the same result!
>>> rolling_mean(TS, 1)
2000-01-01   -0.195255
2000-01-02    0.724887
2000-01-03    2.223392
2000-01-04    1.300143
2000-01-05    0.525033
2000-01-06    2.058307
2000-01-07    3.513673
2000-01-08    1.775373
2000-01-09    1.877948
2000-01-10    0.110050
2000-01-11    2.000063
2000-01-12    0.893905
2000-01-13    1.351731
2000-01-14    0.399849
2000-01-15   -1.338994
Freq: D, dtype: float64
>>> rolling_mean(TS2, 1)
2000-01-01   -0.195255
2000-01-02    0.724887
2000-01-03    2.223392
2000-01-04    1.300143
2000-01-05    0.525033
2000-01-06    2.058307
2000-01-07    3.513673
2000-01-08    1.775373
2000-01-09    1.877948
2000-01-10    0.110050
2000-01-20    2.000063
2000-01-21    0.893905
2000-01-22    1.351731
2000-01-23    0.399850
2000-01-24   -1.338994
dtype: float64

The above experiment demonstrates that rolling_mean disregards the 10-day gap in TS2 and calculates as if the data were evenly sampled, that is, as if it were not a time series. I am afraid users do not expect this behaviour.

I believe this problem is integral to the feature requested here. If time were properly weighted in the calculation, there would be no reason why the window could not be specified as a time span. Implementing the requested feature would also fix this unwanted behaviour.

@SmartLayer

in cookbook...closing --> In which chapter of the cookbook? I looked and did not find it. The cookbook is not searchable, and a Google search with rolling_mean as the keyword only yields results outside of the cookbook, assuming the cookbook you mean is this one: http://pandas.pydata.org/pandas-docs/stable/cookbook.html

@jreback
Contributor

jreback commented Jan 15, 2014

@SmartLayer

Oh I see, what was added to the cookbook is a link from the cookbook to Stack Overflow; that's why "a Google search with rolling_mean as the keyword only yields results outside of the cookbook".

@jreback
Contributor

jreback commented Jan 15, 2014

you can search from the docs, the API box

@jreback
Contributor

jreback commented Jan 15, 2014

the cookbook is really just a collection of interesting links

@ghost

ghost commented Jan 15, 2014

@jreback, the cookbook entry is related, but does it truly close this issue? Can't say.

@jreback jreback reopened this Jan 15, 2014
@jorisvandenbossche
Member

I also think the cookbook entry is not the real solution to this issue, although you can in principle solve it that way (just not trivially for users, I think).
As I understand the original issue, the idea is that you could do something like this:

pd.rolling_mean(ts, window='30min')

When you have, e.g., a regular time series with a 5 min frequency, this would be the same as pd.rolling_mean(ts, 6), but 1) more convenient and 2) also applicable to irregular time series.
I think this would be a very valuable addition.
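For readers on a later pandas: as far as I can tell, releases from 0.19 onward accept an offset string as the window in `Series.rolling`, which covers this request for datetime-like indexes. The sketch below uses made-up data and that later API, not the `pd.rolling_mean(ts, window='30min')` spelling proposed above. It compares a `'30min'` window against a 6-sample window on a regular 5 min series, then applies the same window to an irregular subset.

```python
import numpy as np
import pandas as pd

# Regular series sampled every 5 minutes (illustrative data).
idx = pd.date_range("2000-01-01", periods=12, freq="5min")
ts = pd.Series(np.arange(12, dtype=float), index=idx)

# Time-based window: every observation in (t - 30min, t].
by_time = ts.rolling("30min").mean()
# Count-based equivalent on this grid; min_periods=1 matches the
# partial windows that the offset-based version emits at the start.
by_count = ts.rolling(6, min_periods=1).mean()
same = bool(np.allclose(by_time, by_count))
print(same)

# The same time-based window also works when samples are irregular.
irregular = ts.iloc[[0, 1, 2, 7, 8, 11]]
irregular_mean = irregular.rolling("30min").mean()
print(irregular_mean)
```

Note that the two only agree on the regular grid; on the irregular subset the number of samples per window varies with the gaps.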

@zhangweiwu for the example you give, you can also use the freq keyword (which will resample the data to the given frequency) or manually resample the data first.
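A minimal sketch of the "manually resample first" route, written against the later `.asfreq` / `.rolling` methods rather than the `freq` keyword of the old `rolling_mean`. The data and the choice of `min_periods=1` are assumptions made for illustration. Putting the series on a regular daily grid turns the gap into explicit NaN rows, so a count-based window of 3 then really spans 3 days.

```python
import pandas as pd

# Daily data with a gap: Jan 1-3, then Jan 6-7 (illustrative values).
idx = pd.date_range("2000-01-01", periods=3).union(
    pd.date_range("2000-01-06", periods=2)
)
gappy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# Count-based window on the raw series: the gap is ignored, so the
# value at Jan 6 averages Jan 2, Jan 3 and Jan 6.
naive = gappy.rolling(3).mean()

# Reindex to a regular daily grid; the missing days become NaN.
daily = gappy.asfreq("D")
# A 3-row window now covers 3 calendar days; NaN rows are skipped.
rolled = daily.rolling(3, min_periods=1).mean()

print(naive)
print(rolled)
```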

@ajcremona

Is there any update on this? Very interested in this - very useful for irregular time series that are large data sets

@jreback jreback modified the milestones: 0.18.2, Next Major Release Jun 26, 2016
@vik748

vik748 commented Jul 14, 2022

Folks, I would like to help add a similar feature for dataframes with a scalar index or column. It looks like all current windows are based on the number of samples around the point of interest. Any tips or thoughts on where I should start and what the feature should look like?

For example, suppose we have a dataframe with a column "total distance travelled" and another "total fuel used", and the rows can be irregular. I'd like to compute things like the max and the fuel consumed per 1 km travelled. You get the gist.
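One way the distance-based example could be approached today, sketched under several assumptions: the column names and values are made up, both columns are cumulative and non-decreasing, and fuel use is linearly interpolated between rows. This is not an existing pandas windowing feature, just plain NumPy on the two columns.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative readings at irregular intervals.
df = pd.DataFrame({
    "total_distance_km": [0.0, 0.4, 1.0, 1.5, 3.0],
    "total_fuel_l": [0.0, 0.04, 0.10, 0.16, 0.31],
})

dist = df["total_distance_km"].to_numpy()
fuel = df["total_fuel_l"].to_numpy()
window_km = 1.0

# Cumulative fuel at the point window_km before each row,
# linearly interpolated between the recorded rows (an assumption).
fuel_at_start = np.interp(dist - window_km, dist, fuel)
fuel_last_km = fuel - fuel_at_start
# Rows that have not yet covered a full window are left undefined.
fuel_last_km[dist - dist[0] < window_km] = np.nan

df["fuel_last_km_l"] = fuel_last_km
max_fuel_per_km = np.nanmax(fuel_last_km)
print(df)
print(max_fuel_per_km)
```

A built-in feature would presumably need to decide the same questions this sketch hard-codes: which column defines the window, whether the window is trailing or centered, and how to treat rows with less than a full window behind them.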


8 participants