ENH: add time-window capability to .rolling #13513
Conversation
cc @jorisvandenbossche — if anyone would like to try to fix up min/max that would be great. Here is the same PR with a priority heap (but it's a fair bit slower; I am probably doing something non-optimal). |
closes #936 |
yeah we'll close #936 but I think that impl is a bit different. |
Current coverage is 84.39%

```
@@            master    #13513    diff @@
==========================================
  Files          142       142
  Lines        51235     51299     +64
  Methods          0         0
  Messages         0         0
  Branches         0         0
==========================================
+ Hits         43234     43296     +62
- Misses        8001      8003      +2
  Partials         0         0
```
Looks good to me. Regarding max/min, is the priority list faster than just recomputing the maximum value from scratch for each iteration? Caching and branch prediction can work wonders for small sample sizes. |
Long awaited feature! Will try to review this today. |
Take a look at this: https://people.cs.uct.ac.za/~ksmith/articles/sliding_window_minimum.html. You don't want to use a heap for min/max as it's worst case O(n * log(n)). There is a more efficient algorithm that is O(N). It also should be a bit easier to implement as all you really need for a data structure is a deque. That's the algorithm that the existing implementation was using. |
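The O(n) deque algorithm from that article can be sketched in plain Python (an illustrative sketch, not the pandas implementation):

```python
from collections import deque

def sliding_window_min(values, window):
    """O(n) sliding-window minimum using a monotonic deque of indices."""
    dq = deque()  # indices whose values are strictly increasing; dq[0] is the current min
    out = []
    for i, v in enumerate(values):
        # drop candidates from the back that can never be the minimum again
        while dq and values[dq[-1]] >= v:
            dq.pop()
        dq.append(i)
        # drop the front index once it falls out of the window
        if dq[0] <= i - window:
            dq.popleft()
        out.append(values[dq[0]])
    return out

print(sliding_window_min([3, 1, 4, 1, 5, 9, 2, 6], 3))
# -> [3, 1, 1, 1, 1, 1, 2, 2]
```

Each index enters and leaves the deque at most once, so the amortized cost per element is O(1), whereas a heap pays O(log n) per update.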
ok, silly of me not to test the completely naive solution, which is quite fast. It still scales as O(n) in the window size, but now the constant is 1/10 of my priority-list constant. So this is pretty reasonable.
|
@joshuastorck yes, the deque is theoretically faster, but it was very awkward with a variable-sized window; others welcome to take a shot though! |
Meta-comment: since getting exposed to Gerrit over the last couple of years, large reviews on GitHub PRs tend to encourage less careful review (because it's sometimes difficult to keep track of what still needs doing), at least for me. It might be worth playing around with gerrithub.io sometime to mirror patches there for review. Not sure what the best solution is. |
```python
if index is None:
    index = self._on
    return index, index.asi8, self.win_unit
return index, index, self.win_unit
```
This is a little bit opaque. Is the purpose of the 2nd entry in the tuple to ensure you are always passing integers to the Cython function?
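For context on the second entry in that tuple: `DatetimeIndex.asi8` exposes the timestamps as an int64 nanosecond view, which is what lets the Cython kernels operate purely on integers (an illustrative snippet, not code from this PR):

```python
import pandas as pd

# a small DatetimeIndex; asi8 gives int64 nanoseconds since the Unix epoch
idx = pd.to_datetime(["2016-01-01", "2016-01-02"])
ints = idx.asi8

print(ints.dtype)  # int64
print(ints[0])     # 1451606400000000000
```

A plain integer index needs no such conversion, which is presumably why that branch returns `index` twice.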
This looks like it was quite a slog -- we should add to our ever-increasing TODO list a review of all of our aggregation code and ways to foster reuse or more flexibility. Trying to think if there was a way this could have been made less painful. C++ templates could help (for example, a particular aggregator could be passed as a template argument, so the update inner loop could all get inlined at compile time -- e.g. the add_sum / remove_sum bits). But we'd need to decide if we want to commit to the route of writing more C++ and less Cython (I think there are good arguments for it, but evolving the build-and-deployment tools to accommodate that would, as discussed, be an upfront investment of work) -- this is somewhat orthogonal to the bigger discussion of writing more of the data structure internals in C or C++. |
Also, I'm always happy to see 4+ year-old issues that were once just a twinkle in my eye get realized, so better late than never =) Great stuff |
I think there are a couple of compelling arguments for going to C++ templates, mostly that you can vary the implementation based on type at compile time. For example, since integer columns don't have NaNs, you don't really need the min_periods for rolling_* functions and could use SFINAE for that distinction. Alternatively, you could put 'if' statements inside the inner loops that are based on type, such as `if std::is_integral<T>::value`. With the rolling functions, much of the looping code is common, but the algorithm and state vary. You can define a rolling_apply function that takes iterators, a functor/lambda and variadic arguments to hold the state. And it goes without saying that there are a lot of possibilities for using SIMD when dealing with contiguous columns. The downside with C++ is that you have to do a lot of boilerplate dispatching code to templates based on numpy type, the cross-platform compile gets more complicated, and the code could arguably be harder to read. At a minimum, it would be worth comparing the runtime of hand-converted Cython code, as the code generator for Cython introduces a lot of constructs that might make it harder for the compiler to do optimizations, such as sanity checks in inner loops that could probably be done up front. |
@jreback There is a memory leak when using |
@jorisvandenbossche I didn't change anything w.r.t. |
With the example from your notebook: |
@jorisvandenbossche pushed. It's not implemented. |
```diff
@@ -3,15 +3,16 @@
 v0.19.0 (August ??, 2016)
 -------------------------

 This is a major release from 0.18.2 and includes a small number of API changes, several new features,
```
@jorisvandenbossche slight changes needed here (which are in master). done here i think is fine.
ok, well I'd like to push this into master now; now that we are on 0.19, this can be tried out for a bit by the deer-in-the-headlights crowd who run pandas master 👯 If there are later suggestions for a different API, then those can be addressed then. To be honest, this is a highly theoretical use case where you actually want separate args. So I think that we'd need to go whole hog and deprecate But that is really -1 from me. |
5a8d290 to 9af83c7: xref pandas-dev#13327; CLN: pep for cython, xref pandas-dev#12995
any final comments? / merging soon. |
Well, I'm still not happy with this API choice but I suppose I can live with @wesm's suggested compromise. It seems very awkward to me, though. |
And to be clear, by "awkward" I mean that I think the current implementation will lead to bugs and confusion in user code. But that's the price we pay for convenience... |
@shoyer so to make sure I understand the core concern, it's that this overloads the semantics of the window argument -- either you get a fixed number of periods or a variable-length window based on the type of input? We could certainly introduce auxiliary APIs. On the other hand, you could imagine having code like
(this might be a lot more complex in practice), and this could be easily switched from variable window to fixed window through the |
At least, I feel the current state of things is much better / more consistent than the earlier smorgasbord of |
Yes, this is my concern. I strongly prefer function arguments that only do one thing, independent of the type of the data -- especially when the behavior is potentially ambiguous, as in this case.
Yes, this is another clean way to handle the API.
Indeed, it's certainly more convenient -- at least at first -- to reuse a single argument for these two different purposes. This could even make sense for special purpose code, e.g., if the user is happy to make the assumption that
Totally agreed! I'm very happy we were able to switch to the |
This is smaller, but is Although I suggested |
we use on for pd.merge already |
We are a bit stuck it seems ... Below I tried to list the possibilities to go forward based on the discussions above (note: all used keyword/method names are of course open for discussion :-), I used 'size' and 'delta' for now to make a distinction)
|
I still don't buy that this overloading is bad in any way. I'd like @shoyer to convince me why having the API in this PR is not a good idea.
This is just bunk. It is more awkward to have confusing arguments in the first place, which is what options 2/3 by @jorisvandenbossche suggest. Python is NOT C++, nor is it Java. We should be pythonic and follow current conventions. Pandas DOES quite a lot of inference. The POINT is to have convenience. Trying to change something just for the sake of some theoretical purity is not in the spirit of python (and pandas at all). In fact, I would argue that making this NOT similar to simple, unambiguous usage is a MAJOR break in and of itself. You have to have a VERY convincing argument to actually change things. |
I agree with @jreback. Overloading the parameters makes things pretty easy since the user needs only to know about one input. Let the machine figure-out what to do with it. |
There is certainly precedent for this sort of behavior in pandas, but it's not the good kind -- it's confusing features like Can you find a single example of this sort of thing in the Python standard library? In my view it is very non-Pythonic, in addition to being poor software design more generally. "In the face of ambiguity, refuse the temptation to guess." |
What feels different about this one (vs. I do see the point about adding an interval window in the future... but even then - say this was the API - is the second row really ambiguous?

```python
df.rolling('2s').sum()
df.rolling(5).sum()
df.rolling(numeric_interval=5.0).sum()
```
|
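To make the two dispatch modes under discussion concrete, here is a hedged sketch (invented data; assumes pandas ≥ 0.19, where offset windows cover the right-closed interval `(t - window, t]` and default `min_periods` to 1):

```python
import pandas as pd

idx = pd.to_datetime(
    ["2016-01-01 00:00:00", "2016-01-01 00:00:02", "2016-01-01 00:00:03"]
)
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# integer window: a fixed count of observations
# (min_periods defaults to the window size, so the first result is NaN)
counts = s.rolling(2).sum()   # [NaN, 3.0, 5.0]

# offset window: whatever observations fall within the last 2 seconds;
# the point exactly 2s earlier is excluded because the interval is (t - 2s, t]
times = s.rolling("2s").sum()  # [1.0, 2.0, 5.0]
```

The same call site thus produces a fixed-size or a variable-size window purely based on the argument's type, which is the overloading being debated above.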
Here's my advice for how to move forward:
While not exactly equivalent, I see this as similar to 'ndarray.astype' -- effectively what we are doing is coercing the argument to a Window spec. We might consider adding a "WindowSpec" type of object that can be passed. The variable windows will only be used by a pretty narrow subset of users -- I agree we should continue to be careful about argument overloading. |
@chris-b1 I agree, this is not as bad as @wesm OK, I can live with that. I'm pretty sure that we'll eventually need to add explicit keyword arguments, because there's no other sane way to handle fixed label width rolling windows for non-time labels.
|
@shoyer I think you missed my original point. pandas does lots of 'magical' things. Changing something to non-magical will be a huge change. Yes you can attempt to forestall increasing the magic, but decreasing magic itself is a rather large API change. You certainly could, but it should then be vetted like any other API change. merging |
This is exactly why I argued for not adding the magic here in the first place.... |
pandas already has magic around these types of things. changing it is bad. (unless you basically change everything) |
hey folks, just a meta-comment on this discussion. While these are all plain-text exchanges and tone is devilishly hard to convey, I encourage everyone to be particularly mindful of how this discussion might read to outside observers or would-be contributors to the project.

I realize that the disagreement about the API dragged on for about 2 weeks, and I think we would all agree that 2 weeks is too long; some frustration about this is understandable (always be closing, right?). Can we come up with a working plan for resolving these sorts of contentious issues in the future? I should have intervened sooner to resolve the impasse (bad timing: I had a particularly busy couple of weeks, and was also on vacation and not looking at e-mail / GitHub very much). The plan might be just to bug me to help arbitrate.

As for how specific points of view were argued, as frustrating as it may be, there are some things we should definitely avoid. I hate to nail a particular comment to the wall, but: #13513 (comment) — statements like "this is just bunk" are not helpful. It is OK to place the burden of proof on the person raising objections, but a more constructive way would be "I still disagree with you about this. Can you give some concrete examples illustrating why you feel this way?" We won't always agree, but we should always strive to validate others' concerns and keep the conversation both constructive and productive (which I think we can do without being needling or dismissive).

I agree that pandas has teetered on the spectrum between preciseness / unambiguity and convenience, and we (e.g. yours truly) have made some lapses in judgment that have been costly to remedy (like

Thanks |
I find the time-based window implementation to be confusing.
As @magicmathmandarin pointed out, min_periods does not work as expected, just try:
But that seems to be due to the variability of the sampling rate |
I guess you could recover the correct output by doing something like:
|
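For reference on the `min_periods` confusion above: with an offset window, `min_periods` counts the observations actually present in each variable-size window, so on irregularly sampled data it can yield NaN even in the middle of the series. A hedged sketch with invented data:

```python
import pandas as pd

idx = pd.to_datetime(
    ["2016-01-01 00:00:00", "2016-01-01 00:00:01", "2016-01-01 00:00:05"]
)
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# each window is the interval (t - 3s, t]; min_periods=2 requires at
# least two observations inside it
result = s.rolling("3s", min_periods=2).sum()
# first point:  only itself in the window            -> NaN
# second point: 00:00:00 and 00:00:01 both inside    -> 3.0
# third point:  only 00:00:05 lies in (00:02, 00:05] -> NaN
```

So a sparse stretch of data produces NaN wherever fewer than `min_periods` points fall inside the time window, which matches the behavior reported, not a bug in counting.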
xref #13327
closes #936
This notebook shows the use case.

This implements time-aware windows; IOW, to a `.rolling()` you can now pass a ragged / sparse timeseries and have it work with an offset (e.g. like `2s`). Previously you could achieve these results by resampling first, then using an integer period (though you would have to jump through hoops when crossing day boundaries and such). This now provides a nice, easy / performant implementation (as indicated in the linked notebook, the min/max impl is giving scaling issues).
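For illustration, a hedged sketch contrasting the old resample-then-roll workaround with the new offset window (data and variable names are invented; assumes pandas ≥ 0.19):

```python
import pandas as pd

# a ragged timeseries: note the gap between 00:00:01 and 00:00:04
idx = pd.to_datetime(
    ["2016-01-01 00:00:00", "2016-01-01 00:00:01", "2016-01-01 00:00:04"]
)
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# new: roll directly over the ragged index with a time offset
new = s.rolling("2s").sum()   # [1.0, 3.0, 3.0]

# old workaround: regularize to a fixed frequency first,
# then use an integer window over the resampled result
old = s.resample("1s").sum().rolling(2, min_periods=1).sum()
```

The offset version keeps the original (ragged) index and needs no up-front resampling, which is the convenience this PR adds.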