Adds cumulative operators to API #812


Merged
2 commits merged into pydata:master on Oct 3, 2016

Conversation

pwolfram (Contributor) commented Mar 31, 2016

This PR adds cumsum and cumprod as discussed in #791, as well as ensuring cumprod works for the API, resolving issues discussed in #807.

TO DO (dependencies)

This PR extends infrastructure to support cumsum and cumprod (#791).

References:

cc @shoyer, @jhamman

pwolfram (Contributor, Author):

@mrocklin and @shoyer, will we need to modify the definitions for nanprod, nancumsum, and nancumprod in dask for this to work in multi-threaded mode? I took a quick look and they appear to be defined at https://github.com/dask/dask/blob/d82cf2ac3fa3a61912b7934afe7b2fe9e14cc4ff/dask/array/__init__.py#L17-L22, so I'm assuming xarray/dask should just work once the issues on the xarray end are resolved, but I wanted to double-check that this is the case.

shoyer (Member) commented Mar 31, 2016

@pwolfram dask will also definitely need nancumsum/nancumprod functions. These should be quite straightforward to add though given the existing infrastructure.
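For context, the NaN-skipping semantics being requested can be sketched in plain numpy. This is only an illustration of the behavior (NaN treated as 0 for cumulative sums and 1 for cumulative products), not dask's actual implementation; numpy itself later shipped these as np.nancumsum/np.nancumprod:

```python
import numpy as np

def nancumsum(a, axis=None):
    # Treat NaN as the additive identity (0) so it doesn't poison the running sum.
    return np.cumsum(np.where(np.isnan(a), 0.0, a), axis=axis)

def nancumprod(a, axis=None):
    # Treat NaN as the multiplicative identity (1) so it doesn't poison the product.
    return np.cumprod(np.where(np.isnan(a), 1.0, a), axis=axis)
```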

pwolfram (Contributor, Author) commented Apr 1, 2016

@shoyer see dask/dask#1077 for the nancumsum and nancumprod dask PR

@pwolfram pwolfram force-pushed the add_cumsum_cumprod branch from 72b6e94 to 9ca86d4 on April 7, 2016 18:39
pwolfram (Contributor, Author) commented Apr 7, 2016

@shoyer, it looks like I'll need a _reduce_method-like method that doesn't reduce the dimensions, e.g., https://github.com/pydata/xarray/blob/master/xarray/core/common.py#L11. I haven't been able to get the current branch to work properly (it returns a numpy array instead of an xarray datatype), and something seems to be missing. Am I on the right track that I need to add a new abstract method? This seems overly complicated, but I haven't been able to get it to work cleanly otherwise, e.g., trying things like

    f = _func_slash_method_wrapper(method, name)
    setattr(cls, name, cls._binary_op(f))

Some advice or help to get me out of my naivete would be greatly appreciated. Thanks!

jhamman (Member) commented May 11, 2016

@pwolfram - how's this going? Are we still stuck on the above question?

pwolfram (Contributor, Author):

@jhamman, I need to get back to this when I can make the time but will let you know if I have more trouble. Thanks for following up.

pwolfram (Contributor, Author) commented Sep 20, 2016

@shoyer and @jhamman, here is the general use of the cumsum and cumprod operators. Note that I probably need some type of error checking for the dask version (e.g., we require a version of dask with nancumsum, nanprod, and nancumprod). What is the standard way to do this?

For example, dask 0.11.0 works but dask 0.8.1 does not and returns an error.
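A minimal sketch of the kind of version guard being asked about, assuming plain string version parsing; the helper names and the 0.11.0 cutoff (the version reported to work above) are illustrative, not xarray's actual check:

```python
def _version_tuple(version):
    # Turn 'X.Y.Z' into a comparable tuple of ints, e.g. '0.11.0' -> (0, 11, 0).
    return tuple(int(part) for part in version.split('.')[:3])

def dask_has_nancum(dask_version, minimum='0.11.0'):
    # Hypothetical guard: require a dask new enough to ship nancumsum/nancumprod.
    return _version_tuple(dask_version) >= _version_tuple(minimum)
```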

@pwolfram pwolfram force-pushed the add_cumsum_cumprod branch 3 times, most recently from 70ae93b to 13ace00 on September 20, 2016 19:50
pwolfram (Contributor, Author):

@shoyer, @jhamman, and @MaximilianR, this should be ready for a preliminary review, because it works. The key things missing are a check for the dask version and potentially more testing. Thoughts on these issues are greatly appreciated.

@@ -893,16 +893,27 @@ def reduce(self, func, dim=None, axis=None, keep_attrs=False,
if dim is not None and axis is not None:
raise ValueError("cannot supply both 'axis' and 'dim' arguments")

if 'cum' in func.__name__:
shoyer (Member) commented Sep 21, 2016

Can we put these error checks in ops.cumsum instead? It's poor separation of concerns to do these checks over here.

pwolfram (Contributor, Author):

I don't think so, unless there is a clever way to do this that I'm missing. It looks like I'd need something like a new _partial_reduce_method for cumsum, cumprod, prod, etc. However, this would require additional lines of code beyond the simplistic approach I've taken, though it may be necessary to get this to work cleanly in general. For instance, I don't think the existing code works properly with Dataset yet.

pwolfram (Contributor, Author):

@shoyer, this obviously changes the comments below, because we need this to work for both Dataset and DataArray, and I need to verify the existing implementation does just that.

pwolfram (Contributor, Author):

@shoyer, I double-checked. This appears to work for both DataArray and Dataset, but it would be good to have some tests to ensure this functionality keeps working in the future.

pwolfram (Contributor, Author):

I've added some tests and will push them soon.

pwolfram (Contributor, Author):

Pushed.

pwolfram (Contributor, Author):

I didn't individually test DataArray and Dataset functionality, but I assume it is inherited from Variable. Hence, I've tested the Variable methods for cumsum and cumprod for now.

shoyer (Member):

I still don't like this approach. It feels very fragile.

A slightly saner approach would be another function attribute, like the numeric_only attribute we use in ops.py. Then this check could be: keep_dims = getattr(func, 'keep_dims', False).
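The attribute-based dispatch suggested here can be sketched like this (the function and helper names are illustrative, not xarray's real internals):

```python
import numpy as np

def cumsum(values, axis=None):
    return np.cumsum(values, axis=axis)
cumsum.keep_dims = True  # tag: this operation preserves the input's dimensions

def total(values, axis=None):
    return np.sum(values, axis=axis)
# no keep_dims tag: an ordinary reduction that removes the reduced axis

def surviving_axes(func, ndim, axis):
    # Consult the attribute instead of sniffing func.__name__ for 'cum'.
    if getattr(func, 'keep_dims', False):
        return list(range(ndim))
    return [n for n in range(ndim) if n != axis]
```

Tagging the function keeps the knowledge of its shape behavior next to its definition, so reduce() never has to inspect names.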

pwolfram (Contributor, Author):

Removed in favor of the keep_dims approach.

if n not in removed_axes]

if 'cum' in func.__name__:
shoyer (Member):

Same as above -- let's keep Variable.reduce unmodified for now. I think we'll want to actually implement this using xarray.apply, anyways.

shoyer (Member):

Actually, I guess it's OK to keep this -- it is all we need to actually implement cumsum, after all, but let's do the shape check unilaterally, without looking at the function name.

pwolfram (Contributor, Author) commented Sep 21, 2016

@shoyer, I had originally implemented it that way, but this introduced a bug related to the use of prod, if I recall correctly.

pwolfram (Contributor, Author):

Thankfully, it appears to work without issue on re-examination.

pwolfram (Contributor, Author):

Modified as you discussed below

return _prod(values, axis=axis, **kwargs)
prod.numeric_only = True

def cumsum(values, axis=None, skipna=None, **kwargs):
shoyer (Member):

Can you make a wrapper function that generates these functions, instead of repeating the logic three times for prod/cumsum/cumprod?

pwolfram (Contributor, Author):

Better than that -- I extended _create_nan_agg_method to generalize the functionality, which removes quite a few lines of code. See d1c1077.
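The factory pattern described above can be sketched roughly as follows. This is a heavy simplification of xarray's actual _create_nan_agg_method, assuming a numpy new enough to ship nancumsum/nancumprod (1.12+):

```python
import numpy as np

def _create_nan_agg_method(name, keep_dims=False):
    # Look up both the plain and NaN-skipping numpy functions by name.
    nan_func = getattr(np, 'nan' + name)
    plain_func = getattr(np, name)

    def method(values, axis=None, skipna=None):
        # Skip NaNs by default for float data, as xarray's reductions do.
        use_nan = skipna or (skipna is None and values.dtype.kind == 'f')
        func = nan_func if use_nan else plain_func
        return func(values, axis=axis)

    method.__name__ = name
    method.keep_dims = keep_dims
    return method

# One factory call per operation replaces three hand-written copies.
cumsum = _create_nan_agg_method('cumsum', keep_dims=True)
cumprod = _create_nan_agg_method('cumprod', keep_dims=True)
prod = _create_nan_agg_method('prod')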

shoyer (Member):

Nice

@@ -274,7 +274,8 @@ def _ignore_warnings_if(condition):
yield


def _create_nan_agg_method(name, numeric_only=False, coerce_strings=False):
def _create_nan_agg_method(name, numeric_only=False, np_compat=False,
no_bottleneck=False, coerce_strings=False):
shoyer (Member):

keep indentation aligned with ( per PEP8

else:
eager_module = bn
func = _dask_or_eager_func(nanname, eager_module)
using_numpy_nan_func = eager_module is np
using_numpy_nan_func = (eager_module is np) or (eager_module is npcompat)
shoyer (Member):

just a nit: is has higher precedence than or, so you don't need parentheses

pwolfram (Contributor, Author):

Thanks @shoyer!

if n not in removed_axes]

if 'cum' in func.__name__:
    def safe_shape(val):
        return val.shape if type(val) is np.ndarray else ()
shoyer (Member):

this should probably be just getattr(val, 'shape', ()) (dask arrays have shape defined, too)
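The suggested getattr form handles arrays and scalars uniformly, e.g.:

```python
import numpy as np

def safe_shape(val):
    # Works for np.ndarray, dask arrays, or anything else exposing .shape;
    # plain Python scalars fall back to an empty tuple.
    return getattr(val, 'shape', ())
```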

pwolfram (Contributor, Author):

Thanks @shoyer!

@pwolfram pwolfram force-pushed the add_cumsum_cumprod branch 2 times, most recently from 52b737f to 4087bfc on September 21, 2016 20:32
pwolfram (Contributor, Author):

@shoyer, this should be ready for another review. I also tested it with this somewhat hacky code at https://gist.github.com/a329d441fe99ae342a34b1a374650138. It may be good to get some type of test like this into the test suite. However, the correct location for testing these methods in general is not obvious to me. It doesn't look like we broadly check reduction operations with NaNs (e.g., prod) outside the test_variable.py file. I have made additions here, but broader testing may be useful.

def safe_shape(val):
    return getattr(val, 'shape', ())

if safe_shape(data) == safe_shape(self.data):
shoyer (Member):

This should be done as an alternative to calculating dims a few lines above, e.g.,

if getattr(data, 'shape', ()) == self.shape:
    dims = self.dims
else:
    removed_axes = ...
    dims = [adim for n, adim in enumerate(self.dims) ...]

Otherwise we calculate those dimensions just to throw them away.

pwolfram (Contributor, Author):

Agreed

shoyer (Member) commented Sep 21, 2016

We don't need to verify that every value is exactly as expected for Dataset/DataArray, but we should verify the general API (e.g., do at least one of cumsum/cumprod and make sure the result has the right dimensions and errors when it should). See the various test_reduce methods in test_dataset.py for examples.
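In that spirit, a minimal API-level sanity check might look like the following (assuming an xarray with this PR merged; the exact placement in the test suite is left open):

```python
import numpy as np
import xarray as xr

da = xr.DataArray([[1.0, 2.0], [3.0, 4.0]], dims=('x', 'y'))

# Unlike a plain reduction, cumsum keeps every input dimension.
result = da.cumsum('x')
assert result.dims == ('x', 'y')
assert np.allclose(result.values, [[1.0, 2.0], [4.0, 6.0]])
```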

pwolfram (Contributor, Author):

@shoyer, I think this fixes the concerns you raised including the testing. Thanks for all the tips!

shoyer (Member) commented Sep 29, 2016

Can you add a basic sanity check for DataArray.cumsum?

Otherwise, I think this just needs docs (on the What's New and API pages).

@pwolfram pwolfram force-pushed the add_cumsum_cumprod branch from 6040fb7 to dfbc090 on October 3, 2016 15:58
pwolfram (Contributor, Author) commented Oct 3, 2016

@shoyer, is this what you were thinking?

@@ -62,6 +62,9 @@ By `Robin Wilson <https://github.com/robintw>`_.
overlapping (:issue:`835`) coordinates as long as any present data agrees.
By `Johnnie Gray <https://github.com/jcmgray>`_.

- Adds DataArray and Dataset methods :py:meth:`cumsum` and :py:meth:`cumprod`.
shoyer (Member):

To make the links work, use, e.g., :py:meth:`~DataArray.cumsum`

pwolfram (Contributor, Author) commented Oct 3, 2016

Thanks @shoyer, this is fixed now and the link should work following minor refactoring. However, a search for cumsum does not return the DataArray and Dataset results in my local test, which is very strange.

@pwolfram pwolfram force-pushed the add_cumsum_cumprod branch from dfbc090 to 8817af5 on October 3, 2016 19:43
@pwolfram pwolfram force-pushed the add_cumsum_cumprod branch from 8817af5 to 129c807 on October 3, 2016 20:05
@@ -145,6 +145,8 @@ Computation
:py:attr:`~Dataset.round`
:py:attr:`~Dataset.real`
:py:attr:`~Dataset.T`
:py:attr:`~Dataset.cumsum`
shoyer (Member):

just a nit, these probably belong under the "Aggregation" heading above

@shoyer shoyer merged commit 9cf107b into pydata:master Oct 3, 2016
shoyer (Member) commented Oct 3, 2016

Thanks! Let's see how the docs look at http://xarray.pydata.org/en/latest/whats-new.html in a few minutes after the doc build completes

@pwolfram pwolfram deleted the add_cumsum_cumprod branch October 3, 2016 21:11