
Inconsistency between Sum of NA's and Mean of NA's: resampling gives 0 or 'NA' #2230


Closed · rpnaut opened this issue Jun 13, 2018 · 7 comments


rpnaut commented Jun 13, 2018

Problem description

When mining data with xarray I keep running into the following issue with the resampling method.
If I resample, e.g., a daily time series over one month and the data are 'NA' on every day, I get zero as the result. That is a problem for a precipitation time series: it makes a real difference whether the monthly precipitation is zero (zero precipitation on each day) or the monthly precipitation was simply not measured because of problems with the device (NA on each day).

Data example

I have a dataset 'fcut' with hourly values for 5 months.

<xarray.Dataset>
Dimensions:       (bnds: 2, time: 3672)
Coordinates:
    rlon          float32 22.06
    rlat          float32 5.06
  * time          (time) datetime64[ns] 2006-05-01 2006-05-01T01:00:00 ...
Dimensions without coordinates: bnds
Data variables:
    rotated_pole  int32 1
    time_bnds     (time, bnds) float64 1.304e+07 1.305e+07 1.305e+07 ...
    TOT_PREC      (time) float64 nan nan nan nan nan nan nan nan nan nan nan ...
Attributes:

Resampling gives only zero values for each month.

In [10]: fcut.resample(dim='time',freq='M',how='sum')
Out[10]: 
<xarray.Dataset>
Dimensions:       (bnds: 2, time: 5)
Coordinates:
  * time          (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...
Dimensions without coordinates: bnds
Data variables:
    rotated_pole  (time) int64 1 1 1 1 1
    time_bnds     (time, bnds) float64 1.07e+10 1.07e+10 1.225e+10 1.225e+10 ...
    TOT_PREC      (time) float64 0.0 0.0 0.0 0.0 0.0

But I expect to get NA for each month, as is the case for the operator 'mean'.

I know that there is an ongoing discussion about that topic (see for example pandas-dev/pandas#9422).

For earth science it would be nice to have an option telling xarray what to do when summing values that are all NA. Do you see a chance for a quick fix for this issue in the code?

fujiisoup (Member) commented Jun 13, 2018

Thank you for raising an issue.
Could you try using .sum(skipna=False) for resampled data?

Similar to pandas.DataFrame.sum, our .sum (and other reduction methods) assumes skipna=True unless explicitly specified.
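
For illustration, here is a minimal, self-contained sketch of this suggestion. The toy data and the newer resample(time='M') syntax are assumptions for the example; the thread itself uses the older dim=/freq=/how= form.

import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for the reported case: one month of all-NaN hourly precipitation.
time = pd.date_range("2006-05-01", periods=31 * 24, freq="H")
da = xr.DataArray(np.full(time.size, np.nan),
                  coords={"time": time}, dims="time", name="TOT_PREC")

print(float(da.resample(time="M").sum()))              # 0.0 -- NaNs skipped by default
print(float(da.resample(time="M").sum(skipna=False)))  # nan -- NaNs propagate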

rpnaut (Author) commented Jun 13, 2018

I can overcome this by using

In [14]: fcut.resample(dim='time',freq='M',how='mean',skipna=False)
Out[14]: 
<xarray.Dataset>
Dimensions:       (bnds: 2, time: 5)
Coordinates:
  * time          (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...
Dimensions without coordinates: bnds
Data variables:
    rotated_pole  (time) float64 1.0 1.0 1.0 1.0 1.0
    time_bnds     (time, bnds) float64 1.438e+07 1.438e+07 1.702e+07 ...
    TOT_PREC      (time) float64 nan nan nan nan nan

BUT THE PROBLEM IS:

A) that this behaviour contradicts the computation of a mean. I can always compute a mean with the default option 'skipna=True', regardless of whether I have a few NA's in the time series (the output is a number that ignores the NA's) or only NA's in the time series (the output is NA). This is what I would expect.

B) that setting 'skipna=False' does not allow a computation if even a single value of the time series is NA.

I would like to have the behaviour of the mean operator also for the sum operator.

The developers of the Climate Data Operators (CDO) also decided to give users two options, skipna=True and skipna=False. But skipna=True should result in the same behaviour for both operators (mean and sum).
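
A small, self-contained illustration of the asymmetry described in A) above, using plain DataArray reductions rather than resample (the numbers are made up for the example):

import numpy as np
import xarray as xr

all_nan = xr.DataArray([np.nan, np.nan, np.nan])
some_nan = xr.DataArray([1.0, np.nan, 2.0])

print(float(all_nan.mean()))   # nan -- mean of only-NA data stays NA
print(float(all_nan.sum()))    # 0.0 -- sum of only-NA data collapses to zero
print(float(some_nan.mean()))  # 1.5 -- NA's skipped
print(float(some_nan.sum()))   # 3.0 -- NA's skipped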

@rpnaut rpnaut changed the title Sum of NA's: resampling with method 'sum' gives 0 instead of NA Inconsistency between Sum of NA's and Mean of NA's: resampling gives 0 or 'NA' Jun 13, 2018
shoyer (Member) commented Jun 13, 2018

The difference between mean and sum here isn't resample specific. Xarray consistently interprets an "NA skipping sum" as returning 0 in the case of all-NaN inputs:

>>> float(xarray.DataArray([np.nan]).sum())
0.0

This is consistent with the sum of an empty set being 0, e.g.,

>>> float(xarray.DataArray([]).sum())
0.0

The reason why a "NA skipping mean" is different in the case of all NaN inputs is that the mean simply isn't well defined on an empty set. The mean would literally be a sum of zero divided by a count of zero, which is not a valid number: the literal meaning of NaN as "not a number".

There was a long discussion/debate about this recently in pandas. See pandas-dev/pandas#18678 and links therein. There are certainly use cases where it is nicer for the sum of all-NaN inputs to be NaN (exactly as you mention here), but ultimately pandas decided that the answer for this operation should be zero. The decisive considerations were simplicity and consistency with other tools (including NumPy and R).

What pandas added to solve this use-case is an optional min_count argument (see pandas.DataFrame.sum for an example). We could definitely copy this behavior in xarray if someone is interested in implementing it.
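
For reference, the pandas min_count behaviour mentioned above looks like this (pandas 0.22 or later; the toy series is just for the example):

import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

print(s.sum())             # 0.0 -- default: NaNs are skipped, the empty sum is 0
print(s.sum(min_count=1))  # nan -- fewer than min_count valid values -> NaN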

shoyer (Member) commented Jun 13, 2018

OK, I see you already saw the pandas issues :).

For earth science it would be nice to have an option telling xarray what to do when summing values that are all NA. Do you see a chance for a quick fix for this issue in the code?

Yes, I would be very open to adding a min_count argument.

We could probably copy the implementation of sum with min_count largely from pandas:
https://github.com/pandas-dev/pandas/blob/0c4e611927772af44b02204192b29282341a5716/pandas/core/nanops.py#L329

In xarray this would go into _create_nan_agg_method in https://github.com/pydata/xarray/blob/master/xarray/core/duck_array_ops.py (sorry, this has gotten a little messy!)
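
A rough sketch of what such a min_count-aware sum might look like (illustrative only, not the actual pandas or xarray implementation; the helper name and its signature are hypothetical):

import numpy as np

def _nansum_with_min_count(values, axis=None, min_count=0):
    # Hypothetical helper: NaN-skipping sum that returns NaN wherever the
    # number of valid (non-NaN) values along `axis` is below `min_count`.
    mask = np.isnan(values)
    result = np.nansum(values, axis=axis)
    if min_count > 0:
        valid_count = np.sum(~mask, axis=axis)
        below = valid_count < min_count
        if np.ndim(result) == 0:
            return np.nan if below else result
        result = result.astype(float)
        result[below] = np.nan
    return result

# _nansum_with_min_count(np.array([np.nan, np.nan]), min_count=1)  -> nan
# _nansum_with_min_count(np.array([np.nan, 1.0]), min_count=1)     -> 1.0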

rpnaut (Author) commented Jun 14, 2018

I really have problems reading the code in duck_array_ops.py. The module starts by defining 12 operators. One of them is:

sum = _create_nan_agg_method('sum', numeric_only=True)

I really do not understand where this is going; that is due to my limited experience with object-oriented code. I cannot tell what '_create_nan_agg_method' is doing. I tried to change the code in the method

def _nansum_object(value, axis=None, **kwargs):
    """ In house nansum for object array """
    return _dask_or_eager_func('sum')(value, axis=axis, **kwargs)
    #return np.array(np.nan)

but it seems that this method is never touched during the 'resample().sum()' process.

I need some help to modify the operators. Do you have any hint for me? The pandas code seems much easier to follow.
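
For readers puzzled by the same thing: _create_nan_agg_method is a factory function, i.e. a function that builds and returns the actual reduction function, so sum = _create_nan_agg_method('sum', ...) binds the name sum to a generated NaN-aware sum. A much simplified illustration of that pattern (not the real xarray code):

import numpy as np

def create_nan_agg_method_simplified(name):
    # Build a reduction that dispatches to the NaN-skipping numpy function
    # (e.g. 'sum' -> np.nansum) unless skipna=False is passed.
    nan_func = getattr(np, 'nan' + name)
    plain_func = getattr(np, name)

    def method(values, axis=None, skipna=True, **kwargs):
        func = nan_func if skipna else plain_func
        return func(values, axis=axis, **kwargs)

    method.__name__ = name
    return method

sum_ = create_nan_agg_method_simplified('sum')
# sum_(np.array([np.nan, 1.0]))               -> 1.0
# sum_(np.array([np.nan, 1.0]), skipna=False) -> nan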

fujiisoup mentioned this issue Jun 18, 2018

fujiisoup (Member) commented

@rpnaut, thanks for looking inside the code.
See #2236.

rpnaut (Author) commented Jun 21, 2018

Thank you for addressing this issue in your pull request #2236.
I will switch to commenting on your work in the related thread, but I would leave this issue open until a solution is found for the min_count option.
