
Inconsistency between Sum of NA's and Mean of NA's: resampling gives 0 or 'NA' #2230


Closed · rpnaut opened this issue Jun 13, 2018 · 7 comments


rpnaut commented Jun 13, 2018

Problem description

When mining data with xarray I keep running into the following issue with the resampling method.
If I resample, e.g., a daily time series over one month and the data are 'NA' on every day, I get zero as the result. That is a problem for a precipitation time series: it makes a real difference whether the monthly precipitation is zero (zero precipitation on each day) or the monthly precipitation was simply not measured because of problems with the device (NA on each day).

Data example

I have a dataset 'fcut' with hourly values for 5 months.

<xarray.Dataset>
Dimensions:       (bnds: 2, time: 3672)
Coordinates:
    rlon          float32 22.06
    rlat          float32 5.06
  * time          (time) datetime64[ns] 2006-05-01 2006-05-01T01:00:00 ...
Dimensions without coordinates: bnds
Data variables:
    rotated_pole  int32 1
    time_bnds     (time, bnds) float64 1.304e+07 1.305e+07 1.305e+07 ...
    TOT_PREC      (time) float64 nan nan nan nan nan nan nan nan nan nan nan ...
Attributes:

Resampling gives only zero values for each month.

In [10]: fcut.resample(dim='time',freq='M',how='sum')
Out[10]: 
<xarray.Dataset>
Dimensions:       (bnds: 2, time: 5)
Coordinates:
  * time          (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...
Dimensions without coordinates: bnds
Data variables:
    rotated_pole  (time) int64 1 1 1 1 1
    time_bnds     (time, bnds) float64 1.07e+10 1.07e+10 1.225e+10 1.225e+10 ...
    TOT_PREC      (time) float64 0.0 0.0 0.0 0.0 0.0

But I expect to get NA for each month, as is the case for the operator 'mean'.

I know that there is an ongoing discussion about that topic (see for example pandas-dev/pandas#9422).

For earth science it would be nice to have an option telling xarray what to do when summing values that are all NA. Do you see a chance for a quick fix for this issue in the code?

fujiisoup (Member) commented Jun 13, 2018

Thank you for raising an issue.
Could you try using .sum(skipna=False) for resampled data?

Similar to pandas.DataFrame.sum, our .sum (and other reduction methods) assumes skipna=True unless explicitly specified.
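
For illustration, here is a minimal, self-contained sketch of this suggestion. The toy data and the newer resample(time='M') syntax are assumptions for the example; the thread itself uses the older dim=/freq=/how= form.

import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for the reported case: one month of all-NaN hourly precipitation.
time = pd.date_range("2006-05-01", periods=31 * 24, freq="H")
da = xr.DataArray(np.full(time.size, np.nan),
                  coords={"time": time}, dims="time", name="TOT_PREC")

print(float(da.resample(time="M").sum()))              # 0.0 -- NaNs skipped by default
print(float(da.resample(time="M").sum(skipna=False)))  # nan -- NaNs propagate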

rpnaut (Author) commented Jun 13, 2018

I can overcome this by using

In [14]: fcut.resample(dim='time',freq='M',how='mean',skipna=False)
Out[14]: 
<xarray.Dataset>
Dimensions:       (bnds: 2, time: 5)
Coordinates:
  * time          (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...
Dimensions without coordinates: bnds
Data variables:
    rotated_pole  (time) float64 1.0 1.0 1.0 1.0 1.0
    time_bnds     (time, bnds) float64 1.438e+07 1.438e+07 1.702e+07 ...
    TOT_PREC      (time) float64 nan nan nan nan nan

BUT THE PROBLEM IS:

A) that this behaviour contradicts the computation of a mean. I can always compute a mean with the default option 'skipna=True', regardless of whether I have a few NA's in the time series (the output is a number that ignores the NA's) or only NA's in the time series (the output is NA). This is what I would expect.

B) that setting 'skipna=False' does not allow a computation if even a single value of the time series is NA.

I would like to have the behaviour of the mean operator also for the sum operator.

The developers of the Climate Data Operators (CDO) also decided to give users two options, skipna=True and skipna=False. But skipna=True should result in the same behaviour for both operators (mean and sum).
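
A small, self-contained illustration of the asymmetry described in A) above, using plain DataArray reductions rather than resample (the numbers are made up for the example):

import numpy as np
import xarray as xr

all_nan = xr.DataArray([np.nan, np.nan, np.nan])
some_nan = xr.DataArray([1.0, np.nan, 2.0])

print(float(all_nan.mean()))   # nan -- mean of only-NA data stays NA
print(float(all_nan.sum()))    # 0.0 -- sum of only-NA data collapses to zero
print(float(some_nan.mean()))  # 1.5 -- NA's skipped
print(float(some_nan.sum()))   # 3.0 -- NA's skipped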

@rpnaut rpnaut changed the title Sum of NA's: resampling with method 'sum' gives 0 instead of NA Inconsistency between Sum of NA's and Mean of NA's: resampling gives 0 or 'NA' Jun 13, 2018
shoyer (Member) commented Jun 13, 2018

The difference between mean and sum here isn't resample specific. Xarray consistently interprets an "NA skipping sum" as returning 0 in the case of all-NaN inputs:

>>> float(xarray.DataArray([np.nan]).sum())
0.0

This is consistent with the sum of an empty set being 0, e.g.,

>>> float(xarray.DataArray([]).sum())
0.0

The reason why a "NA skipping mean" is different in the case of all NaN inputs is that the mean simply isn't well defined on an empty set. The mean would literally be a sum of zero divided by a count of zero, which is not a valid number: the literal meaning of NaN as "not a number".

There was a long discussion/debate about this recently in pandas. See pandas-dev/pandas#18678 and links therein. There are certainly use cases where it is nicer for the sum of all-NaN inputs to be NaN (exactly as you mention here), but ultimately pandas decided that the answer for this operation should be zero. The decisive considerations were simplicity and consistency with other tools (including NumPy and R).

What pandas added to solve this use-case is an optional min_count argument (see pandas.DataFrame.sum for an example). We could definitely copy this behavior in xarray if someone is interested in implementing it.
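
For reference, the pandas min_count behaviour mentioned above looks like this (pandas 0.22 or later; the toy series is just for the example):

import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

print(s.sum())             # 0.0 -- default: NaNs are skipped, the empty sum is 0
print(s.sum(min_count=1))  # nan -- fewer than min_count valid values -> NaN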

shoyer (Member) commented Jun 13, 2018

OK, I see you already saw the pandas issues :).

For earth science it would be nice to have an option telling xarray what to do when summing values that are all NA. Do you see a chance for a quick fix for this issue in the code?

Yes, I would be very open to adding a min_count argument.

We could probably copy the implementation of sum with min_count largely from pandas:
https://github.com/pandas-dev/pandas/blob/0c4e611927772af44b02204192b29282341a5716/pandas/core/nanops.py#L329

In xarray this would go into _create_nan_agg_method in https://github.com/pydata/xarray/blob/master/xarray/core/duck_array_ops.py (sorry, this has gotten a little messy!)
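
A rough sketch of what such a min_count-aware sum might look like (illustrative only, not the actual pandas or xarray implementation; the helper name and its signature are hypothetical):

import numpy as np

def _nansum_with_min_count(values, axis=None, min_count=0):
    # Hypothetical helper: NaN-skipping sum that returns NaN wherever the
    # number of valid (non-NaN) values along `axis` is below `min_count`.
    mask = np.isnan(values)
    result = np.nansum(values, axis=axis)
    if min_count > 0:
        valid_count = np.sum(~mask, axis=axis)
        below = valid_count < min_count
        if np.ndim(result) == 0:
            return np.nan if below else result
        result = result.astype(float)
        result[below] = np.nan
    return result

# _nansum_with_min_count(np.array([np.nan, np.nan]), min_count=1)  -> nan
# _nansum_with_min_count(np.array([np.nan, 1.0]), min_count=1)     -> 1.0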

rpnaut (Author) commented Jun 14, 2018

I really have problems reading the code in duck_array_ops.py. The module starts by defining 12 operators. One of them is:

sum = _create_nan_agg_method('sum', numeric_only=True)

I really do not understand where this is going; that is due to my limited experience with object-oriented code. I cannot tell what '_create_nan_agg_method' is doing. I tried to change the code in the method

def _nansum_object(value, axis=None, **kwargs):
    """ In house nansum for object array """
    return _dask_or_eager_func('sum')(value, axis=axis, **kwargs)
    #return np.array(np.nan)

but it seems that this method is never touched during the 'resample().sum()' process.

I need some help to modify the operators. Do you have any hint for me? The pandas code seems much easier to follow.
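
For readers puzzled by the same thing: _create_nan_agg_method is a factory function, i.e. a function that builds and returns the actual reduction function, so sum = _create_nan_agg_method('sum', ...) binds the name sum to a generated NaN-aware sum. A much simplified illustration of that pattern (not the real xarray code):

import numpy as np

def create_nan_agg_method_simplified(name):
    # Build a reduction that dispatches to the NaN-skipping numpy function
    # (e.g. 'sum' -> np.nansum) unless skipna=False is passed.
    nan_func = getattr(np, 'nan' + name)
    plain_func = getattr(np, name)

    def method(values, axis=None, skipna=True, **kwargs):
        func = nan_func if skipna else plain_func
        return func(values, axis=axis, **kwargs)

    method.__name__ = name
    return method

sum_ = create_nan_agg_method_simplified('sum')
# sum_(np.array([np.nan, 1.0]))               -> 1.0
# sum_(np.array([np.nan, 1.0]), skipna=False) -> nan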

fujiisoup mentioned this issue Jun 18, 2018

fujiisoup (Member) commented

@rpnaut, thanks for looking inside the code.
See #2236.

rpnaut (Author) commented Jun 21, 2018

Thank you for addressing this issue in your pull request #2236.
I will switch to commenting on your work in the related thread, but I would leave this issue open until a solution is found for the min_count option.
