Inconsistency between Sum of NA's and Mean of NA's: resampling gives 0 or 'NA' #2230
Comments
Thank you for raising an issue. As similar to |
I can overcome this by using
But the problem is: (A) this behaviour contradicts the computation of a mean. I can always compute a mean with the default option `skipna=True`, regardless of whether the time series contains a few NAs (the output is a number that ignores the NAs) or only NAs (the output is NA). This is what I would expect. (B) Setting `skipna=False` does not allow the computation if even a single value of the time series is NA. I would like the sum operator to behave like the mean operator. For the Climate Data Operators (CDO), the developers also decided to give users two options, `skipna=True` and `skipna=False`. But `skipna=True` should result in the same behaviour for both operators (mean and sum).
The difference between
This is consistent with the sum of an empty set being 0, e.g.,
The reason why a "NA skipping mean" is different in the case of all-NaN inputs is that the mean simply isn't well defined on an empty set. The mean would literally be a sum of zero divided by a count of zero, which is not a valid number: the literal meaning of NaN as "not a number".
There was a long discussion/debate about this recently in pandas. See pandas-dev/pandas#18678 and links therein. There are certainly use cases where it is nicer for the sum of all-NaN inputs to be NaN (exactly as you mention here), but ultimately pandas decided that the answer for this operation should be zero. The decisive considerations were simplicity and consistency with other tools (including NumPy and R).
What pandas added to solve this use case is an optional
OK, I see you already saw the pandas issues :).
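The distinction discussed above can be checked directly in NumPy. This is a minimal sketch, not xarray's or pandas' implementation: `nansum_min_count` and its `min_count` parameter are hypothetical names introduced here only to illustrate a sum that reports NaN when too few valid values are present.

```python
import warnings

import numpy as np

all_nan = np.array([np.nan, np.nan, np.nan])

# The sum over an empty set is the additive identity.
empty_sum = np.sum(np.array([]))

# A NaN-skipping sum over all-NaN input has nothing left to add.
skipna_sum = np.nansum(all_nan)

# A NaN-skipping mean over all-NaN input is 0 / 0, which has no defined value.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)  # silences "Mean of empty slice"
    skipna_mean = np.nanmean(all_nan)


def nansum_min_count(values, min_count=1):
    """Hypothetical helper: NaN-skipping sum that is NaN if too few values are valid."""
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    if valid.sum() < min_count:
        return np.nan
    return values[valid].sum()


print(empty_sum, skipna_sum, skipna_mean)
print(nansum_min_count(all_nan), nansum_min_count([1.0, np.nan, 2.0]))
```

With a threshold of one valid value, the helper keeps the usual NaN-skipping sum for partly missing data and only returns NaN when every value is missing, which is the behaviour requested in this issue.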
Yes, I would be very open to adding a We could probably copy the implementation of In xarray this would go into |
I really have problems in reading the code in duck_array_ops.py. The program starts with defining 12 operators. One of them is:
I really do not understand where this is heading. That's due to my limited programming skills with object-oriented code. I have no idea what `_create_nan_agg_method` is doing. I tried to change the code in method
but it seems that this method is never called during the `resample().sum()` process. I need some help to actually modify the operators. Is there any hint for me? The pandas code seems much easier to follow.
Thank you for considering this issue in your pull request #2236.
Problem description
When data mining with xarray, I keep running into the following issue with the resampling method.
If I resample, e.g., a daily time series to monthly values and the data are NA on every day, I get zero as a result. That is a problem for a time series such as precipitation: it makes a real difference whether the monthly precipitation is zero (zero precipitation on every day) or was not measured because of problems with the device (NA on every day).
Data example
I have a dataset 'fcut' with hourly values for 5 months.
Doing a resample process gives only zero values for each month.
But I expect to get NA for each month, as is the case for the operator 'mean'.
I know that there is an ongoing discussion about that topic (see for example pandas-dev/pandas#9422).
For earth science it would be nice to have an option telling xarray what to do when all values in a sum are NA. Do you see a chance of a quick fix for this issue in the code?
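The behaviour described in this report can be reproduced with a minimal sketch. The data are hypothetical (one dry month followed by one unmeasured month), not the 'fcut' dataset. The last line assumes an xarray release whose `sum` accepts a `min_count` argument.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily precipitation: January was measured and dry (all 0.0),
# February was not measured at all (all NaN).
time = pd.date_range("2018-01-01", "2018-02-28", freq="D")
values = np.where(time.month == 1, 0.0, np.nan)
precip = xr.DataArray(values, coords={"time": time}, dims="time", name="precip")

# Default NaN-skipping reductions: the sum cannot tell the two months apart,
# while the mean marks the unmeasured month as NaN.
monthly_sum = precip.resample(time="MS").sum()
monthly_mean = precip.resample(time="MS").mean()

# Assumption: this xarray version supports min_count on sum, so a month with
# fewer than one valid value is reported as NaN instead of 0.
monthly_sum_min_count = precip.resample(time="MS").sum(min_count=1)

print(monthly_sum.values)
print(monthly_mean.values)
print(monthly_sum_min_count.values)
```

Here the dry month and the unmeasured month both sum to zero by default, which is exactly the ambiguity that matters for precipitation records.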