Dataset.mean changes variables without specified dimension #4885


Closed
dschwoerer opened this issue Feb 10, 2021 · 2 comments · Fixed by #5207

Comments

@dschwoerer
Contributor

What happened:
If I apply mean(dim="time") to a dataset, variables without that dimension are changed.

What you expected to happen:
Variables without the dimension are not changed.

Minimal Complete Verifiable Example:

import xarray as xr

ds = xr.Dataset()
ds["pos"] = [1, 2, 3]
ds["data"] = ("pos", "time"), [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ds["var"] = "pos", [2, 3, 4]
print(ds.mean(dim="time"))
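For illustration, here is a pure-Python sketch (using statistics.fmean as a stand-in for the reduction; not xarray code) of the dtype side effect reported above: even a mean over a single sample returns a float, so an integer variable comes back changed even when there is nothing to reduce over.

```python
from statistics import fmean

vals = [2, 3, 4]                      # like the integer "var" above
print(fmean(vals))                    # 3.0 -> a float
print([fmean([v]) for v in vals])     # [2.0, 3.0, 4.0] -> ints became floats
```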

Anything else we need to know?:
This also makes the operation unnecessarily slow, as variables without that dimension would not need to be read from disk.
It is easy enough to work around:

ds2 = ds.copy()
for k in ds:
    if "time" in ds[k].dims:
        ds2[k] = ds[k].mean(dim="time")

However, I cannot see why Dataset.mean should change variables that lack the specified dim.

Environment:

Output of xr.show_versions():

INSTALLED VERSIONS
------------------

commit: None
python: 3.9.1 (default, Jan 20 2021, 00:00:00)
[GCC 10.2.1 20201125 (Red Hat 10.2.1-9)]
python-bits: 64
OS: Linux
OS-release: 5.10.13-200.fc33.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.3

xarray: 0.16.2
pandas: 1.0.5
numpy: 1.19.4
scipy: 1.5.2
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.1.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: None
distributed: None
matplotlib: 3.3.4
cartopy: None
seaborn: None
numbagg: None
pint: 0.13
setuptools: 49.1.3
pip: 20.2.2
conda: None
pytest: 6.0.2
IPython: 7.18.1
sphinx: 3.2.1

@keewis
Collaborator

keewis commented Feb 13, 2021

the code that controls this is here:

xarray/xarray/core/dataset.py

Lines 4676 to 4702 in f3f0a14

for name, var in self._variables.items():
    reduce_dims = [d for d in var.dims if d in dims]
    if name in self.coords:
        if not reduce_dims:
            variables[name] = var
    else:
        if (
            not numeric_only
            or np.issubdtype(var.dtype, np.number)
            or (var.dtype == np.bool_)
        ):
            if len(reduce_dims) == 1:
                # unpack dimensions for the benefit of functions
                # like np.argmin which can't handle tuple arguments
                (reduce_dims,) = reduce_dims
            elif len(reduce_dims) == var.ndim:
                # prefer to aggregate over axis=None rather than
                # axis=(0, 1) if they will be equivalent, because
                # the former is often more efficient
                reduce_dims = None  # type: ignore
            variables[name] = var.reduce(
                func,
                dim=reduce_dims,
                keep_attrs=keep_attrs,
                keepdims=keepdims,
                **kwargs,
            )

We could easily skip processing if not reduce_dims for data variables, too. Does anyone know why only coords are skipped?
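As a hypothetical sketch (plain Python, not xarray code), the loop above with the skip extended to data variables might look like the following; variables are modeled as (dims, values) pairs restricted to 1-D data, and statistics.fmean stands in for the reduction function:

```python
from statistics import fmean

def reduce_dataset(variables, coords, dim, func=fmean):
    """Reduce 1-D (dims, values) variables over `dim`."""
    out = {}
    for name, (var_dims, values) in variables.items():
        if dim not in var_dims:
            # proposed change: variables without the reduced dimension
            # pass through untouched, whether coordinates or data
            out[name] = (var_dims, values)
        elif name not in coords:
            out[name] = ((), func(values))
        # coordinates along the reduced dimension are dropped
    return out

ds = {
    "pos": (("pos",), [1, 2, 3]),
    "var": (("pos",), [2, 3, 4]),
    "data": (("time",), [1.0, 2.0]),
}
print(reduce_dataset(ds, coords={"pos"}, dim="time"))
# {'pos': (('pos',), [1, 2, 3]), 'var': (('pos',), [2, 3, 4]), 'data': ((), 1.5)}
```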

@dschwoerer
Contributor Author

I tried this:

--- a/xarray/core/dataset.py
+++ b/xarray/core/dataset.py
@@ -4701,7 +4701,9 @@ class Dataset(Mapping, ImplementsDatasetReduce, DataWithCoords):
                 if not reduce_dims:
                     variables[name] = var
             else:
-                if (
+                if not reduce_dims:
+                    variables[name] = var
+                elif (
                     not numeric_only
                     or np.issubdtype(var.dtype, np.number)
                     or (var.dtype == np.bool_)

which works great for mean - "var" stays an integer, as expected.

However, that breaks ds.std: the standard deviation of "var" should be zero, but with this patch it is returned unchanged.
I guess that is fine for coords, as the assumption is that the calculation is not applied to coordinates, but for data variables this is probably not ok.
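A minimal illustration of that point in plain Python (no xarray): reducing along a dimension the variable does not have treats each element as a one-sample population, whose mean is the element itself but whose standard deviation is 0.

```python
from statistics import fmean, pstdev

sample = [7]                  # one element, as seen along the missing dim
print(fmean(sample))          # 7.0 -> skipping the reduction is harmless
print(pstdev(sample))         # 0.0 -> skipping would wrongly return 7
```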

--- a/xarray/core/duck_array_ops.py
+++ b/xarray/core/duck_array_ops.py
@@ -537,6 +537,11 @@ def mean(array, axis=None, skipna=None, **kwargs):
     dtypes"""
     from .common import _contains_cftime_datetimes
 
+    # The mean over an empty axis shouldn't change the data
+    # See https://github.com/pydata/xarray/issues/4885
+    if not axis:
+        return array
+
     array = asarray(array)
     if array.dtype.kind in "Mm":
         offset = _datetime_nanmin(array)

I think it is best to change mean, which would also work for DataArrays.
This implies that mean does not convert to float64 as the numpy version does, but I guess that should be fine.
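One caveat with the "if not axis" guard in the patch above (an observation about Python truthiness, not tested against xarray): the integer 0 is falsy just like None and an empty tuple, so the guard would also short-circuit a reduction over axis 0.

```python
# Truthiness of candidate axis values: axis=0 is falsy, so
# `if not axis` cannot distinguish "no axis" from "the first axis".
for axis in (None, (), 0, 1, (0, 1)):
    print(repr(axis), "-> skipped" if not axis else "-> reduced")
```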

Should I open a PR?
