Dataset.mean changes variables without specified dimension #4885


Closed
dschwoerer opened this issue Feb 10, 2021 · 2 comments · Fixed by #5207

Comments

@dschwoerer
Contributor

What happened:
If I apply mean(dim="time") to a dataset, variables without that dimension are changed.

What you expected to happen:
Variables without the dimension are not changed.

Minimal Complete Verifiable Example:

import xarray as xr

ds = xr.Dataset()
ds["pos"] = [1, 2, 3]
ds["data"] = ("pos", "time"), [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ds["var"] = "pos", [2, 3, 4]
print(ds.mean(dim="time"))
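For illustration, here is a pure-Python sketch (using statistics.fmean as a stand-in for the reduction; not xarray code) of the dtype side effect reported above: even a mean over a single sample returns a float, so an integer variable comes back changed even when there is nothing to reduce over.

```python
from statistics import fmean

vals = [2, 3, 4]                      # like the integer "var" above
print(fmean(vals))                    # 3.0 -> a float
print([fmean([v]) for v in vals])     # [2.0, 3.0, 4.0] -> ints became floats
```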

Anything else we need to know?:
This also makes the operation unnecessarily slow, as variables without that dimension would not need to be read from disk.
It is easy enough to work around:

ds2 = ds.copy()
for k in ds:
    if "time" in ds[k].dims:
        ds2[k] = ds[k].mean(dim="time")

However, I cannot see why Dataset.mean should change variables that lack the specified dim.

Environment:

Output of xr.show_versions():

INSTALLED VERSIONS
------------------

commit: None
python: 3.9.1 (default, Jan 20 2021, 00:00:00)
[GCC 10.2.1 20201125 (Red Hat 10.2.1-9)]
python-bits: 64
OS: Linux
OS-release: 5.10.13-200.fc33.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.3

xarray: 0.16.2
pandas: 1.0.5
numpy: 1.19.4
scipy: 1.5.2
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.1.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: None
distributed: None
matplotlib: 3.3.4
cartopy: None
seaborn: None
numbagg: None
pint: 0.13
setuptools: 49.1.3
pip: 20.2.2
conda: None
pytest: 6.0.2
IPython: 7.18.1
sphinx: 3.2.1

@keewis
Collaborator

keewis commented Feb 13, 2021

the code that controls this is here:

xarray/xarray/core/dataset.py

Lines 4676 to 4702 in f3f0a14

for name, var in self._variables.items():
    reduce_dims = [d for d in var.dims if d in dims]
    if name in self.coords:
        if not reduce_dims:
            variables[name] = var
    else:
        if (
            not numeric_only
            or np.issubdtype(var.dtype, np.number)
            or (var.dtype == np.bool_)
        ):
            if len(reduce_dims) == 1:
                # unpack dimensions for the benefit of functions
                # like np.argmin which can't handle tuple arguments
                (reduce_dims,) = reduce_dims
            elif len(reduce_dims) == var.ndim:
                # prefer to aggregate over axis=None rather than
                # axis=(0, 1) if they will be equivalent, because
                # the former is often more efficient
                reduce_dims = None  # type: ignore
            variables[name] = var.reduce(
                func,
                dim=reduce_dims,
                keep_attrs=keep_attrs,
                keepdims=keepdims,
                **kwargs,
            )

We could easily skip processing if not reduce_dims for data variables, too. Does anyone know why only coords are skipped?
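As a hypothetical sketch (plain Python, not xarray code), the loop above with the skip extended to data variables might look like the following; variables are modeled as (dims, values) pairs restricted to 1-D data, and statistics.fmean stands in for the reduction function:

```python
from statistics import fmean

def reduce_dataset(variables, coords, dim, func=fmean):
    """Reduce 1-D (dims, values) variables over `dim`."""
    out = {}
    for name, (var_dims, values) in variables.items():
        if dim not in var_dims:
            # proposed change: variables without the reduced dimension
            # pass through untouched, whether coordinates or data
            out[name] = (var_dims, values)
        elif name not in coords:
            out[name] = ((), func(values))
        # coordinates along the reduced dimension are dropped
    return out

ds = {
    "pos": (("pos",), [1, 2, 3]),
    "var": (("pos",), [2, 3, 4]),
    "data": (("time",), [1.0, 2.0]),
}
print(reduce_dataset(ds, coords={"pos"}, dim="time"))
# {'pos': (('pos',), [1, 2, 3]), 'var': (('pos',), [2, 3, 4]), 'data': ((), 1.5)}
```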

@dschwoerer
Contributor Author

I tried this:

--- a/xarray/core/dataset.py
+++ b/xarray/core/dataset.py
@@ -4701,7 +4701,9 @@ class Dataset(Mapping, ImplementsDatasetReduce, DataWithCoords):
                 if not reduce_dims:
                     variables[name] = var
             else:
-                if (
+                if not reduce_dims:
+                    variables[name] = var
+                elif (
                     not numeric_only
                     or np.issubdtype(var.dtype, np.number)
                     or (var.dtype == np.bool_)

which works great for mean - "var" stays an integer, as expected.

However, that breaks ds.std: the standard deviation of "var" should be zero, but with this patch it is returned unchanged.
I guess that is fine for coords, as the assumption is that the calculation is not applied to coordinates, but for data variables this is probably not ok.
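A minimal illustration of that point in plain Python (no xarray): reducing along a dimension the variable does not have treats each element as a one-sample population, whose mean is the element itself but whose standard deviation is 0.

```python
from statistics import fmean, pstdev

sample = [7]                  # one element, as seen along the missing dim
print(fmean(sample))          # 7.0 -> skipping the reduction is harmless
print(pstdev(sample))         # 0.0 -> skipping would wrongly return 7
```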

--- a/xarray/core/duck_array_ops.py
+++ b/xarray/core/duck_array_ops.py
@@ -537,6 +537,11 @@ def mean(array, axis=None, skipna=None, **kwargs):
     dtypes"""
     from .common import _contains_cftime_datetimes
 
+    # The mean over an empty axis shouldn't change the data
+    # See https://github.com/pydata/xarray/issues/4885
+    if not axis:
+        return array
+
     array = asarray(array)
     if array.dtype.kind in "Mm":
         offset = _datetime_nanmin(array)

I think it is best to change mean, which would also work for DataArrays.
This implies that mean does not convert to float64 as the numpy version does, but I guess that should be fine.
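One caveat with the "if not axis" guard in the patch above (an observation about Python truthiness, not tested against xarray): the integer 0 is falsy just like None and an empty tuple, so the guard would also short-circuit a reduction over axis 0.

```python
# Truthiness of candidate axis values: axis=0 is falsy, so
# `if not axis` cannot distinguish "no axis" from "the first axis".
for axis in (None, (), 0, 1, (0, 1)):
    print(repr(axis), "-> skipped" if not axis else "-> reduced")
```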

Should I open a PR?
