Combine_by_coords not working on named DataArrays where the data is a Dask Array. #5833

Closed
anlavandier opened this issue Sep 30, 2021 · 3 comments · Fixed by #5834
@anlavandier

What happened:
xr.combine_by_coords failed (only when the arrays are named)
What you expected to happen:
xr.combine_by_coords to work as intended.
Minimal Complete Verifiable Example:

import xarray as xr
import dask.array as da
import numpy as np


coords = [("x", np.arange(200)),("y", np.arange(1000)),("z", np.arange(1000))]

DataArray_list = []

n = 1  # number of DataArrays to combine; the failure differs for n == 1 and n >= 2

for i in range(n):
    test_data = da.random.random((1,200,1000,1000))
    coords_i = [("time",[i])] + coords 
    data_i = xr.DataArray(test_data,coords = coords_i)
    #data_i.name = None
    
    DataArray_list.append(data_i)

print(*DataArray_list,sep = '\n\n')

Combined = xr.Dataset()
Combined["test"] = xr.combine_by_coords(DataArray_list)

When n == 1:

runcell(0, '/home/alavandier/bug_combine_by_coords.py')
<xarray.DataArray 'random_sample-4545ef044176a8b440a43599b310e9c1' (time: 1, x: 200, y: 1000, z: 1000)>
dask.array<random_sample, shape=(1, 200, 1000, 1000), dtype=float64, chunksize=(1, 200, 250, 250), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) int64 0
  * x        (x) int64 0 1 2 3 4 5 6 7 8 ... 191 192 193 194 195 196 197 198 199
  * y        (y) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999
  * z        (z) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999
/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/dask/array/core.py:383: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  o = func(*args, **kwargs)
Traceback (most recent call last):

  File "/home/alavandier/bug_combine_by_coords.py", line 21, in <module>
    Combined["test"] = xr.combine_by_coords(DataArray_list)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/dataset.py", line 1563, in __setitem__
    self.update({key: value})

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/dataset.py", line 4208, in update
    merge_result = dataset_update_method(self, other)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/merge.py", line 984, in dataset_update_method
    return merge_core(

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/merge.py", line 632, in merge_core
    collected = collect_variables_and_indexes(aligned)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/merge.py", line 294, in collect_variables_and_indexes
    variable = as_variable(variable, name=name)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/variable.py", line 141, in as_variable
    data = as_compatible_data(obj)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/variable.py", line 238, in as_compatible_data
    data = np.asarray(data)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/dataset.py", line 1461, in __array__
    raise TypeError(

TypeError: cannot directly convert an xarray.Dataset into a numpy array. Instead, create an xarray.DataArray first, either with indexing on the Dataset or by invoking the `to_array()` method.

When n>=2:

runcell(0, '/home/alavandier/bug_combine_by_coords.py')
<xarray.DataArray 'random_sample-8a3680be28e920d13cc66464a1ef1669' (time: 1, x: 200, y: 1000, z: 1000)>
dask.array<random_sample, shape=(1, 200, 1000, 1000), dtype=float64, chunksize=(1, 200, 250, 250), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) int64 0
  * x        (x) int64 0 1 2 3 4 5 6 7 8 ... 191 192 193 194 195 196 197 198 199
  * y        (y) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999
  * z        (z) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999

<xarray.DataArray 'random_sample-991bff72d4c572ef8bd3a9f08308cc19' (time: 1, x: 200, y: 1000, z: 1000)>
dask.array<random_sample, shape=(1, 200, 1000, 1000), dtype=float64, chunksize=(1, 200, 250, 250), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) int64 1
  * x        (x) int64 0 1 2 3 4 5 6 7 8 ... 191 192 193 194 195 196 197 198 199
  * y        (y) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999
  * z        (z) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999
Traceback (most recent call last):

  File "/home/alavandier/bug_combine_by_coords.py", line 21, in <module>
    Combined["test"] = xr.combine_by_coords(DataArray_list)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/combine.py", line 891, in combine_by_coords
    sorted_datasets = sorted(data_objects, key=vars_as_keys)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/common.py", line 129, in __bool__
    return bool(self.values)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Anything else we need to know?:

  • Uncommenting the line data_i.name = None fixes everything.
  • Manually interrupting xr.combine_by_coords when n == 2, before it fails on its own, shows that it actually computes the dask arrays, which is also a problem. Here's an example to show that.
runcell(0, '/home/alavandier/bug_combine_by_coords.py')
<xarray.DataArray 'random_sample-85eafb2cca5305a2d75153f0df7aca91' (time: 1, x: 200, y: 1000, z: 1000)>
dask.array<random_sample, shape=(1, 200, 1000, 1000), dtype=float64, chunksize=(1, 200, 250, 250), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) int64 0
  * x        (x) int64 0 1 2 3 4 5 6 7 8 ... 191 192 193 194 195 196 197 198 199
  * y        (y) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999
  * z        (z) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999

<xarray.DataArray 'random_sample-e4ed3ea4a1d6918599ccba99f02e2d9e' (time: 1, x: 200, y: 1000, z: 1000)>
dask.array<random_sample, shape=(1, 200, 1000, 1000), dtype=float64, chunksize=(1, 200, 250, 250), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) int64 1
  * x        (x) int64 0 1 2 3 4 5 6 7 8 ... 191 192 193 194 195 196 197 198 199
  * y        (y) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999
  * z        (z) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999
Traceback (most recent call last):

  File "/home/alavandier/bug_combine_by_coords.py", line 21, in <module>
    Combined["test"] = xr.combine_by_coords(DataArray_list)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/combine.py", line 891, in combine_by_coords
    sorted_datasets = sorted(data_objects, key=vars_as_keys)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/common.py", line 129, in __bool__
    return bool(self.values)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/dataarray.py", line 651, in values
    return self.variable.values

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/variable.py", line 517, in values
    return _as_array_or_item(self._data)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/xarray/core/variable.py", line 259, in _as_array_or_item
    data = np.asarray(data)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/dask/array/core.py", line 1476, in __array__
    x = self.compute()

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/dask/base.py", line 285, in compute
    (result,) = compute(self, traverse=False, **kwargs)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/dask/base.py", line 567, in compute
    results = schedule(dsk, keys, **kwargs)

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/dask/threaded.py", line 79, in get
    results = get_async(

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/dask/local.py", line 503, in get_async
    for key, res_info, failed in queue_get(queue).result():

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/site-packages/dask/local.py", line 134, in queue_get
    return q.get()

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/queue.py", line 171, in get
    self.not_empty.wait()

  File "/home/alavandier/anaconda3/envs/dask_env/lib/python3.9/threading.py", line 312, in wait
    waiter.acquire()

KeyboardInterrupt

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-88-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.6.1

xarray: 0.19.0
pandas: 1.3.2
numpy: 1.20.3
scipy: 1.6.2
netCDF4: 1.5.7
pydap: None
h5netcdf: None
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: 3.3.4
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 58.0.4
pip: 21.2.4
conda: None
pytest: None
IPython: 7.27.0
sphinx: 4.2.0

@TomNicholas
Member

TomNicholas commented Sep 30, 2021

Thanks for raising this @anlavandier! And thanks especially for the clear reproducible example. You've brought up 3 specific issues, so in turn:

  1. TypeError for n=1

This one is actually not a bug in combine, it's merely a slightly unclear (but valid) error message in Dataset.__setitem__.

If you change your example script to create the result before attempting to add it to the Dataset, i.e.

result = xr.combine_by_coords(DataArray_list)

print(result)

combined = xr.Dataset()
combined["test"] = result

you can see that combine_by_coords actually works fine in this case:

<xarray.DataArray (time: 1, x: 200, y: 1000, z: 1000)>
dask.array<random_sample, shape=(1, 200, 1000, 1000), dtype=float64, chunksize=(1, 200, 250, 250), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) int64 0
  * x        (x) int64 0 1 2 3 4 5 6 7 8 ... 191 192 193 194 195 196 197 198 199
  * y        (y) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999
  * z        (z) int64 0 1 2 3 4 5 6 7 8 ... 991 992 993 994 995 996 997 998 999

What is happening, however, is that combining unnamed DataArrays returns a DataArray, whereas combining named DataArrays returns a Dataset. That combine_by_coords can return a DataArray at all is not obvious, because the docstring needs updating (I changed it in #5519, but that has not been merged yet).

When you try to assign this object to your empty Dataset, it will happily assign a DataArray but will fail when trying to assign a Dataset as a variable of a Dataset (as it should). That's why your script behaves differently for named vs unnamed dataarrays.

As an aside, the error message from Dataset.__setitem__ should be clearer: it should check for an xarray.Dataset and raise immediately, instead of trying to coerce it into a numpy array and raising when that fails. I'll make another PR for this too.
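A minimal sketch of the assignment behaviour described above, using a small NumPy-backed array instead of the original dask data. The variable names (`arr`, `as_dataset`, `ds`) are illustrative only, and the exact wording of the TypeError may differ between xarray versions:

```python
import numpy as np
import xarray as xr

# A small named DataArray standing in for the ones in the report.
arr = xr.DataArray(
    np.zeros((1, 3)),
    coords=[("time", [0]), ("x", np.arange(3))],
    name="test",
)

ds = xr.Dataset()
ds["ok"] = arr  # assigning a DataArray under a single key works

# A Dataset cannot be stored under a single key of another Dataset.
as_dataset = arr.to_dataset()
try:
    ds["bad"] = as_dataset
    error = None
except TypeError as exc:
    error = exc

# Pull the variable out of the Dataset first, then assign it.
ds["fixed"] = as_dataset["test"]
print(type(error).__name__, sorted(ds.data_vars))
```

If combine_by_coords hands back a Dataset, indexing it by variable name before the assignment sidesteps the TypeError.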

  2. ValueError for n=2

This is a real bug, introduced in #4696. It happens because the combine internals assume everything is a Dataset, but there was no check to promote named DataArrays to single-variable Datasets. Luckily the fix is simple, and I've implemented it along with a minor refactor in #5834.

  3. Dask compute triggered

This does happen with n=2, but I think the fix for (2) also fixes this.
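The tracebacks in the report show `__bool__` calling `self.values`, which in turn calls dask's `compute()`. The sketch below reproduces that chain in isolation, on the assumption that taking the truth value of a DataArray still goes through `.values`. The helper `make_chunk` and the `calls` list are hypothetical and exist only to record when the lazy data is actually evaluated:

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

calls = []  # filled in each time the lazy chunk is really computed


def make_chunk():
    calls.append("computed")
    return np.array([1.0, 2.0])


# Wrap the delayed function in a dask array; nothing is evaluated here.
lazy = da.from_delayed(dask.delayed(make_chunk)(), shape=(2,), dtype=float)
arr = xr.DataArray(lazy, dims=["x"], name="test")
computed_before = len(calls)

# Taking the truth value loads the data and then fails, because an array
# with more than one element has no unambiguous truth value.
try:
    bool(arr)
    error = None
except ValueError as exc:
    error = exc

print(computed_before, len(calls), type(error).__name__)
```

Any code path that compares or sorts multi-element DataArrays directly would hit both effects at once: an unwanted compute followed by the ValueError.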

Thanks again for raising this! Let me know if you think #5834 hasn't fully fixed your problem :)

@anlavandier
Author

Thank you for your response.
Indeed, promoting the DataArrays to single-variable Datasets before calling xr.combine_by_coords fixes the ValueError and no longer triggers computation.

I don't know if we should wait for #5834 to be merged to close the issue or do it already but as far as I'm concerned it's fixed.
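For readers on an affected version, the manual promotion mentioned above might look like the following sketch. It uses small zero-filled dask arrays rather than the original 200 x 1000 x 1000 random data, and the names `arrays`, `datasets` and `still_lazy` are illustrative, not part of xarray:

```python
import dask.array as da
import numpy as np
import xarray as xr

coords = [("x", np.arange(3)), ("y", np.arange(4))]

# Two small named, dask-backed DataArrays at different time steps.
arrays = []
for i in range(2):
    data = da.zeros((1, 3, 4), chunks=(1, 3, 2))
    arrays.append(
        xr.DataArray(data, coords=[("time", [i])] + coords, name="test")
    )

# Promote each named DataArray to a single-variable Dataset by hand.
datasets = [a.to_dataset() for a in arrays]
combined = xr.combine_by_coords(datasets)

# The combined variable should still be backed by a lazy dask array.
still_lazy = isinstance(combined["test"].data, da.Array)
print(combined["test"].shape, still_lazy)
```

The result is a Dataset, so the variable is retrieved with `combined["test"]` rather than assigned as a whole to another Dataset key.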

@TomNicholas
Member

Great.

I don't know if we should wait for #5834 to be merged to close the issue or do it already but as far as I'm concerned it's fixed.

I'll close this once #5834 is merged. (But we typically wait for someone else to review any code before merging.)
