Skip to content

Dask error on xarray.corr #5715

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Gijom opened this issue Aug 19, 2021 · 5 comments · Fixed by #5731
Closed

Dask error on xarray.corr #5715

Gijom opened this issue Aug 19, 2021 · 5 comments · Fixed by #5731
Labels

Comments

@Gijom
Copy link
Contributor

Gijom commented Aug 19, 2021

What happened:
When I use xarray.corr on two Dataarrays I get a NameError: name 'dask' is not defined error. Notice that dask is not installed in my environement.

What you expected to happen:
Obtain the correlation values without dask interfering (as it should be optional in my understanding)

Minimal Complete Verifiable Example:

N = 100
ds = xr.Dataset(
    data_vars={
        'x': ('t', np.random.randn(N)),
        'y': ('t', np.random.randn(N))
    },
    coords={
        't': range(N)
    }
)
xr.corr(ds['y'], ds['x'])

Results in:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipykernel_732567/1992585666.py in <module>
----> 1 xr.corr(ds['y'], ds['x'])

~/.local/share/virtualenvs/e-sport-ml-IJ_mJ64l/lib/python3.8/site-packages/xarray/core/computation.py in corr(da_a, da_b, dim)
   1343         )
   1344 
-> 1345     return _cov_corr(da_a, da_b, dim=dim, method="corr")
   1346 
   1347 

~/.local/share/virtualenvs/e-sport-ml-IJ_mJ64l/lib/python3.8/site-packages/xarray/core/computation.py in _cov_corr(da_a, da_b, dim, ddof, method)
   1371             return da
   1372 
-> 1373     da_a = da_a.map_blocks(_get_valid_values, args=[da_b])
   1374     da_b = da_b.map_blocks(_get_valid_values, args=[da_a])
   1375 

~/.local/share/virtualenvs/e-sport-ml-IJ_mJ64l/lib/python3.8/site-packages/xarray/core/dataarray.py in map_blocks(self, func, args, kwargs, template)
   3811         from .parallel import map_blocks
   3812 
-> 3813         return map_blocks(func, self, args, kwargs, template)
   3814 
   3815     def polyfit(

~/.local/share/virtualenvs/e-sport-ml-IJ_mJ64l/lib/python3.8/site-packages/xarray/core/parallel.py in map_blocks(func, obj, args, kwargs, template)
    332             )
    333 
--> 334     if not dask.is_dask_collection(obj):
    335         return func(obj, *args, **kwargs)
    336 

NameError: name 'dask' is not defined

Environment:

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.6 (default, Dec 16 2020, 11:33:05) 
[GCC 10.2.0]
python-bits: 64
OS: Linux
OS-release: 5.13.6-arch1-1
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: None

xarray: 0.19.0
pandas: 1.3.1
numpy: 1.21.1
scipy: 1.7.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.3.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.4.2
cartopy: None
seaborn: 0.11.1
numbagg: None
pint: None
setuptools: 51.0.0
pip: 20.3.1
conda: None
pytest: None
IPython: 7.26.0
sphinx: None
@max-sixty
Copy link
Collaborator

Thanks @Gijom , I can repro.

I think the fix should be fairly easy, if someone wants to take a swing. I'm not sure why the existing tests don't cover it.

@Gijom
Copy link
Contributor Author

Gijom commented Aug 20, 2021

The responsible code for the error originally comes from the call to da_a = da_a.map_blocks(_get_valid_values, args=[da_b]), which aim is to remove nan values from both DataArrays. I am confused by this given that the code lines below seems to accumplish something similar (despite of the comment saying it should not):

# 4. Compute covariance along the given dim
# N.B. `skipna=False` is required or there is a bug when computing
# auto-covariance. E.g. Try xr.cov(da,da) for
# da = xr.DataArray([[1, 2], [1, np.nan]], dims=["x", "time"])
cov = (demeaned_da_a * demeaned_da_b).sum(dim=dim, skipna=True, min_count=1) / (
    valid_count
)

In any case, the parrallel module imports dask in a try catch block to ignore the import error. So this is not a surprise that when using dask latter there is an error if it was not imported. I can see two possibilities:

  • encapsulate all dask calls in a similar try/catch block
  • set a boolean in the first place and do the tests only if dask is correctly imported

Now I do not have any big picure there so there are probably better solutions.

@Gijom
Copy link
Contributor Author

Gijom commented Aug 20, 2021

I had a look to it this morning and I think I managed to solve the issue by replacing the calls to dask.is_dask_collection by is_duck_dask_array from the pycompat module.

For (successful) testing I used the same code as above plus the following:

ds_dask = ds.chunk({"t": 10})

yy = xr.corr(ds['y'], ds['y']).to_numpy()
yy_dask = xr.corr(ds_dask['y'], ds_dask['y']).to_numpy()
yx = xr.corr(ds['y'], ds['x']).to_numpy()
yx_dask = xr.corr(ds_dask['y'], ds_dask['x']).to_numpy()
np.testing.assert_allclose(yy, yy_dask), "YY: {} is different from {}".format(yy, yy_dask)
np.testing.assert_allclose(yx, yx_dask), "YX: {} is different from {}".format(yx, yx_dask)

The results are not exactly identical but almost which is probably due to numerical approximations of multiple computations in the dask case.

I also tested the correlation of simple DataArrays without dask installed and the result seem coherent (close to 0 for uncorrelated data and very close to 1 when correlating identical variables).

Should I make a pull request ? Should I implement this test ? Any others ?

@max-sixty
Copy link
Collaborator

That sounds great @Gijom ! Thanks for working through that. A PR would be welcome!

In the tests, we should be running this outside a @requires_dask decorated test, so that this gets tested without dask. If you can find that, great, otherwise someone can help.

@dcherian
Copy link
Contributor

Thanks @Gijom this is a dupe of #3391

Please send your changes in a pull request, we'd be happy to merge it after reviewing. Thank you!

Gijom added a commit to Gijom/xarray that referenced this issue Aug 23, 2021
Calls to dask functions are replaced by calls to the
pycompat functions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants