
Is it unsafe to write on independent regions using the to_zarr method when one of the chunks is not complete? #9072


Closed
4 of 5 tasks
josephnowak opened this issue Jun 6, 2024 · 4 comments

Comments

@josephnowak
Contributor

josephnowak commented Jun 6, 2024

What happened?

I'm trying to update a region of my Zarr array, but it is raising the following error:

ValueError: Specified zarr chunks encoding['chunks']=(2,) for variable named '__xarray_dataarray_variable__' would overlap multiple dask chunks ((1, 1),). 
Writing this array in parallel with dask could lead to corrupted data. 
Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.

This doesn't make much sense to me, because I'm writing to independent chunks. Even though the first chunk is not being completely updated, the data should not get corrupted, because no more than one process writes to any single chunk at a time, so the write should be safe — or at least that's my understanding. Could you clarify whether this is the expected behavior, or whether I'm doing something wrong when I try to write?
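The intuition above can be sketched in plain Python. This is an illustrative helper, not part of xarray's API: a region write can only corrupt data if some Zarr chunk receives writes from more than one dask chunk, so mapping Zarr chunks to their writers makes the safety argument concrete.

```python
# Hypothetical helper illustrating the safety argument: a region write is safe
# when no Zarr chunk is written by more than one dask chunk (task).
def writers_per_zarr_chunk(region_start, dask_chunks, zarr_chunk_size):
    """Map each Zarr chunk index to the dask chunks that write into it."""
    writers = {}
    offset = region_start  # global index where the region begins
    for task, size in enumerate(dask_chunks):
        first = offset // zarr_chunk_size               # first Zarr chunk touched
        last = (offset + size - 1) // zarr_chunk_size   # last Zarr chunk touched
        for zchunk in range(first, last + 1):
            writers.setdefault(zchunk, []).append(task)
        offset += size
    return writers

# The case from this issue: region starts at index 1, dask chunks (1, 1),
# Zarr chunks of size 2.
print(writers_per_zarr_chunk(1, (1, 1), 2))
# -> {0: [0], 1: [1]}: each Zarr chunk has exactly one writer, so the write is safe.

# A genuinely unsafe case for contrast: Zarr chunk 1 gets two writers.
print(writers_per_zarr_chunk(1, (2, 2), 2))
# -> {0: [0], 1: [0, 1], 2: [1]}
```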

What did you expect to happen?

No response

Minimal Complete Verifiable Example

import xarray as xr

def clear_encoding(arr):
    arr.encoding.clear()
    for dim in arr.dims:
        arr.coords[dim].encoding.clear()
        

a = xr.DataArray([1, 2, 3], dims=["a"], coords={"a": [1, 2, 3]}).chunk({"a": 2})
b = (a * 10).isel(a=slice(1, None))
print(a.chunks, b.chunks)

clear_encoding(a)
a.to_zarr("test", mode="w")

clear_encoding(b)
# This is going to raise the following error:
# ValueError: Specified zarr chunks encoding['chunks']=(2,) for variable named '__xarray_dataarray_variable__' would overlap multiple dask chunks ((1, 1),). 
# Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
b.to_zarr("test", region={"a": slice(1, None)})

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[79], line 9
      5 clear_encoding(b)
      6 # This is going to raise the following error:
      7 # ValueError: Specified zarr chunks encoding['chunks']=(2,) for variable named '__xarray_dataarray_variable__' would overlap multiple dask chunks ((1, 1),). 
      8 # Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
----> 9 b.to_zarr("test", region={"a": slice(1, None)})

File ~/.local/lib/python3.11/site-packages/xarray/core/dataarray.py:4328, in DataArray.to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options, zarr_version)
   4324 else:
   4325     # No problems with the name - so we're fine!
   4326     dataset = self.to_dataset()
-> 4328 return to_zarr(  # type: ignore[call-overload,misc]
   4329     dataset,
   4330     store=store,
   4331     chunk_store=chunk_store,
   4332     mode=mode,
   4333     synchronizer=synchronizer,
   4334     group=group,
   4335     encoding=encoding,
   4336     compute=compute,
   4337     consolidated=consolidated,
   4338     append_dim=append_dim,
   4339     region=region,
   4340     safe_chunks=safe_chunks,
   4341     storage_options=storage_options,
   4342     zarr_version=zarr_version,
   4343 )

File ~/.local/lib/python3.11/site-packages/xarray/backends/api.py:1697, in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options, zarr_version, write_empty_chunks, chunkmanager_store_kwargs)
   1695 writer = ArrayWriter()
   1696 # TODO: figure out how to properly handle unlimited_dims
-> 1697 dump_to_store(dataset, zstore, writer, encoding=encoding)
   1698 writes = writer.sync(
   1699     compute=compute, chunkmanager_store_kwargs=chunkmanager_store_kwargs
   1700 )
   1702 if compute:

File ~/.local/lib/python3.11/site-packages/xarray/backends/api.py:1384, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1381 if encoder:
   1382     variables, attrs = encoder(variables, attrs)
-> 1384 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)

File ~/.local/lib/python3.11/site-packages/xarray/backends/zarr.py:726, in ZarrStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    723 else:
    724     variables_to_set = variables_encoded
--> 726 self.set_variables(
    727     variables_to_set, check_encoding_set, writer, unlimited_dims=unlimited_dims
    728 )
    729 if self._consolidate_on_close:
    730     zarr.consolidate_metadata(self.zarr_group.store)

File ~/.local/lib/python3.11/site-packages/xarray/backends/zarr.py:772, in ZarrStore.set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
    766     v.encoding = {}
    768 # We need to do this for both new and existing variables to ensure we're not
    769 # writing to a partial chunk, even though we don't use the `encoding` value
    770 # when writing to an existing variable. See
    771 # https://github.com/pydata/xarray/issues/8371 for details.
--> 772 encoding = extract_zarr_variable_encoding(
    773     v,
    774     raise_on_invalid=vn in check_encoding_set,
    775     name=vn,
    776     safe_chunks=self._safe_chunks,
    777 )
    779 if name in existing_keys:
    780     # existing variable
    781     # TODO: if mode="a", consider overriding the existing variable
    782     # metadata. This would need some case work properly with region
    783     # and append_dim.
    784     if self._write_empty is not None:
    785         # Write to zarr_group.chunk_store instead of zarr_group.store
    786         # See https://github.com/pydata/xarray/pull/8326#discussion_r1365311316 for a longer explanation
   (...)
    798         #     - The size of dimensions can not be expanded, that would require a call using `append_dim`
    799         #        which is mutually exclusive with `region`

File ~/.local/lib/python3.11/site-packages/xarray/backends/zarr.py:285, in extract_zarr_variable_encoding(variable, raise_on_invalid, name, safe_chunks)
    282         if k not in valid_encodings:
    283             del encoding[k]
--> 285 chunks = _determine_zarr_chunks(
    286     encoding.get("chunks"), variable.chunks, variable.ndim, name, safe_chunks
    287 )
    288 encoding["chunks"] = chunks
    289 return encoding

File ~/.local/lib/python3.11/site-packages/xarray/backends/zarr.py:199, in _determine_zarr_chunks(enc_chunks, var_chunks, ndim, name, safe_chunks)
    193                 base_error = (
    194                     f"Specified zarr chunks encoding['chunks']={enc_chunks_tuple!r} for "
    195                     f"variable named {name!r} would overlap multiple dask chunks {var_chunks!r}. "
    196                     f"Writing this array in parallel with dask could lead to corrupted data."
    197                 )
    198                 if safe_chunks:
--> 199                     raise ValueError(
    200                         base_error
    201                         + " Consider either rechunking using `chunk()`, deleting "
    202                         "or modifying `encoding['chunks']`, or specify `safe_chunks=False`."
    203                     )
    204     return enc_chunks_tuple
    206 raise AssertionError("We should never get here. Function logic must be wrong.")

ValueError: Specified zarr chunks encoding['chunks']=(2,) for variable named '__xarray_dataarray_variable__' would overlap multiple dask chunks ((1, 1),). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.8 (main, Feb 25 2024, 16:39:33) [GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 6.5.0-1014-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.5.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.11.4
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: 2.18.1
cftime: None
nc_time_axis: None
iris: None
bottleneck: 1.3.8
dask: 2024.4.2
distributed: 2024.4.2
matplotlib: 3.8.3
cartopy: None
seaborn: None
numbagg: 0.8.1
fsspec: 2024.5.0
cupy: None
pint: None
sparse: 0.15.4
flox: 0.9.6
numpy_groupies: 0.10.2
setuptools: 69.1.1
pip: 22.0.2
conda: None
pytest: 8.2.2
mypy: None
IPython: 8.22.2
sphinx: 7.2.6

@josephnowak josephnowak added bug needs triage Issue that has not been reviewed by xarray team member labels Jun 6, 2024
@josephnowak josephnowak changed the title Is it unsafe to write on independent regions using the to_zarr method when one of them is not complete? Is it unsafe to write on independent regions using the to_zarr method when one of chunks is not complete? Jun 6, 2024
@dcherian
Contributor

dcherian commented Jun 6, 2024

This check doesn't look at the region, so your write should be safe, and you can opt out of the check with `safe_chunks=False`.
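Concretely, the region offset is what makes this case safe: after shifting by the region start, every boundary between consecutive dask chunks lands on a Zarr chunk boundary, so no Zarr chunk has two writers. A minimal, hypothetical predicate (not xarray code) capturing that condition:

```python
def region_write_is_safe(region_start, dask_chunks, zarr_chunk_size):
    """True if no Zarr chunk would be written by two different dask chunks.

    Equivalently: every interior boundary between consecutive dask chunks,
    shifted by the region start, must fall on a Zarr chunk boundary.
    """
    boundary = region_start
    for size in dask_chunks[:-1]:  # interior boundaries only; edges may be partial
        boundary += size
        if boundary % zarr_chunk_size != 0:
            return False
    return True

# The issue's example: region starts at index 1, dask chunks (1, 1), Zarr chunks of 2.
print(region_write_is_safe(1, (1, 1), 2))  # True: the split at global index 2 is a chunk boundary
print(region_write_is_safe(0, (1, 1), 2))  # False: the split at index 1 falls inside Zarr chunk 0
```

Because xarray's check compares `encoding['chunks']` against the dask chunks without this offset, it conservatively rejects the aligned case above.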

@josephnowak
Contributor Author

josephnowak commented Jun 6, 2024

If it is safe to write in the way I specified in the code, then I'm going to use the `safe_chunks=False` option you mentioned. Thanks.

@dcherian
Contributor

dcherian commented Jun 6, 2024

(I think it is, but you should double-check.)

@dcherian dcherian closed this as completed Jun 6, 2024
@dcherian dcherian added usage question and removed bug needs triage Issue that has not been reviewed by xarray team member labels Jun 6, 2024
@josephnowak
Contributor Author

In case it helps, I wrote the following code to corroborate that there is no data corruption when writing in the way I described above.

import xarray as xr
import numpy as np

def clear_encoding(arr):
    arr.encoding.clear()
    for dim in arr.dims:
        arr.coords[dim].encoding.clear()

elements = list(range(1, 77))


for chunk in [2, 7, 10, 16]:
    print("chunk", chunk)
    a = xr.DataArray(elements, dims=["a"], coords={"a": elements}).chunk({"a": chunk})
    clear_encoding(a)
    a.to_zarr("test", mode="w")
    
    expected = a.copy()

    for start in [5, 8, 7, 9, 20]:
        print("start", start)
        for end in [22, 30, 45, 76]:
            print("end", end)
            for i in range(6):
                range_slice = slice(start, end)
                b = (a * 10).isel(a=range_slice)
                expected.loc[b.a[0]: b.a[-1]] = b
                
                clear_encoding(b)
                b.to_zarr("test", region={"a": range_slice}, safe_chunks=False)
                c = xr.open_zarr("test")["__xarray_dataarray_variable__"].compute()
                try:
                    assert expected.equals(c)
                except Exception as e:
                    print(start, end, chunk, i)
                    print(c)
                    print(expected)
                    raise e
