Commit deb2082

Improve zarr chunks docs (#9140)

* Improve zarr chunks docs. Makes them more structured and consistent. I think it also removes a mistake re the default `chunks` arg in `open_zarr` (it's not `None`, it's `'auto'`). Adds a comment re performance with `chunks=None`, closing #9111.

1 parent 2645d7f commit deb2082

3 files changed: +40 −23 lines changed

doc/whats-new.rst (+2)

@@ -40,6 +40,8 @@ Bug fixes
 Documentation
 ~~~~~~~~~~~~~
 
+- Improvements to Zarr & chunking docs (:pull:`9139`, :pull:`9140`, :pull:`9132`)
+  By `Maximilian Roos <https://github.com/max-sixty>`_
 
 Internal Changes
 ~~~~~~~~~~~~~~~~

xarray/backends/api.py (+26 −17)

@@ -425,15 +425,19 @@ def open_dataset(
         is chosen based on available dependencies, with a preference for
         "netcdf4". A custom backend class (a subclass of ``BackendEntrypoint``)
         can also be used.
-    chunks : int, dict, 'auto' or None, optional
-        If chunks is provided, it is used to load the new dataset into dask
-        arrays. ``chunks=-1`` loads the dataset with dask using a single
-        chunk for all arrays. ``chunks={}`` loads the dataset with dask using
-        engine preferred chunks if exposed by the backend, otherwise with
-        a single chunk for all arrays. In order to reproduce the default behavior
-        of ``xr.open_zarr(...)`` use ``xr.open_dataset(..., engine='zarr', chunks={})``.
-        ``chunks='auto'`` will use dask ``auto`` chunking taking into account the
-        engine preferred chunks. See dask chunking for more details.
+    chunks : int, dict, 'auto' or None, default: None
+        If provided, used to load the data into dask arrays.
+
+        - ``chunks="auto"`` will use dask ``auto`` chunking taking into account the
+          engine preferred chunks.
+        - ``chunks=None`` skips using dask, which is generally faster for
+          small arrays.
+        - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
+        - ``chunks={}`` loads the data with dask using the engine's preferred chunk
+          size, generally identical to the format's chunk size. If not available, a
+          single chunk for all arrays.
+
+        See dask chunking for more details.
     cache : bool, optional
         If True, cache data loaded from the underlying datastore in memory as
         NumPy arrays when accessed to avoid reading from the underlying data-

@@ -631,14 +635,19 @@ def open_dataarray(
         Engine to use when reading files. If not provided, the default engine
         is chosen based on available dependencies, with a preference for
         "netcdf4".
-    chunks : int, dict, 'auto' or None, optional
-        If chunks is provided, it is used to load the new dataset into dask
-        arrays. ``chunks=-1`` loads the dataset with dask using a single
-        chunk for all arrays. `chunks={}`` loads the dataset with dask using
-        engine preferred chunks if exposed by the backend, otherwise with
-        a single chunk for all arrays.
-        ``chunks='auto'`` will use dask ``auto`` chunking taking into account the
-        engine preferred chunks. See dask chunking for more details.
+    chunks : int, dict, 'auto' or None, default: None
+        If provided, used to load the data into dask arrays.
+
+        - ``chunks='auto'`` will use dask ``auto`` chunking taking into account the
+          engine preferred chunks.
+        - ``chunks=None`` skips using dask, which is generally faster for
+          small arrays.
+        - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
+        - ``chunks={}`` loads the data with dask using engine preferred chunks if
+          exposed by the backend, otherwise with a single chunk for all arrays.
+
+        See dask chunking for more details.
+
     cache : bool, optional
         If True, cache data loaded from the underlying datastore in memory as
         NumPy arrays when accessed to avoid reading from the underlying data-

xarray/backends/zarr.py (+12 −6)

@@ -973,12 +973,18 @@ def open_zarr(
         Array synchronizer provided to zarr
     group : str, optional
         Group path. (a.k.a. `path` in zarr terminology.)
-    chunks : int or dict or tuple or {None, 'auto'}, optional
-        Chunk sizes along each dimension, e.g., ``5`` or
-        ``{'x': 5, 'y': 5}``. If `chunks='auto'`, dask chunks are created
-        based on the variable's zarr chunks. If `chunks=None`, zarr array
-        data will lazily convert to numpy arrays upon access. This accepts
-        all the chunk specifications as Dask does.
+    chunks : int, dict, 'auto' or None, default: 'auto'
+        If provided, used to load the data into dask arrays.
+
+        - ``chunks='auto'`` will use dask ``auto`` chunking taking into account the
+          engine preferred chunks.
+        - ``chunks=None`` skips using dask, which is generally faster for
+          small arrays.
+        - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
+        - ``chunks={}`` loads the data with dask using engine preferred chunks if
+          exposed by the backend, otherwise with a single chunk for all arrays.
+
+        See dask chunking for more details.
     overwrite_encoded_chunks : bool, optional
         Whether to drop the zarr chunks encoded for each variable when a
         dataset is loaded with specified chunk sizes (default: False)
