From 5aae50beddb5f491610b5612d5d53c7e1e0564ad Mon Sep 17 00:00:00 2001
From: Deepak Cherian
Date: Tue, 14 Nov 2023 12:21:46 -0700
Subject: [PATCH 1/2] [skip-ci] Small updates to IO docs.

---
 doc/user-guide/io.rst | 59 +++++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 20 deletions(-)

diff --git a/doc/user-guide/io.rst b/doc/user-guide/io.rst
index 1aeb393f3af..2155ecfd88b 100644
--- a/doc/user-guide/io.rst
+++ b/doc/user-guide/io.rst
@@ -44,9 +44,9 @@ __ https://www.unidata.ucar.edu/software/netcdf/

 .. _netCDF FAQ: https://www.unidata.ucar.edu/software/netcdf/docs/faq.html#What-Is-netCDF

-Reading and writing netCDF files with xarray requires scipy or the
-`netCDF4-Python`__ library to be installed (the latter is required to
-read/write netCDF V4 files and use the compression options described below).
+Reading and writing netCDF files with xarray requires scipy, h5netcdf, or the
+`netCDF4-Python`__ library to be installed. SciPy only supports reading and writing
+of netCDF V3 files.

 __ https://github.com/Unidata/netcdf4-python

@@ -675,8 +675,8 @@ the same as the one that was saved.

 .. note::

-    xarray does not write NCZarr attributes. Therefore, NCZarr data must be
-    opened in read-only mode.
+    xarray does not write `NCZarr <https://docs.unidata.ucar.edu/nug/current/nczarr_head.html>`_ attributes.
+    Therefore, NCZarr data must be opened in read-only mode.

 To store variable length strings, convert them to object arrays first
 with ``dtype=object``.
@@ -696,10 +696,10 @@ It is possible to read and write xarray datasets directly from / to cloud
 storage buckets using zarr. This example uses the `gcsfs`_ package to provide
 an interface to `Google Cloud Storage`_.

-From v0.16.2: general `fsspec`_ URLs are parsed and the store set up for you
-automatically when reading, such that you can open a dataset in a single
-call. You should include any arguments to the storage backend as the
-key ``storage_options``, part of ``backend_kwargs``.
+General `fsspec`_ URLs, for example those that begin with ``s3://`` or ``gcs://``,
+are parsed and the store set up for you automatically when reading.
+You should include any arguments to the storage backend as the
+key ``storage_options``, part of ``backend_kwargs``.

 .. code:: python

@@ -715,7 +715,7 @@ key ``storage_options``, part of ``backend_kwargs``.
 This also works with ``open_mfdataset``, allowing you to pass a list of paths
 or a URL to be interpreted as a glob string.

-For older versions, and for writing, you must explicitly set up a ``MutableMapping``
+For writing, you must explicitly set up a ``MutableMapping``
 instance and pass this, as follows:

 .. code:: python
@@ -769,10 +769,10 @@ Consolidated Metadata
 ~~~~~~~~~~~~~~~~~~~~~

 Xarray needs to read all of the zarr metadata when it opens a dataset.
-In some storage mediums, such as with cloud object storage (e.g. amazon S3),
+In some storage media, such as cloud object storage (e.g. `Amazon S3`_),
 this can introduce significant overhead, because two separate HTTP calls to
 the object store must be made for each variable in the dataset.
-As of xarray version 0.18, xarray by default uses a feature called
+By default, Xarray uses a feature called
 *consolidated metadata*, storing all metadata for the entire dataset with a
 single key (by default called ``.zmetadata``). This typically drastically speeds
 up opening the store. (For more information on this feature, consult the
@@ -796,16 +796,35 @@ reads. Because this fall-back option is so much slower, xarray issues a

 .. _io.zarr.appending:

-Appending to existing Zarr stores
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Modifying existing Zarr stores
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Xarray supports several ways of incrementally writing variables to a Zarr
 store. These options are useful for scenarios when it is infeasible or
 undesirable to write your entire dataset at once.

+1. Use ``mode='a'`` to add or overwrite entire variables,
+2. Use ``append_dim`` to resize and append to existing variables (sketched below), and
+3. Use ``region`` to write to limited regions of existing arrays.
+
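+As a minimal sketch, appending along a ``time`` dimension with ``append_dim``
+might look like the following (the store path and the toy datasets are
+illustrative only):
+
+.. code:: python
+
+    import numpy as np
+    import xarray as xr
+
+    ds1 = xr.Dataset({"foo": ("time", np.arange(3))}, coords={"time": [0, 1, 2]})
+    ds2 = xr.Dataset({"foo": ("time", np.arange(3))}, coords={"time": [3, 4, 5]})
+
+    ds1.to_zarr("path/to/directory.zarr", mode="w")  # initial write
+    ds2.to_zarr("path/to/directory.zarr", append_dim="time")  # resize and append
+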
 .. tip::

-    If you can load all of your data into a single ``Dataset`` using dask, a
+    For ``Dataset`` objects containing dask arrays, a
     single call to ``to_zarr()`` will write all of your data in parallel.

 .. warning::
@@ -876,8 +895,8 @@ and then calling ``to_zarr`` with ``compute=False`` to write only metadata

     ds.to_zarr(path, compute=False)

 Now, a Zarr store with the correct variable shapes and attributes exists that
-can be filled out by subsequent calls to ``to_zarr``. ``region`` can be
-specified as ``"auto"``, which opens the existing store and determines the
-correct alignment of the new data with the existing coordinates, or as an
-explicit mapping from dimension names to Python ``slice`` objects indicating
-where the data should be written (in index space, not label space), e.g.,
+can be filled out by subsequent calls to ``to_zarr``.
+Setting ``region="auto"`` opens the existing store and determines the correct
+alignment of the new data with the existing coordinates. Alternatively, ``region``
+can be an explicit mapping from dimension names to Python ``slice`` objects
+indicating where the data should be written (in index space, not label space), e.g.,

From fc00adc1e55e5d0fcb12df2dd18bc7937f129f33 Mon Sep 17 00:00:00 2001
From: Deepak Cherian
Date: Wed, 15 Nov 2023 10:33:23 -0700
Subject: [PATCH 2/2] [skip-ci] Whats new

---
 doc/whats-new.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/whats-new.rst b/doc/whats-new.rst
index 5d0c30a1c2f..e0e83fb62c3 100644
--- a/doc/whats-new.rst
+++ b/doc/whats-new.rst
@@ -35,7 +35,7 @@ Breaking changes
 ~~~~~~~~~~~~~~~~

 - drop support for `cdms2 <https://github.com/CDAT/cdms>`_. Please use
   `xcdat <https://xcdat.readthedocs.io>`_ instead (:pull:`8441`).
-  By `Justus Magin `_.
+  By `Justus Magin <https://github.com/keewis>`_.
 - Bump minimum tested pint version to ``>=0.22``.
   By `Deepak Cherian <https://github.com/dcherian>`_.
@@ -75,6 +75,8 @@ Bug fixes

 Documentation
 ~~~~~~~~~~~~~
+- Small updates to documentation on distributed writes: see :ref:`io.zarr.appending`
+  for modifying existing Zarr stores. By `Deepak Cherian <https://github.com/dcherian>`_.

 Internal Changes
 ~~~~~~~~~~~~~~~~