Commit a7c03e5
Some edits
1 parent b690248

1 file changed: +27 -29 lines

doc/user-guide/dask.rst

@@ -29,6 +29,8 @@ Here are some examples for using Xarray with Dask at scale:
 - `CMIP6 Precipitation Frequency Analysis <https://gallery.pangeo.io/repos/pangeo-gallery/cmip6/precip_frequency_change.html>`_
 - `Using Dask + Cloud Optimized GeoTIFFs <https://gallery.pangeo.io/repos/pangeo-data/landsat-8-tutorial-gallery/landsat8.html#Dask-Chunks-and-Cloud-Optimized-Geotiffs>`_

+Find more examples at the `Project Pythia cookbook gallery <https://cookbooks.projectpythia.org/>`_.
+

 Using Dask with Xarray
 ----------------------
@@ -38,11 +40,11 @@ Using Dask with Xarray
     :align: right
     :alt: A Dask array

-Dask divides arrays into smaller parts called chunks. These chunks are small, manageable pieces of the larger dataset, that Dask is able to process in parallel (see the `Dask Array docs on chunks <https://docs.dask.org/en/stable/array-chunks.html?utm_source=xarray-docs>`_).
+Dask divides arrays into smaller parts called chunks. These chunks are small, manageable pieces of the larger dataset that Dask is able to process in parallel (see the `Dask Array docs on chunks <https://docs.dask.org/en/stable/array-chunks.html?utm_source=xarray-docs>`_). Commonly, chunks are set when reading data, but you can also set the chunk size manually at any point in your workflow using :py:meth:`Dataset.chunk` and :py:meth:`DataArray.chunk`. See :ref:`dask.chunks` for more.

 Xarray operations on Dask-backed arrays are lazy. This means computations are not executed immediately, but are instead queued up as tasks in a Dask graph.

-When a result is requested (e.g., for plotting, saving, or explicitly computing), Dask executes the task graph. The computations are carried out in parallel, with each chunk being processed independently. This parallel execution is key to handling large datasets efficiently.
+When a result is requested (e.g., for plotting, writing to disk, or explicitly computing), Dask executes the task graph. The computations are carried out in parallel, with each chunk being processed independently. This parallel execution is key to handling large datasets efficiently.

 Nearly all Xarray methods have been extended to work automatically with Dask Arrays. This includes things like indexing, concatenating, rechunking, grouped operations, etc. Common operations are covered in more detail in each of the sections below.

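
For context, the lazy workflow described in this hunk looks roughly like this. This is a minimal sketch assuming the xarray tutorial dataset (which downloads a small sample file); any Dask-backed dataset behaves the same::

    import xarray as xr

    # Opening with ``chunks`` creates Dask-backed variables; no data is read yet
    ds = xr.tutorial.open_dataset("air_temperature", chunks={"time": 100})

    # Still lazy: this only adds tasks to the Dask graph
    anomaly = ds["air"] - ds["air"].mean("time")

    # Requesting a result executes the graph in parallel, chunk by chunk
    anomaly = anomaly.compute()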
@@ -51,7 +53,7 @@ Nearly all Xarray methods have been extended to work automatically with Dask Arr
 Reading and writing data
 ~~~~~~~~~~~~~~~~~~~~~~~~

-When reading data, Dask divides your dataset into smaller chunks. You can specify the size of chunks with the ``chunks`` argument.
+When reading data, Dask divides your dataset into smaller chunks. You can specify the size of chunks with the ``chunks`` argument. Specifying ``chunks="auto"`` will set the dask chunk sizes to be a multiple of the on-disk chunk sizes. This can be a good idea, but usually the appropriate dask chunk size will depend on your workflow.

 .. tab:: Zarr

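
As a rough illustration of these ``chunks`` options (the store path is hypothetical)::

    import xarray as xr

    # One dask chunk per on-disk chunk
    ds = xr.open_zarr("store.zarr", chunks={})

    # Dask chunks chosen automatically as a multiple of the on-disk chunks
    ds = xr.open_zarr("store.zarr", chunks="auto")

    # Explicit chunk sizes tailored to the workflow
    ds = xr.open_zarr("store.zarr", chunks={"time": 100})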
@@ -75,7 +77,7 @@ When reading data, Dask divides your dataset into smaller chunks. You can specif

 .. tip::

-    When reading in many netCDF files with py:func:`~xarray.open_mfdataset`, using ``engine=h5netcdf`` can
+    When reading in many netCDF files with :py:func:`~xarray.open_mfdataset`, using ``engine="h5netcdf"`` can
     be faster than the default which uses the netCDF4 package.

 Save larger-than-memory netCDF files::
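
A sketch of the multi-file pattern from the tip above (the glob pattern is hypothetical; the optional ``parallel=True`` additionally opens the files in parallel with Dask)::

    import xarray as xr

    # h5netcdf is often faster than the default netCDF4 engine for many files
    ds = xr.open_mfdataset("data/*.nc", engine="h5netcdf", parallel=True)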
@@ -144,17 +146,17 @@ There are a few common cases where you may want to convert lazy Dask arrays into
 - You've reduced the dataset (by filtering or with a groupby, for example) and now have something much smaller that fits in memory
 - You need to compute intermediate results since Dask is unable (or struggles) to perform a certain computation. The canonical example of this is normalizing a dataset, e.g., ``ds - ds.mean()``, when ``ds`` is larger than memory. Typically, you should either save ``ds`` to disk or compute ``ds.mean()`` eagerly.

-To do this, you can use :py:meth:`~xarray.Dataset.compute`:
+To do this, you can use :py:meth:`Dataset.compute` or :py:meth:`DataArray.compute`:

 .. ipython:: python

     ds.compute()

 .. note::

-    Using :py:meth:`~xarray.Dataset.compute` is preferred to :py:meth:`~xarray.Dataset.load`, which changes the results in-place.
+    Using :py:meth:`Dataset.compute` is preferred to :py:meth:`Dataset.load`, which changes the results in-place.

-You can also access :py:attr:`~xarray.DataArray.values`, which will always be a NumPy array:
+You can also access :py:attr:`DataArray.values`, which will always be a NumPy array:

 .. ipython::
     :verbatim:
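
To make the distinction concrete, a brief sketch (assuming ``ds`` is Dask-backed and ``"air"`` is one of its variables)::

    computed = ds.compute()  # returns a new in-memory object; ds stays lazy
    ds.load()                # computes in place, modifying ds itself
    arr = ds["air"].values   # always a NumPy array, computing if necessary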
@@ -166,7 +168,7 @@ You can also access :py:attr:`~xarray.DataArray.values`, which will always be a
     ...
     # truncated for brevity

-NumPy ufuncs like ``np.sin`` transparently work on all xarray objects, including those
+NumPy ufuncs like :py:func:`numpy.sin` transparently work on all xarray objects, including those
 that store lazy Dask arrays:

 .. ipython:: python
@@ -175,22 +177,18 @@ that store lazy Dask arrays:

     np.sin(ds)

-To access Dask arrays directly, use the
-:py:attr:`DataArray.data <xarray.DataArray.data>` attribute. This attribute exposes
-array data either as a Dask array or as a NumPy array, depending on whether it has been
-loaded into Dask or not.
-
-.. note::
-
-    ``.data`` is also used to expose other "computable" array backends beyond Dask and
-    NumPy (e.g. sparse and pint arrays).
+To access Dask arrays directly, use the :py:attr:`DataArray.data` attribute, which exposes the DataArray's underlying array type.

-If you're using a Dask cluster, you can also use :py:meth:`~xarray.Dataset.persist` for quickly accessing intermediate outputs. This is most helpful after expensive operations like rechunking or setting an index. It's a way of telling the cluster that it should start executing the computations that you have defined so far, and that it should try to keep those results in memory. You will get back a new Dask array that is semantically equivalent to your old array, but now points to running data.
+If you're using a Dask cluster, you can also use :py:meth:`Dataset.persist` for quickly accessing intermediate outputs. This is most helpful after expensive operations like rechunking or setting an index. It's a way of telling the cluster that it should start executing the computations that you have defined so far, and that it should try to keep those results in memory. You will get back a new Dask array that is semantically equivalent to your old array, but now points to running data.

 .. code-block:: python

     ds = ds.persist()

+.. tip::
+
+    Remember to save the dataset returned by persist! This is a common mistake.
+
 .. _dask.chunks:

 Chunking and performance
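
For context, a minimal sketch of both points from this hunk (``"air"`` is a hypothetical variable name)::

    # .data exposes the underlying array: a dask array before computing,
    # a NumPy array afterwards
    type(ds["air"].data)

    # On a cluster, persist() starts computing in the background and keeps
    # results in distributed memory. Keep the returned object!
    ds = ds.persist()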
@@ -204,9 +202,10 @@ The way a dataset is chunked can be critical to performance when working with la

 It can be helpful to choose chunk sizes based on your downstream analyses and to chunk as early as possible. Datasets with smaller chunks along the time axis, for example, can make time domain problems easier to parallelize since Dask can perform the same operation on each time chunk. If you're working with a large dataset with chunks that make downstream analyses challenging, you may need to rechunk your data. This is an expensive operation though, so is only recommended when needed.

-You can rechunk a dataset by:
+You can chunk or rechunk a dataset by:

-- Specifying ``chunks={}`` when reading in your dataset. If you know you'll want to do some spatial subsetting, for example, you could use ``chunks={'latitude': 10, 'longitude': 10}`` to specify small chunks across space. This can avoid loading subsets of data that span multiple chunks, thus reducing the number of file reads. Note that this will only work, though, for chunks that are similar to how the data is chunked on disk. Otherwise, it will be very slow and require a lot of network bandwidth.
+- Specifying the ``chunks`` kwarg when reading in your dataset. If you know you'll want to do some spatial subsetting, for example, you could use ``chunks={'latitude': 10, 'longitude': 10}`` to specify small chunks across space. This can avoid loading subsets of data that span multiple chunks, thus reducing the number of file reads. Note that this will only work, though, for chunks that are similar to how the data is chunked on disk. Otherwise, it will be very slow and require a lot of network bandwidth.
+- Many array file formats are chunked on disk. You can specify ``chunks={}`` to have a single dask chunk map to a single on-disk chunk, and ``chunks="auto"`` to have a single dask chunk be an automatically chosen multiple of the on-disk chunks.
 - Using :py:meth:`Dataset.chunk` after you've already read in your dataset. For time domain problems, for example, you can use ``ds.chunk(time=TimeResampler())`` to rechunk according to a specified unit of time. ``ds.chunk(time=TimeResampler("MS"))``, for example, will set the chunks so that a month of data is contained in one chunk.

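
A short sketch of these rechunking calls, assuming a recent xarray where ``TimeResampler`` is importable from ``xarray.groupers`` (worth checking against your installed version)::

    from xarray.groupers import TimeResampler

    # Fixed-size chunks along a dimension
    ds = ds.chunk({"time": 24})

    # One calendar month of data per chunk
    ds = ds.chunk(time=TimeResampler("MS"))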
@@ -224,7 +223,11 @@ each block of your xarray object, you have three options:
 1. Use :py:func:`~xarray.apply_ufunc` to apply functions that consume and return NumPy arrays.
 2. Use :py:func:`~xarray.map_blocks`, :py:meth:`Dataset.map_blocks` or :py:meth:`DataArray.map_blocks`
    to apply functions that consume and return xarray objects.
-3. Extract Dask Arrays from xarray objects with ``.data`` and use Dask directly.
+3. Extract Dask Arrays from xarray objects with :py:attr:`DataArray.data` and use Dask directly.
+
+.. tip::
+
+    See the extensive Xarray tutorial on `apply_ufunc <https://tutorial.xarray.dev/advanced/apply_ufunc/apply_ufunc.html>`_.


 ``apply_ufunc``
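
As a minimal illustration of option 1, a sketch of ``apply_ufunc`` with a NumPy-in, NumPy-out function (``ds["air"]`` is a hypothetical Dask-backed variable)::

    import xarray as xr

    def double(arr):
        # plain array in, same-shape array out
        return arr * 2

    result = xr.apply_ufunc(
        double,
        ds["air"],
        dask="parallelized",  # apply the function blockwise over Dask chunks
        output_dtypes=[ds["air"].dtype],
    )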
@@ -475,7 +478,6 @@ Here's an example of a simplified workflow putting some of these tips together:

 .. code-block:: python

-    from flox.xarray import xarray_reduce
     import xarray

     ds = xr.open_zarr(  # Since we're doing a spatial reduction, increase chunk size in x, y
@@ -486,11 +488,7 @@ Here's an example of a simplified workflow putting some of these tips together:
         time=slice("2020-01-01", "2020-12-31")  # Filter early
     )

-    zonal_mean = xarray_reduce(  # Faster groupby with flox
-        time_subset,
-        chunked_zones,
-        func="mean",
-        expected_groups=(zone_labels,),
-    )
+    # faster resampling when flox is installed
+    daily = ds.resample(time="D").mean()

-    zonal_mean.load()  # Pull smaller results into memory after reducing the dataset
+    daily.load()  # Pull smaller results into memory after reducing the dataset
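
Note on this last change: when flox is installed, xarray dispatches groupby and resample reductions to it automatically, so the explicit ``xarray_reduce`` call (and its import) is no longer needed; the plain ``ds.resample(time="D").mean()`` gets the same speedup.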
