Find more examples at the `Project Pythia cookbook gallery <https://cookbooks.projectpythia.org/>`_.
Using Dask with Xarray
----------------------

.. Image: A Dask array (align: right)

Dask divides arrays into smaller parts called chunks. These chunks are small, manageable pieces of the larger dataset that Dask can process in parallel (see the `Dask Array docs on chunks <https://docs.dask.org/en/stable/array-chunks.html?utm_source=xarray-docs>`_). Chunks are commonly set when reading data, but you can also set the chunk size manually at any point in your workflow using :py:meth:`Dataset.chunk` and :py:meth:`DataArray.chunk`. See :ref:`dask.chunks` for more.
Xarray operations on Dask-backed arrays are lazy. This means computations are not executed immediately, but are instead queued up as tasks in a Dask graph.
When a result is requested (e.g., for plotting, writing to disk, or explicitly computing), Dask executes the task graph. The computations are carried out in parallel, with each chunk being processed independently. This parallel execution is key to handling large datasets efficiently.
Nearly all Xarray methods have been extended to work automatically with Dask Arrays. This includes things like indexing, concatenating, rechunking, grouped operations, etc. Common operations are covered in more detail in each of the sections below.
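
For instance, familiar operations stay lazy and only build a task graph until you explicitly request the result (a minimal sketch; the file name, ``"air"`` variable, and chunk sizes are hypothetical):

.. code-block:: python

    import xarray as xr

    # Open with Dask chunks; no data is read into memory yet
    ds = xr.open_dataset("data.nc", chunks={"time": 100})

    # Indexing, arithmetic, and grouped operations all return lazy objects
    monthly = ds["air"].groupby("time.month").mean()

    # Only now is the task graph executed
    monthly = monthly.compute()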
Reading and writing data
~~~~~~~~~~~~~~~~~~~~~~~~
When reading data, Dask divides your dataset into smaller chunks. You can specify the size of chunks with the ``chunks`` argument. Specifying ``chunks="auto"`` will set the Dask chunk sizes to be a multiple of the on-disk chunk sizes. This can be a good default, but the appropriate Dask chunk size usually depends on your workflow.
.. tab:: Zarr
.. tip::

   When reading in many netCDF files with :py:func:`~xarray.open_mfdataset`, using ``engine="h5netcdf"`` can
   be faster than the default, which uses the netCDF4 package.
Save larger-than-memory netCDF files::
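
    # A minimal sketch (the output file name is hypothetical): passing
    # compute=False makes to_netcdf return a dask.delayed object
    # instead of writing immediately
    delayed_write = ds.to_netcdf("larger_than_memory.nc", compute=False)

    # Trigger the chunked write when you are ready
    delayed_write.compute()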

There are a few common cases where you may want to convert lazy Dask arrays into NumPy arrays:
- You've reduced the dataset (by filtering or with a groupby, for example) and now have something much smaller that fits in memory
- You need to compute intermediate results since Dask is unable (or struggles) to perform a certain computation. The canonical example of this is normalizing a dataset, e.g., ``ds - ds.mean()``, when ``ds`` is larger than memory. Typically, you should either save ``ds`` to disk or compute ``ds.mean()`` eagerly (a short sketch follows this list).
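
For example, a minimal sketch of that normalization case (assuming ``ds`` is a Dask-backed dataset; the output path is hypothetical):

.. code-block:: python

    # Compute the small reduction eagerly so the subtraction doesn't force
    # Dask to hold every chunk of ``ds`` in memory at once
    mean = ds.mean().compute()

    # The anomaly is still lazy and can be written out chunk by chunk
    anomaly = ds - mean
    anomaly.to_zarr("anomaly.zarr")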
To do this, you can use :py:meth:`Dataset.compute` or :py:meth:`DataArray.compute`:
.. ipython:: python
    ds.compute()
.. note::

   Using :py:meth:`Dataset.compute` is preferred to :py:meth:`Dataset.load`, which modifies the dataset in-place.
You can also access :py:attr:`DataArray.values`, which will always be a NumPy array:
.. ipython::
    :verbatim:
    ...
    # truncated for brevity
NumPy ufuncs like :py:func:`numpy.sin` transparently work on all xarray objects, including those
that store lazy Dask arrays:
.. ipython:: python
    np.sin(ds)
To access Dask arrays directly, use the :py:attr:`DataArray.data` attribute, which exposes the DataArray's underlying array type.
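
As a quick illustration (a sketch assuming ``ds`` contains a Dask-backed variable named ``"air"``, which is hypothetical here):

.. code-block:: python

    import dask.array

    chunked = ds["air"].data  # a dask.array.Array while the data is lazy
    assert isinstance(chunked, dask.array.Array)

    in_memory = ds["air"].compute().data  # a plain numpy.ndarray after computing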
If you're using a Dask cluster, you can also use :py:meth:`Dataset.persist` for quickly accessing intermediate outputs. This is most helpful after expensive operations like rechunking or setting an index. It's a way of telling the cluster that it should start executing the computations that you have defined so far, and that it should try to keep those results in memory. You will get back a new Dask array that is semantically equivalent to your old array, but now points to running data.
.. code-block:: python
    ds = ds.persist()
.. tip::
   Remember to save the dataset returned by persist! This is a common mistake.
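
   For example, a minimal sketch of the pitfall (``ds`` is any Dask-backed dataset):

   .. code-block:: python

       ds.persist()  # result is discarded; nothing useful is kept
       ds = ds.persist()  # rebind the name so you work with the persisted data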
.. _dask.chunks:
Chunking and performance
------------------------

The way a dataset is chunked can be critical to performance when working with large datasets.

It can be helpful to choose chunk sizes based on your downstream analyses and to chunk as early as possible. Datasets with smaller chunks along the time axis, for example, can make time domain problems easier to parallelize since Dask can perform the same operation on each time chunk. If you're working with a large dataset with chunks that make downstream analyses challenging, you may need to rechunk your data. This is an expensive operation though, so is only recommended when needed.
You can chunk or rechunk a dataset by:
- Specifying the ``chunks`` kwarg when reading in your dataset. If you know you'll want to do some spatial subsetting, for example, you could use ``chunks={'latitude': 10, 'longitude': 10}`` to specify small chunks across space. This can avoid loading subsets of data that span multiple chunks, thus reducing the number of file reads. Note that this will only work, though, for chunks that are similar to how the data is chunked on disk. Otherwise, it will be very slow and require a lot of network bandwidth.
- Many array file formats are chunked on disk. You can specify ``chunks={}`` to have a single dask chunk map to a single on-disk chunk, and ``chunks="auto"`` to have a single dask chunk be an automatically chosen multiple of the on-disk chunks.
- Using :py:meth:`Dataset.chunk` after you've already read in your dataset. For time domain problems, you can use ``ds.chunk(time=TimeResampler())`` to rechunk according to a specified unit of time; ``ds.chunk(time=TimeResampler("MS"))``, for example, will set the chunks so that a month of data is contained in one chunk (see the sketch after this list).
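
A minimal sketch of both approaches (the store path and chunk sizes are hypothetical, and ``TimeResampler`` is assumed to be importable from ``xarray.groupers``):

.. code-block:: python

    import xarray as xr
    from xarray.groupers import TimeResampler

    # Small spatial chunks chosen up front to speed up spatial subsetting
    ds = xr.open_zarr("store.zarr", chunks={"latitude": 10, "longitude": 10})

    # Later, rechunk so that each chunk holds one month of data
    monthly_chunked = ds.chunk(time=TimeResampler("MS"))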

To apply a custom function in parallel to each block of your xarray object, you have three options:

1. Use :py:func:`~xarray.apply_ufunc` to apply functions that consume and return NumPy arrays.
2. Use :py:func:`~xarray.map_blocks`, :py:meth:`Dataset.map_blocks` or :py:meth:`DataArray.map_blocks`
   to apply functions that consume and return xarray objects (a minimal sketch follows this list).
3. Extract Dask Arrays from xarray objects with :py:attr:`DataArray.data` and use Dask directly.
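
For option 2, a minimal sketch (the function here is hypothetical; :py:func:`~xarray.map_blocks` calls it once per block and reassembles the results):

.. code-block:: python

    def demean(block):
        # Receives one in-memory block as an xarray object and must return
        # an object with the same structure
        return block - block.mean()

    result = xr.map_blocks(demean, ds, template=ds)
    result.compute()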
.. tip::
   See the extensive Xarray tutorial on `apply_ufunc <https://tutorial.xarray.dev/advanced/apply_ufunc/apply_ufunc.html>`_.
``apply_ufunc``
~~~~~~~~~~~~~~~

Here's an example of a simplified workflow putting some of these tips together:

.. code-block:: python

    import xarray as xr

    ds = xr.open_zarr(  # Since we're doing a spatial reduction, increase chunk size in x, y
        ...
        time=slice("2020-01-01", "2020-12-31")  # Filter early
    )

    # faster resampling when flox is installed
    daily = ds.resample(time="D").mean()

    daily.load()  # Pull smaller results into memory after reducing the dataset