Commit 7234603

benbovy, TomNicholas, pre-commit-ci[bot], and dcherian authored
Add documentation on custom indexes (#6975)
* improve Index base class type annotations

  Use T_Index generic when possible.

* import Index base class in Xarray root namespace
* import IndexSelResult into Xarray root namespace
* wip: Index API docstrings
* wip: doc: add how to add custom index section
* add Index method docstrings
* add user guide on how to create a custom index
* review comments + tweaks
* update what's new
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* Apply uncontroversial suggestions from Deepak's code review

  Co-authored-by: Deepak Cherian <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* Apply more suggestions from code review

  Co-authored-by: Deepak Cherian <[email protected]>

* Link to source code for PandasIndex and PandasMultiIndex

---------

Co-authored-by: Thomas Nicholas <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <[email protected]>
1 parent 647376d commit 7234603

7 files changed

+565
-24
lines changed

doc/api-hidden.rst

+15
Original file line numberDiff line numberDiff line change
@@ -451,6 +451,21 @@
451451
CFTimeIndex.values
452452
CFTimeIndex.year
453453

454+
Index.from_variables
455+
Index.concat
456+
Index.stack
457+
Index.unstack
458+
Index.create_variables
459+
Index.to_pandas_index
460+
Index.isel
461+
Index.sel
462+
Index.join
463+
Index.reindex_like
464+
Index.equals
465+
Index.roll
466+
Index.rename
467+
Index.copy
468+
454469
backends.NetCDF4DataStore.close
455470
backends.NetCDF4DataStore.encode
456471
backends.NetCDF4DataStore.encode_attribute

doc/api.rst

+2-1
Original file line numberDiff line numberDiff line change
@@ -1090,7 +1090,8 @@ Advanced API
10901090
Variable
10911091
IndexVariable
10921092
as_variable
1093-
indexes.Index
1093+
Index
1094+
IndexSelResult
10941095
Context
10951096
register_dataset_accessor
10961097
register_dataarray_accessor
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,233 @@
1+
.. currentmodule:: xarray
2+
3+
How to create a custom index
4+
============================
5+
6+
.. warning::
7+
8+
This feature is highly experimental. Support for custom indexes has been
9+
introduced in v2022.06.0 and is still incomplete. API is subject to change
10+
without deprecation notice. However, we encourage you to experiment and report issues that arise.
11+
12+
Xarray's built-in support for label-based indexing (e.g. ``ds.sel(latitude=40, method="nearest")``) and alignment operations
13+
relies on :py:class:`pandas.Index` objects. Pandas Indexes are powerful and suitable for many
14+
applications but also have some limitations:
15+
16+
- they only work with 1-dimensional coordinates where explicit labels
17+
are fully loaded in memory
18+
- they are hard to reuse with irregular data for which there exist more
19+
efficient, tree-based structures to perform data selection
20+
- they don't support extra metadata that may be required for indexing and
21+
alignment (e.g., a coordinate reference system)
22+
23+
Fortunately, Xarray now allows extending this functionality with custom indexes,
24+
which can be implemented in 3rd-party libraries.
25+
26+
The Index base class
27+
--------------------
28+
29+
Every Xarray index must inherit from the :py:class:`Index` base class. This is,
30+
for example, the case for Xarray's built-in ``PandasIndex`` and ``PandasMultiIndex``
31+
subclasses, which wrap :py:class:`pandas.Index` and
32+
:py:class:`pandas.MultiIndex` respectively.
33+
34+
The ``Index`` API closely follows the :py:class:`Dataset` and
35+
:py:class:`DataArray` API, e.g., for an index to support :py:meth:`DataArray.sel` it needs to
36+
implement :py:meth:`Index.sel`, to support :py:meth:`DataArray.stack` and :py:meth:`DataArray.unstack` it
37+
needs to implement :py:meth:`Index.stack` and :py:meth:`Index.unstack`, etc.
38+
39+
Some guidelines and examples are given below. More details can be found in the
40+
documented :py:class:`Index` API.
41+
42+
Minimal requirements
43+
--------------------
44+
45+
Every index must at least implement the :py:meth:`Index.from_variables` class
46+
method, which is used by Xarray to build a new index instance from one or more
47+
existing coordinates in a Dataset or DataArray.
48+
49+
Since any collection of coordinates can be passed to that method (i.e., the
50+
number, order and dimensions of the coordinates are all arbitrary), it is the
51+
responsibility of the index to check the consistency and validity of those input
52+
coordinates.
53+
54+
For example, :py:class:`~xarray.core.indexes.PandasIndex` accepts only one coordinate and
55+
:py:class:`~xarray.core.indexes.PandasMultiIndex` accepts one or more 1-dimensional coordinates that must all
56+
share the same dimension. Other, custom indexes need not have the same
57+
constraints, e.g.,
58+
59+
- a georeferenced raster index which only accepts two 1-d coordinates with
60+
distinct dimensions
61+
- a staggered grid index which takes coordinates with different dimension name
62+
suffixes (e.g., "_c" and "_l" for center and left)
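
The kind of input checking described above could look like the following minimal sketch. ``TwoDimIndex`` and its ``dims`` attribute are hypothetical names used only for illustration, and the keyword-only ``options`` argument is an assumption that may not match every Xarray version.

.. code-block:: python

    import numpy as np
    import xarray as xr


    class TwoDimIndex(xr.Index):
        """Hypothetical index accepting two 1-d coordinates with distinct dimensions."""

        def __init__(self, dims):
            self.dims = dims

        @classmethod
        def from_variables(cls, variables, *, options=None):
            # `options` is accepted but unused in this sketch (assumed signature)
            if len(variables) != 2:
                raise ValueError(f"expected 2 coordinates, got {len(variables)}")
            for name, var in variables.items():
                if var.ndim != 1:
                    raise ValueError(f"coordinate {name!r} must be 1-dimensional")
            dims = tuple(var.dims[0] for var in variables.values())
            if dims[0] == dims[1]:
                raise ValueError("coordinates must have distinct dimensions")
            return cls(dims)


    x = xr.Variable("x", np.arange(3))
    y = xr.Variable("y", np.arange(4))
    index = TwoDimIndex.from_variables({"x": x, "y": y})

Raising ``ValueError`` with a descriptive message, rather than relying on ``assert``, keeps the checks active when Python runs with optimizations enabled.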
63+
64+
Optional requirements
65+
---------------------
66+
67+
Pretty much everything else is optional. Depending on the method, in the absence
68+
of a (re)implementation, an index will either raise a ``NotImplementedError``
69+
or won't do anything specific (just drop, pass or copy itself
70+
from/to the resulting Dataset or DataArray).
71+
72+
For example, you can just skip re-implementing :py:meth:`Index.rename` if there
73+
is no internal attribute or object to rename according to the new desired
74+
coordinate or dimension names. In the case of ``PandasIndex``, we rename the
75+
underlying ``pandas.Index`` object and/or update the ``PandasIndex.dim``
76+
attribute since the associated dimension name has been changed.
77+
78+
Wrap index data as coordinate data
79+
----------------------------------
80+
81+
In some cases it is possible to reuse the index's underlying object or structure
82+
as coordinate data and hence avoid data duplication.
83+
84+
For ``PandasIndex`` and ``PandasMultiIndex``, we
85+
leverage the fact that ``pandas.Index`` objects expose some array-like API. In
86+
Xarray we use some wrappers around those underlying objects as a thin
87+
compatibility layer to preserve dtypes, handle explicit and n-dimensional
88+
indexing, etc.
89+
90+
Other structures like tree-based indexes (e.g., kd-tree) may differ too much
91+
from arrays to be reused as coordinate data.
92+
93+
If the index data can be reused as coordinate data, the ``Index`` subclass
94+
should implement :py:meth:`Index.create_variables`. This method accepts a
95+
dictionary of variable names as keys and :py:class:`Variable` objects as values (used for propagating
96+
variable metadata) and should return a dictionary of new :py:class:`Variable` or
97+
:py:class:`IndexVariable` objects.
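
As a rough illustration of the contract described above, here is a sketch of ``create_variables`` for an index backed by a plain NumPy array. ``ArrayBackedIndex`` and its ``values``, ``dim`` and ``name`` attributes are hypothetical; the built-in indexes use dedicated wrappers instead.

.. code-block:: python

    import numpy as np
    import xarray as xr


    class ArrayBackedIndex(xr.Index):
        """Hypothetical index whose internal data is a plain 1-d NumPy array."""

        def __init__(self, values, dim, name):
            self.values = np.asarray(values)
            self.dim = dim
            self.name = name

        def create_variables(self, variables=None):
            # propagate metadata (attrs) from the existing coordinate, if given
            attrs = {}
            if variables is not None and self.name in variables:
                attrs = dict(variables[self.name].attrs)
            # build the coordinate variable from the index's own array
            return {self.name: xr.Variable(self.dim, self.values, attrs=attrs)}


    index = ArrayBackedIndex([10.0, 20.0, 30.0], dim="x", name="x")
    old = xr.Variable("x", [10.0, 20.0, 30.0], attrs={"units": "m"})
    new_vars = index.create_variables({"x": old})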
98+
99+
Data selection
100+
--------------
101+
102+
For an index to support label-based selection, it needs to at least implement
103+
:py:meth:`Index.sel`. This method accepts a dictionary of labels where the keys
104+
are coordinate names (already filtered for the current index) and the values can
105+
be pretty much anything (e.g., a slice, a tuple, a list, a numpy array, a
106+
:py:class:`Variable` or a :py:class:`DataArray`). It is the responsibility of
107+
the index to properly handle those input labels.
108+
109+
:py:meth:`Index.sel` must return an instance of :py:class:`IndexSelResult`. The
110+
latter is a small data class that holds positional indexers (indices) and that
111+
may also hold new variables, new indexes, names of variables or indexes to drop,
112+
names of dimensions to rename, etc. For example, this is useful in the case of
113+
``PandasMultiIndex`` as it allows Xarray to convert it into a single ``PandasIndex``
114+
when only one level remains after the selection.
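
A minimal sketch of a ``sel`` implementation returning an ``IndexSelResult`` is given here. ``ExactMatchIndex`` is a hypothetical class that only handles scalar labels with exact matches on sorted, unique values; the ``method`` and ``tolerance`` arguments are accepted but ignored.

.. code-block:: python

    import numpy as np
    import xarray as xr


    class ExactMatchIndex(xr.Index):
        """Hypothetical index over sorted, unique 1-d labels."""

        def __init__(self, values, dim, name):
            self.values = np.asarray(values)
            self.dim = dim
            self.name = name

        def sel(self, labels, method=None, tolerance=None):
            # this sketch only supports a single scalar label
            label = labels[self.name]
            pos = int(np.searchsorted(self.values, label))
            if pos >= self.values.size or self.values[pos] != label:
                raise KeyError(label)
            # map the dimension name to the positional indexer
            return xr.IndexSelResult({self.dim: pos})


    index = ExactMatchIndex([10, 20, 30], dim="x", name="x")
    result = index.sel({"x": 20})

Only the positional indexers are filled in; the other ``IndexSelResult`` fields keep their defaults.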
115+
116+
The :py:class:`IndexSelResult` class is also used to merge results from label-based
117+
selection performed by different indexes. Note that it is now possible to have
118+
two distinct indexes for two 1-d coordinates sharing the same dimension, but it
119+
is not currently possible to use those two indexes in the same call to
120+
:py:meth:`Dataset.sel`.
121+
122+
Optionally, the index may also implement :py:meth:`Index.isel`. In the case of
123+
``PandasIndex`` we use it to create a new index object by just indexing the
124+
underlying ``pandas.Index`` object. In other cases this may not be possible,
125+
e.g., a kd-tree object may not be easily indexed. If ``Index.isel()`` is not
126+
implemented, the index is just dropped in the DataArray or Dataset resulting
127+
from the selection.
128+
129+
Alignment
130+
---------
131+
132+
For an index to support alignment, it needs to implement:
133+
134+
- :py:meth:`Index.equals`, which compares the index with another index and
135+
returns either ``True`` or ``False``
136+
- :py:meth:`Index.join`, which combines the index with another index and returns
137+
a new Index object
138+
- :py:meth:`Index.reindex_like`, which queries the index with another index and
139+
returns positional indexers that are used to re-index Dataset or DataArray
140+
variables along one or more dimensions
141+
142+
Xarray ensures that those three methods are called with an index of the same
143+
type as argument.
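
The three methods could be sketched as follows for a simple index over sorted, unique labels. ``LabelIndex`` is hypothetical, only ``inner`` and ``outer`` joins are handled, and marking missing labels with ``-1`` in ``reindex_like`` is an assumption modelled on pandas-style indexers.

.. code-block:: python

    import numpy as np
    import xarray as xr


    class LabelIndex(xr.Index):
        """Hypothetical index over sorted, unique 1-d labels."""

        def __init__(self, labels, dim):
            self.labels = np.asarray(labels)
            self.dim = dim

        def equals(self, other, **kwargs):
            return self.dim == other.dim and np.array_equal(self.labels, other.labels)

        def join(self, other, how="inner"):
            if how == "inner":
                labels = np.intersect1d(self.labels, other.labels)
            elif how == "outer":
                labels = np.union1d(self.labels, other.labels)
            else:
                raise NotImplementedError(f"how={how!r} is not handled in this sketch")
            return type(self)(labels, self.dim)

        def reindex_like(self, other, **kwargs):
            # position of each of other's labels in self, -1 where missing
            pos = np.searchsorted(self.labels, other.labels)
            pos = np.clip(pos, 0, len(self.labels) - 1)
            found = self.labels[pos] == other.labels
            return {self.dim: np.where(found, pos, -1)}


    a = LabelIndex([1, 2, 3], "x")
    b = LabelIndex([2, 3, 4], "x")
    joined = a.join(b, how="inner")
    indexers = a.reindex_like(b)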
144+
145+
Meta-indexes
146+
------------
147+
148+
Nothing prevents writing a custom Xarray index that itself encapsulates other
149+
Xarray index(es). We call such an index a "meta-index".
150+
151+
Here is a small example of a meta-index for geospatial, raster datasets (i.e.,
152+
regularly spaced 2-dimensional data) that internally relies on two
153+
``PandasIndex`` instances for the x and y dimensions respectively:
154+
155+
.. code-block:: python

    from xarray import Index
    from xarray.core.indexes import PandasIndex
    from xarray.core.indexing import merge_sel_results


    class RasterIndex(Index):
        def __init__(self, xy_indexes):
            assert len(xy_indexes) == 2

            # must have two distinct dimensions
            dim = [idx.dim for idx in xy_indexes.values()]
            assert dim[0] != dim[1]

            self._xy_indexes = xy_indexes

        @classmethod
        def from_variables(cls, variables):
            assert len(variables) == 2

            xy_indexes = {
                k: PandasIndex.from_variables({k: v}) for k, v in variables.items()
            }

            return cls(xy_indexes)

        def create_variables(self, variables):
            idx_variables = {}

            for index in self._xy_indexes.values():
                idx_variables.update(index.create_variables(variables))

            return idx_variables

        def sel(self, labels):
            results = []

            for k, index in self._xy_indexes.items():
                if k in labels:
                    results.append(index.sel({k: labels[k]}))

            return merge_sel_results(results)
198+
199+
200+
This basic index only supports label-based selection. Providing a full-featured
201+
index by implementing the other ``Index`` methods should be pretty
202+
straightforward for this example, though.
203+
204+
This example is also not very useful unless we add some extra functionality on
205+
top of the two encapsulated ``PandasIndex`` objects, such as a coordinate
206+
reference system.
207+
208+
How to use a custom index
209+
-------------------------
210+
211+
You can use :py:meth:`Dataset.set_xindex` or :py:meth:`DataArray.set_xindex` to assign a
212+
custom index to a Dataset or DataArray, e.g., using the ``RasterIndex`` above:
213+
214+
.. code-block:: python

    import numpy as np
    import xarray as xr

    da = xr.DataArray(
        np.random.uniform(size=(100, 50)),
        coords={"x": ("x", np.arange(50)), "y": ("y", np.arange(100))},
        dims=("y", "x"),
    )

    # Xarray creates default indexes for the 'x' and 'y' coordinates;
    # we first need to explicitly drop them
    da = da.drop_indexes(["x", "y"])

    # Build a RasterIndex from the 'x' and 'y' coordinates
    da_raster = da.set_xindex(["x", "y"], RasterIndex)

    # RasterIndex now takes care of label-based selection
    selected = da_raster.sel(x=10, y=slice(20, 50))

doc/internals/index.rst

+1
Original file line numberDiff line numberDiff line change
@@ -25,3 +25,4 @@ The pages in this section are intended for:
2525
extending-xarray
2626
zarr-encoding-spec
2727
how-to-add-new-backend
28+
how-to-create-custom-index

doc/whats-new.rst

+4
Original file line numberDiff line numberDiff line change
@@ -795,6 +795,10 @@ Bug fixes
795795

796796
Documentation
797797
~~~~~~~~~~~~~
798+
799+
- Add docstrings for the :py:class:`Index` base class and add some documentation on how to
800+
create custom, Xarray-compatible indexes (:pull:`6975`)
801+
By `Benoît Bovy <https://github.com/benbovy>`_.
798802
- Update merge docstrings. (:issue:`6935`, :pull:`7033`)
799803
By `Zach Moon <https://github.com/zmoon>`_.
800804
- Raise a more informative error when trying to open a non-existent zarr store. (:issue:`6484`, :pull:`7060`)

xarray/__init__.py

+4
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,8 @@
3232
register_dataarray_accessor,
3333
register_dataset_accessor,
3434
)
35+
from xarray.core.indexes import Index
36+
from xarray.core.indexing import IndexSelResult
3537
from xarray.core.merge import Context, MergeError, merge
3638
from xarray.core.options import get_options, set_options
3739
from xarray.core.parallel import map_blocks
@@ -100,6 +102,8 @@
100102
"Coordinate",
101103
"DataArray",
102104
"Dataset",
105+
"Index",
106+
"IndexSelResult",
103107
"IndexVariable",
104108
"Variable",
105109
# Exceptions
