|
| 1 | +.. currentmodule:: xarray |
| 2 | + |
| 3 | +How to create a custom index |
| 4 | +============================ |
| 5 | + |
| 6 | +.. warning:: |
| 7 | + |
| 8 | + This feature is highly experimental. Support for custom indexes was |
| 9 | + introduced in v2022.06.0 and is still incomplete. The API is subject to change |
| 10 | + without deprecation notice. However, we encourage you to experiment and report issues that arise. |
| 11 | + |
| 12 | +Xarray's built-in support for label-based indexing (e.g. ``ds.sel(latitude=40, method="nearest")``) and alignment operations |
| 13 | +relies on :py:class:`pandas.Index` objects. Pandas Indexes are powerful and suitable for many |
| 14 | +applications but also have some limitations: |
| 15 | + |
| 16 | +- they only work with 1-dimensional coordinates where explicit labels |
| 17 | + are fully loaded in memory |
| 18 | +- they are hard to reuse with irregular data for which there exist more |
| 19 | + efficient, tree-based structures to perform data selection |
| 20 | +- they don't support extra metadata that may be required for indexing and |
| 21 | + alignment (e.g., a coordinate reference system) |
| 22 | + |
| 23 | +Fortunately, Xarray now allows extending this functionality with custom indexes, |
| 24 | +which can be implemented in 3rd-party libraries. |
| 25 | + |
| 26 | +The Index base class |
| 27 | +-------------------- |
| 28 | + |
| 29 | +Every Xarray index must inherit from the :py:class:`Index` base class. This is the |
| 30 | +case, for example, for Xarray's built-in ``PandasIndex`` and ``PandasMultiIndex`` |
| 31 | +subclasses, which wrap :py:class:`pandas.Index` and |
| 32 | +:py:class:`pandas.MultiIndex` respectively. |
| 33 | + |
| 34 | +The ``Index`` API closely follows the :py:class:`Dataset` and |
| 35 | +:py:class:`DataArray` API, e.g., for an index to support :py:meth:`DataArray.sel` it needs to |
| 36 | +implement :py:meth:`Index.sel`, to support :py:meth:`DataArray.stack` and :py:meth:`DataArray.unstack` it |
| 37 | +needs to implement :py:meth:`Index.stack` and :py:meth:`Index.unstack`, etc. |
| 38 | + |
| 39 | +Some guidelines and examples are given below. More details can be found in the |
| 40 | +documented :py:class:`Index` API. |
| 41 | + |
| 42 | +Minimal requirements |
| 43 | +-------------------- |
| 44 | + |
| 45 | +Every index must at least implement the :py:meth:`Index.from_variables` class |
| 46 | +method, which is used by Xarray to build a new index instance from one or more |
| 47 | +existing coordinates in a Dataset or DataArray. |
| 48 | + |
| 49 | +Since any collection of coordinates can be passed to that method (i.e., the |
| 50 | +number, order and dimensions of the coordinates are all arbitrary), it is the |
| 51 | +responsibility of the index to check the consistency and validity of those input |
| 52 | +coordinates. |
| 53 | + |
| 54 | +For example, :py:class:`~xarray.core.indexes.PandasIndex` accepts only one coordinate and |
| 55 | +:py:class:`~xarray.core.indexes.PandasMultiIndex` accepts one or more 1-dimensional coordinates that must all |
| 56 | +share the same dimension. Other, custom indexes need not have the same |
| 57 | +constraints, e.g., |
| 58 | + |
| 59 | +- a georeferenced raster index which only accepts two 1-d coordinates with |
| 60 | + distinct dimensions |
| 61 | +- a staggered grid index which takes coordinates with different dimension name |
| 62 | + suffixes (e.g., "_c" and "_l" for center and left) |
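
The validation described above can be sketched as follows. This is only an
illustration, not part of Xarray: ``SingleCoordIndex`` and its attributes are
hypothetical names, and the optional ``options`` keyword is an assumption made
so that the method tolerates being called with or without index build options,
since the exact signature may differ between Xarray versions.

.. code-block:: python

    import numpy as np
    import xarray as xr
    from xarray import Index


    class SingleCoordIndex(Index):
        """Hypothetical index that accepts exactly one 1-d coordinate."""

        def __init__(self, name, dim, values):
            self.name = name
            self.dim = dim
            self.values = values

        @classmethod
        def from_variables(cls, variables, *, options=None):
            # Any collection of coordinates may be passed in, so the index
            # has to check the input itself.
            if len(variables) != 1:
                raise ValueError(
                    f"expected exactly one coordinate, got {len(variables)}"
                )

            ((name, var),) = variables.items()

            if var.ndim != 1:
                raise ValueError(
                    f"coordinate {name!r} must be 1-dimensional, got {var.ndim} dimensions"
                )

            return cls(name, var.dims[0], np.asarray(var.values))

Raising an explicit exception rather than using ``assert`` keeps the check
active when Python runs with optimizations enabled and gives users a clearer
error message.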
| 63 | + |
| 64 | +Optional requirements |
| 65 | +--------------------- |
| 66 | + |
| 67 | +Pretty much everything else is optional. Depending on the method, in the absence |
| 68 | +of a (re)implementation, an index will either raise a ``NotImplementedError`` |
| 69 | +or won't do anything specific (just drop, pass or copy itself |
| 70 | +from/to the resulting Dataset or DataArray). |
| 71 | + |
| 72 | +For example, you can just skip re-implementing :py:meth:`Index.rename` if there |
| 73 | +is no internal attribute or object to rename according to the new desired |
| 74 | +coordinate or dimension names. In the case of ``PandasIndex``, we rename the |
| 75 | +underlying ``pandas.Index`` object and/or update the ``PandasIndex.dim`` |
| 76 | +attribute since the associated dimension name has been changed. |
| 77 | + |
| 78 | +Wrap index data as coordinate data |
| 79 | +---------------------------------- |
| 80 | + |
| 81 | +In some cases it is possible to reuse the index's underlying object or structure |
| 82 | +as coordinate data and hence avoid data duplication. |
| 83 | + |
| 84 | +For ``PandasIndex`` and ``PandasMultiIndex``, we |
| 85 | +leverage the fact that ``pandas.Index`` objects expose some array-like API. In |
| 86 | +Xarray we use some wrappers around those underlying objects as a thin |
| 87 | +compatibility layer to preserve dtypes, handle explicit and n-dimensional |
| 88 | +indexing, etc. |
| 89 | + |
| 90 | +Other structures like tree-based indexes (e.g., kd-tree) may differ too much |
| 91 | +from arrays to be reused as coordinate data. |
| 92 | + |
| 93 | +If the index data can be reused as coordinate data, the ``Index`` subclass |
| 94 | +should implement :py:meth:`Index.create_variables`. This method accepts a |
| 95 | +dictionary of variable names as keys and :py:class:`Variable` objects as values (used for propagating |
| 96 | +variable metadata) and should return a dictionary of new :py:class:`Variable` or |
| 97 | +:py:class:`IndexVariable` objects. |
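
A minimal sketch of such a method is shown below, assuming a hypothetical
``ArrayBackedIndex`` that stores its labels in a plain NumPy array under the
illustrative attributes ``name``, ``dim`` and ``values``. It copies the
attributes of the input variable, if any, onto the variable it returns.

.. code-block:: python

    import numpy as np
    import xarray as xr
    from xarray import Index


    class ArrayBackedIndex(Index):
        """Hypothetical index whose labels live in a NumPy array."""

        def __init__(self, name, dim, values):
            self.name = name
            self.dim = dim
            self.values = np.asarray(values)

        def create_variables(self, variables=None):
            # Propagate metadata (attrs) from the existing coordinate, if given.
            attrs = {}
            if variables is not None and self.name in variables:
                attrs = dict(variables[self.name].attrs)

            # Reuse the index's own array as the coordinate data.
            return {self.name: xr.Variable(self.dim, self.values, attrs=attrs)}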
| 98 | + |
| 99 | +Data selection |
| 100 | +-------------- |
| 101 | + |
| 102 | +For an index to support label-based selection, it needs to at least implement |
| 103 | +:py:meth:`Index.sel`. This method accepts a dictionary of labels where the keys |
| 104 | +are coordinate names (already filtered for the current index) and the values can |
| 105 | +be pretty much anything (e.g., a slice, a tuple, a list, a numpy array, a |
| 106 | +:py:class:`Variable` or a :py:class:`DataArray`). It is the responsibility of |
| 107 | +the index to properly handle those input labels. |
| 108 | + |
| 109 | +:py:meth:`Index.sel` must return an instance of :py:class:`IndexSelResult`. The |
| 110 | +latter is a small data class that holds positional indexers (indices) and that |
| 111 | +may also hold new variables, new indexes, names of variables or indexes to drop, |
| 112 | +names of dimensions to rename, etc. For example, this is useful in the case of |
| 113 | +``PandasMultiIndex`` as it allows Xarray to convert it into a single ``PandasIndex`` |
| 114 | +when only one level remains after the selection. |
| 115 | + |
| 116 | +The :py:class:`IndexSelResult` class is also used to merge results from label-based |
| 117 | +selection performed by different indexes. Note that it is now possible to have |
| 118 | +two distinct indexes for two 1-d coordinates sharing the same dimension, but it |
| 119 | +is not currently possible to use those two indexes in the same call to |
| 120 | +:py:meth:`Dataset.sel`. |
| 121 | + |
| 122 | +Optionally, the index may also implement :py:meth:`Index.isel`. In the case of |
| 123 | +``PandasIndex`` we use it to create a new index object by just indexing the |
| 124 | +underlying ``pandas.Index`` object. In other cases this may not be possible, |
| 125 | +e.g., a kd-tree object may not be easily indexed. If ``Index.isel()`` is not |
| 126 | +implemented, the index is just dropped in the DataArray or Dataset resulting |
| 127 | +from the selection. |
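
As a rough illustration of the ``sel`` contract, the sketch below implements
exact-match selection of a single scalar label. ``ExactMatchIndex`` and its
attributes are hypothetical, and slices or array-like labels are deliberately
left unsupported to keep the example short.

.. code-block:: python

    import numpy as np
    from xarray import Index
    from xarray.core.indexing import IndexSelResult


    class ExactMatchIndex(Index):
        """Hypothetical index supporting exact-match scalar selection only."""

        def __init__(self, name, dim, values):
            self.name = name
            self.dim = dim
            self.values = np.asarray(values)

        def sel(self, labels):
            # ``labels`` only contains entries for this index's coordinates.
            label = labels[self.name]

            if not np.isscalar(label):
                raise NotImplementedError("only scalar labels are supported here")

            positions = np.flatnonzero(self.values == label)
            if positions.size == 0:
                raise KeyError(label)

            # Map the dimension name to a positional (integer) indexer.
            return IndexSelResult({self.dim: int(positions[0])})

Returning a scalar integer indexer means the dimension is dropped from the
result, in the same way as ``isel`` with an integer.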
| 128 | + |
| 129 | +Alignment |
| 130 | +--------- |
| 131 | + |
| 132 | +For an index to support alignment, it needs to implement: |
| 133 | + |
| 134 | +- :py:meth:`Index.equals`, which compares the index with another index and |
| 135 | + returns either ``True`` or ``False`` |
| 136 | +- :py:meth:`Index.join`, which combines the index with another index and returns |
| 137 | + a new Index object |
| 138 | +- :py:meth:`Index.reindex_like`, which queries the index with another index and |
| 139 | + returns positional indexers that are used to re-index Dataset or DataArray |
| 140 | + variables along one or more dimensions |
| 141 | + |
| 142 | +Xarray ensures that those three methods are called with an index of the same |
| 143 | +type as argument. |
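
The three methods could look roughly like the sketch below, for a hypothetical
``LabelIndex`` that keeps unique labels in a NumPy array. Only ``"inner"`` and
``"outer"`` joins are handled, ``-1`` is used here to mark labels that are
missing from the index, and the exact method signatures expected by Xarray may
differ between versions.

.. code-block:: python

    import numpy as np
    from xarray import Index


    class LabelIndex(Index):
        """Hypothetical 1-d index over unique labels."""

        def __init__(self, dim, values):
            self.dim = dim
            self.values = np.asarray(values)

        def equals(self, other):
            return self.dim == other.dim and np.array_equal(self.values, other.values)

        def join(self, other, how="inner"):
            if how == "inner":
                values = np.intersect1d(self.values, other.values)
            elif how == "outer":
                values = np.union1d(self.values, other.values)
            else:
                raise NotImplementedError(f"join method {how!r} is not supported")
            return type(self)(self.dim, values)

        def reindex_like(self, other):
            # Position of each label of ``other`` in this index, or -1 if absent.
            lookup = {label: pos for pos, label in enumerate(self.values.tolist())}
            indexer = np.array(
                [lookup.get(label, -1) for label in other.values.tolist()],
                dtype=np.intp,
            )
            return {self.dim: indexer}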
| 144 | + |
| 145 | +Meta-indexes |
| 146 | +------------ |
| 147 | + |
| 148 | +Nothing prevents writing a custom Xarray index that itself encapsulates other |
| 149 | +Xarray index(es). We call such an index a "meta-index". |
| 150 | + |
| 151 | +Here is a small example of a meta-index for geospatial, raster datasets (i.e., |
| 152 | +regularly spaced 2-dimensional data) that internally relies on two |
| 153 | +``PandasIndex`` instances for the x and y dimensions respectively: |
| 154 | + |
| 155 | +.. code-block:: python |
| 156 | +
|
| 157 | + from xarray import Index |
| 158 | + from xarray.core.indexes import PandasIndex |
| 159 | + from xarray.core.indexing import merge_sel_results |
| 160 | +
|
| 161 | +
|
| 162 | + class RasterIndex(Index): |
| 163 | + def __init__(self, xy_indexes): |
| 164 | + assert len(xy_indexes) == 2 |
| 165 | +
|
| 166 | + # must have two distinct dimensions |
| 167 | + dim = [idx.dim for idx in xy_indexes.values()] |
| 168 | + assert dim[0] != dim[1] |
| 169 | +
|
| 170 | + self._xy_indexes = xy_indexes |
| 171 | +
|
| 172 | + @classmethod |
| 173 | + def from_variables(cls, variables): |
| 174 | + assert len(variables) == 2 |
| 175 | +
|
| 176 | + xy_indexes = { |
| 177 | + k: PandasIndex.from_variables({k: v}) for k, v in variables.items() |
| 178 | + } |
| 179 | +
|
| 180 | + return cls(xy_indexes) |
| 181 | +
|
| 182 | + def create_variables(self, variables): |
| 183 | + idx_variables = {} |
| 184 | +
|
| 185 | + for index in self._xy_indexes.values(): |
| 186 | + idx_variables.update(index.create_variables(variables)) |
| 187 | +
|
| 188 | + return idx_variables |
| 189 | +
|
| 190 | + def sel(self, labels): |
| 191 | + results = [] |
| 192 | +
|
| 193 | + for k, index in self._xy_indexes.items(): |
| 194 | + if k in labels: |
| 195 | + results.append(index.sel({k: labels[k]})) |
| 196 | +
|
| 197 | + return merge_sel_results(results) |
| 198 | +
|
| 199 | +
|
| 200 | +This basic index only supports label-based selection. Providing a full-featured |
| 201 | +index by implementing the other ``Index`` methods should be pretty |
| 202 | +straightforward for this example, though. |
| 203 | + |
| 204 | +This example is also not very useful unless we add some extra functionality on |
| 205 | +top of the two encapsulated ``PandasIndex`` objects, such as a coordinate |
| 206 | +reference system. |
| 207 | + |
| 208 | +How to use a custom index |
| 209 | +------------------------- |
| 210 | + |
| 211 | +You can use :py:meth:`Dataset.set_xindex` or :py:meth:`DataArray.set_xindex` to assign a |
| 212 | +custom index to a Dataset or DataArray, e.g., using the ``RasterIndex`` above: |
| 213 | + |
| 214 | +.. code-block:: python |
| 215 | +
|
| 216 | + import numpy as np |
| 217 | + import xarray as xr |
| 218 | +
|
| 219 | + da = xr.DataArray( |
| 220 | + np.random.uniform(size=(100, 50)), |
| 221 | + coords={"x": ("x", np.arange(50)), "y": ("y", np.arange(100))}, |
| 222 | + dims=("y", "x"), |
| 223 | + ) |
| 224 | +
|
| 225 | + # Xarray creates default indexes for the 'x' and 'y' coordinates |
| 226 | + # we first need to explicitly drop them |
| 227 | + da = da.drop_indexes(["x", "y"]) |
| 228 | +
|
| 229 | + # Build a RasterIndex from the 'x' and 'y' coordinates |
| 230 | + da_raster = da.set_xindex(["x", "y"], RasterIndex) |
| 231 | +
|
| 232 | + # RasterIndex now takes care of label-based selection |
| 233 | + selected = da_raster.sel(x=10, y=slice(20, 50)) |