| 1 | +.. _internal design: |
| 2 | + |
| 3 | +Internal Design |
| 4 | +=============== |
| 5 | + |
| 6 | +This page gives an overview of the internal design of xarray. |
| 7 | + |
| 8 | +In total, the Xarray project defines four key data structures. |
| 9 | +In order of increasing complexity, they are: |
| 10 | + |
| 11 | +- :py:class:`xarray.Variable`, |
| 12 | +- :py:class:`xarray.DataArray`, |
| 13 | +- :py:class:`xarray.Dataset`, |
| 14 | +- :py:class:`datatree.DataTree`. |
| 15 | + |
| 16 | +The user guide lists only :py:class:`xarray.DataArray` and :py:class:`xarray.Dataset`, |
| 17 | +but :py:class:`~xarray.Variable` is the fundamental object internally, |
| 18 | +and :py:class:`~datatree.DataTree` is a natural generalisation of :py:class:`xarray.Dataset`. |
| 19 | + |
| 20 | +.. note:: |
| 21 | + |
| 22 | + Our :ref:`roadmap` includes plans both to document :py:class:`~xarray.Variable` as fully public API, |
| 23 | + and to merge the `xarray-datatree <https://github.com/xarray-contrib/datatree>`_ package into xarray's main repository. |
| 24 | + |
| 25 | +Internally, private :ref:`lazy indexing classes <internal design.lazy indexing>` are used to avoid loading more data than necessary, |
| 26 | +and flexible index classes (derived from :py:class:`~xarray.indexes.Index`) provide performant label-based lookups. |
| 27 | + |
| 28 | + |
| 29 | +.. _internal design.data structures: |
| 30 | + |
| 31 | +Data Structures |
| 32 | +--------------- |
| 33 | + |
| 34 | +The :ref:`data structures` page in the user guide explains the basics and concentrates on user-facing behavior, |
| 35 | +whereas this section explains how xarray's data structure classes actually work internally. |
| 36 | + |
| 37 | + |
| 38 | +.. _internal design.data structures.variable: |
| 39 | + |
| 40 | +Variable Objects |
| 41 | +~~~~~~~~~~~~~~~~ |
| 42 | + |
| 43 | +The core internal data structure in xarray is the :py:class:`~xarray.Variable`, |
| 44 | +which is used as the basic building block behind xarray's |
| 45 | +:py:class:`~xarray.Dataset` and :py:class:`~xarray.DataArray` types. A |
| 46 | +:py:class:`~xarray.Variable` consists of: |
| 47 | + |
| 48 | +- ``dims``: A tuple of dimension names. |
| 49 | +- ``data``: The N-dimensional array (typically a NumPy or Dask array) storing |
| 50 | + the Variable's data. It must have the same number of dimensions as the length |
| 51 | + of ``dims``. |
| 52 | +- ``attrs``: An ordered dictionary of metadata associated with this array. By |
| 53 | + convention, xarray's built-in operations never use this metadata. |
| 54 | +- ``encoding``: Another ordered dictionary used to store information about how |
| 55 | + this variable's data is represented on disk. See :ref:`io.encoding` for more |
| 56 | + details. |
| 57 | + |
| 58 | +:py:class:`~xarray.Variable` has an interface similar to NumPy arrays, but extended to make use |
| 59 | +of named dimensions. For example, it uses ``dim`` in preference to an ``axis`` |
| 60 | +argument for methods like ``mean``, and supports :ref:`compute.broadcasting`. |
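
As a minimal sketch of the components and interface described above (the dimension names, values, and attributes are made up for illustration), a ``Variable`` can be built directly, reduced by dimension name, and broadcast against another ``Variable`` by name:

```python
import numpy as np
import xarray as xr

# A Variable bundles dimension names, an array, and metadata.
var = xr.Variable(
    dims=("x", "y"),
    data=np.arange(6.0).reshape(2, 3),
    attrs={"units": "m"},  # illustrative metadata only
)

# Reductions take a dimension name rather than a positional axis.
mean_over_y = var.mean(dim="y")

# Broadcasting matches dimensions by name, not by position, so a
# Variable along a new dimension "z" combines without any reshaping.
other = xr.Variable(dims=("z",), data=np.array([10.0, 20.0]))
combined = var + other

print(mean_over_y.dims, combined.dims, combined.shape)
```

Note that nothing here carries coordinate labels: ``x``, ``y`` and ``z`` are only names for axes.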
| 61 | + |
| 62 | +However, unlike ``Dataset`` and ``DataArray``, the basic ``Variable`` does not |
| 63 | +include coordinate labels along each axis. |
| 64 | + |
| 65 | +:py:class:`~xarray.Variable` is public API, but because of its incomplete support for labeled |
| 66 | +data, it is mostly intended for advanced uses, such as in xarray itself, for |
| 67 | +writing new backends, or when creating custom indexes. |
| 68 | +You can access the variable objects that correspond to xarray objects via the (readonly) |
| 69 | +:py:attr:`Dataset.variables <xarray.Dataset.variables>` and |
| 70 | +:py:attr:`DataArray.variable <xarray.DataArray.variable>` attributes. |
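
For instance, the following sketch (with a made-up array named ``a``) retrieves the underlying variables from both kinds of object. The mapping returned by ``Dataset.variables`` is read-only, so the sketch only reads from it:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((2, 3)), dims=("x", "y"), name="a")
ds = da.to_dataset()

# The single Variable wrapped by the DataArray.
data_var = da.variable

# A read-only mapping from names to Variable objects.
all_vars = ds.variables

print(type(data_var).__name__, list(all_vars))
```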
| 71 | + |
| 72 | + |
| 73 | +.. _internal design.dataarray: |
| 74 | + |
| 75 | +DataArray Objects |
| 76 | +~~~~~~~~~~~~~~~~~ |
| 77 | + |
| 78 | +The simplest data structure used by most users is :py:class:`~xarray.DataArray`. |
| 79 | +A :py:class:`~xarray.DataArray` is a composite object consisting of multiple |
| 80 | +:py:class:`~xarray.core.variable.Variable` objects which store related data. |
| 81 | + |
| 82 | +A single :py:class:`~xarray.Variable` is referred to as the "data variable", and is stored under the :py:attr:`~xarray.DataArray.variable` attribute. |
| 83 | +A :py:class:`~xarray.DataArray` inherits all of the properties of this data variable, i.e. ``dims``, ``data``, ``attrs`` and ``encoding``, |
| 84 | +all of which are implemented by forwarding on to the underlying ``Variable`` object. |
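
A small sketch of this forwarding, using made-up data: reading a property on the ``DataArray`` gives the same result as reading it on the wrapped variable, and a change made through one is visible through the other.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("x", "y"),
    attrs={"units": "m"},  # illustrative metadata only
)

# Each property is read from the wrapped data variable.
same_dims = da.dims == da.variable.dims
same_data = np.array_equal(da.data, da.variable.data)

# Writing through the DataArray updates the Variable's metadata.
da.attrs["units"] = "km"
units_on_variable = da.variable.attrs["units"]

print(same_dims, same_data, units_on_variable)
```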
| 85 | + |
| 86 | +In addition, a :py:class:`~xarray.DataArray` stores further ``Variable`` objects in a dict under the private ``_coords`` attribute, |
| 87 | +each of which is referred to as a "coordinate variable". These coordinate variable objects are only allowed to have ``dims`` that are a subset of the data variable's ``dims``, |
| 88 | +and each dim has a specific length. This means that the full :py:attr:`~xarray.DataArray.sizes` of the dataarray can be represented by a dictionary mapping dimension names to integer sizes. |
| 89 | +The underlying data variable has exactly these sizes, and each attached coordinate variable has sizes that are a subset of the data variable's sizes. |
| 90 | +Another way of saying this is that all coordinate variables must be "alignable" with the data variable. |
| 91 | + |
| 92 | +When a coordinate is accessed by the user (e.g. via the dict-like :py:meth:`~xarray.DataArray.__getitem__` syntax), |
| 93 | +a new ``DataArray`` is constructed by finding all coordinate variables that have compatible dimensions and re-attaching them before the result is returned. |
| 94 | +This is why most users never see the ``Variable`` class underlying each coordinate variable: it is always promoted to a ``DataArray`` before returning. |
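
The promotion can be observed through the public API. In this sketch the coordinate names ``x``, ``y`` and ``label`` are made up; accessing ``x`` returns a ``DataArray`` that carries only the coordinates whose dimensions fit inside ``("x",)``:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "label": ("x", ["p", "q"]),  # non-index coordinate along "x"
    },
)

# Dict-like access returns a DataArray, not the bare Variable.
x_coord = da["x"]

# Only coordinates with compatible dims are re-attached; "y" is not.
attached = set(x_coord.coords)

print(type(x_coord).__name__, sorted(attached))
```

The ``Variable`` itself remains reachable via ``x_coord.variable`` when it is needed.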
| 95 | + |
| 96 | +Lookups are performed by special :py:class:`~xarray.indexes.Index` objects, which are stored in a dict under the private ``_indexes`` attribute. |
| 97 | +Indexes must be associated with one or more coordinates, and essentially act by translating a query given in physical coordinate space |
| 98 | +(typically via the :py:meth:`~xarray.DataArray.sel` method) into a set of integer indices in array index space that can be used to index the underlying n-dimensional array-like ``data``. |
| 99 | +Indexing in array index space (typically performed via the :py:meth:`~xarray.DataArray.isel` method) does not require consulting an ``Index`` object. |
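
To make the two kinds of lookup concrete, here is a short sketch with made-up values. The label-based ``sel`` call goes through the index attached to ``x``, while the position-based ``isel`` call indexes the array directly; both select the same element here:

```python
import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0], dims="x", coords={"x": [10, 20, 30]})

# Coordinate space: the index for "x" translates label 20 to position 1.
by_label = da.sel(x=20)

# Array index space: position 1 is used as-is.
by_position = da.isel(x=1)

print(by_label.item(), by_position.item())
```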
| 100 | + |
| 101 | +Finally, a :py:class:`~xarray.DataArray` defines a :py:attr:`~xarray.DataArray.name` attribute, which refers to its data |
| 102 | +variable but is stored on the wrapping ``DataArray`` class. |
| 103 | +The ``name`` attribute is primarily used when one or more :py:class:`~xarray.DataArray` objects are promoted into a :py:class:`~xarray.Dataset` |
| 104 | +(e.g. via :py:meth:`~xarray.DataArray.to_dataset`). |
| 105 | +Note that the underlying :py:class:`~xarray.Variable` objects are all unnamed, so they can always be referred to uniquely via a |
| 106 | +dict-like mapping. |
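
As an illustration (the name ``temperature`` is made up), the ``name`` held by the ``DataArray`` becomes the key under which its data variable is stored once it is promoted to a ``Dataset``:

```python
import xarray as xr

da = xr.DataArray([1, 2, 3], dims="x", name="temperature")

# The DataArray's name is used as the key for its data variable.
ds = da.to_dataset()

print(list(ds.data_vars))
```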
| 107 | + |
| 108 | +.. _internal design.dataset: |
| 109 | + |
| 110 | +Dataset Objects |
| 111 | +~~~~~~~~~~~~~~~ |
| 112 | + |
| 113 | +The :py:class:`~xarray.Dataset` class is a generalization of the :py:class:`~xarray.DataArray` class that can hold multiple data variables. |
| 114 | +Internally all data variables and coordinate variables are stored under a single ``variables`` dict, and coordinates are |
| 115 | +specified by storing their names in a private ``_coord_names`` set. |
| 116 | + |
| 117 | +The dataset's ``dims`` are the set of all dims present across any variable, but (similar to in dataarrays) coordinate |
| 118 | +variables cannot have a dimension that is not present on any data variable. |
| 119 | + |
| 120 | +When a data variable or coordinate variable is accessed, a new ``DataArray`` is again constructed from all compatible |
| 121 | +coordinates before returning. |
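
The following sketch, with made-up variable names ``a`` and ``b``, shows this layout through the public API: the ``variables`` mapping holds data variables and coordinates side by side, and selecting ``a`` returns a ``DataArray`` carrying only the coordinates compatible with its dimensions.

```python
import xarray as xr

ds = xr.Dataset(
    data_vars={
        "a": ("x", [1.0, 2.0]),
        "b": ("y", [1.0, 2.0, 3.0]),
    },
    coords={"x": [10, 20], "y": [0, 1, 2]},
)

# Data variables and coordinate variables share one mapping.
names = set(ds.variables)

# Selecting "a" builds a DataArray with only the "x" coordinate attached.
a = ds["a"]

print(sorted(names), sorted(a.coords), dict(ds.sizes))
```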
| 122 | + |
| 123 | +.. _internal design.subclassing: |
| 124 | + |
| 125 | +.. note:: |
| 126 | + |
| 127 | + Selecting a variable from a ``DataArray`` or ``Dataset`` internally wraps the |
| 128 | + ``Variable`` object back up into a ``DataArray``/``Dataset``, which is the primary reason :ref:`we recommend against subclassing <internals.accessors.composition>` |
| 129 | + Xarray objects. The main problem it creates is that we currently cannot easily guarantee that, for example, selecting |
| 130 | + a coordinate variable from your ``SubclassedDataArray`` would return an instance of ``SubclassedDataArray`` instead |
| 131 | + of just an :py:class:`xarray.DataArray`. See `GH issue <https://github.com/pydata/xarray/issues/3980>`_ for more details. |
| 132 | + |
| 133 | +.. _internal design.lazy indexing: |
| 134 | + |
| 135 | +Lazy Indexing Classes |
| 136 | +--------------------- |
| 137 | + |
| 138 | +TODO |