Commit 198f67b

add page on internal design
1 parent a47ff4e commit 198f67b

3 files changed

+142
-35
lines changed

doc/internals/index.rst

+4-4
@@ -1,6 +1,6 @@
11
.. _internals:
22

3-
xarray Internals
3+
Xarray Internals
44
================
55

66
Xarray builds upon two of the foundational libraries of the scientific Python
@@ -11,15 +11,15 @@ compiled code to :ref:`optional dependencies<installing>`.
1111
The pages in this section are intended for:
1212

1313
* Contributors to xarray who wish to better understand some of the internals,
14-
* Developers who wish to extend xarray with domain-specific logic, perhaps to support a new scientific community of users,
15-
* Developers who wish to interface xarray with their existing tooling, e.g. by creating a plugin for reading a new file format, or wrapping a custom array type.
14+
* Developers from other fields who wish to extend xarray with domain-specific logic, perhaps to support a new scientific community of users,
15+
* Developers of other packages who wish to interface xarray with their existing tools, e.g. by creating a plugin for reading a new file format, or wrapping a custom array type.
1616

1717

1818
.. toctree::
1919
:maxdepth: 2
2020
:hidden:
2121

22-
variable-objects
22+
internal-design
2323
duck-arrays-integration
2424
chunked-arrays
2525
extending-xarray

doc/internals/internal-design.rst

+138
@@ -0,0 +1,138 @@
1+
.. _internal design:
2+
3+
Internal Design
4+
===============
5+
6+
This page gives an overview of the internal design of xarray.
7+
8+
In total, the Xarray project defines four key data structures.
9+
In order of increasing complexity, they are:
10+
11+
- :py:class:`xarray.Variable`,
12+
- :py:class:`xarray.DataArray`,
13+
- :py:class:`xarray.Dataset`,
14+
- :py:class:`datatree.DataTree`.
15+
16+
The user guide lists only :py:class:`xarray.DataArray` and :py:class:`xarray.Dataset`,
17+
but :py:class:`~xarray.Variable` is the fundamental object internally,
18+
and :py:class:`~datatree.DataTree` is a natural generalisation of :py:class:`xarray.Dataset`.
19+
20+
.. note::
21+
22+
Our :ref:`roadmap` includes plans both to document :py:class:`~xarray.Variable` as fully public API,
23+
and to merge the `xarray-datatree <https://github.com/xarray-contrib/datatree>`_ package into xarray's main repository.
24+
25+
Internally, private :ref:`lazy indexing classes <internal design.lazy indexing>` are used to avoid loading more data than necessary,
26+
and flexible index classes (derived from :py:class:`~xarray.indexes.Index`) provide performant label-based lookups.
27+
28+
29+
.. _internal design.data structures:
30+
31+
Data Structures
32+
---------------
33+
34+
The :ref:`data structures` page in the user guide explains the basics and concentrates on user-facing behavior,
35+
whereas this section explains how xarray's data structure classes actually work internally.
36+
37+
38+
.. _internal design.data structures.variable:
39+
40+
Variable Objects
41+
~~~~~~~~~~~~~~~~
42+
43+
The core internal data structure in xarray is the :py:class:`~xarray.Variable`,
44+
which is used as the basic building block behind xarray's
45+
:py:class:`~xarray.Dataset` and :py:class:`~xarray.DataArray` types. A
46+
:py:class:`~xarray.Variable` consists of:
47+
48+
- ``dims``: A tuple of dimension names.
49+
- ``data``: The N-dimensional array (typically a NumPy or Dask array) storing
50+
the Variable's data. It must have the same number of dimensions as the length
51+
of ``dims``.
52+
- ``attrs``: An ordered dictionary of metadata associated with this array. By
53+
convention, xarray's built-in operations never use this metadata.
54+
- ``encoding``: Another ordered dictionary used to store information about how
55+
this variable's data is represented on disk. See :ref:`io.encoding` for more
56+
details.
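
As a rough illustration of these four components, the sketch below builds a small ``Variable`` by hand. The dimension names, the values, and the ``units`` and ``dtype`` entries are made up for the example.

```python
import numpy as np
import xarray as xr

# A Variable pairs an N-dimensional array with one name per axis,
# plus two dictionaries of metadata.
var = xr.Variable(
    dims=("x", "y"),
    data=np.arange(6.0).reshape(2, 3),
    attrs={"units": "m"},           # free-form user metadata
    encoding={"dtype": "float32"},  # hints about the on-disk representation
)

print(var.dims)      # one name per array axis
print(var.shape)     # shape of the wrapped array
print(var.attrs)
print(var.encoding)
```

Note that the length of ``dims`` matches the number of dimensions of ``data``, as required above.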
57+
58+
:py:class:`~xarray.Variable` has an interface similar to NumPy arrays, but extended to make use
59+
of named dimensions. For example, it uses ``dim`` in preference to an ``axis``
60+
argument for methods like ``mean``, and supports :ref:`compute.broadcasting`.
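
A minimal sketch of both points, using invented dimension names and values: a reduction selects its axis by name, and arithmetic between variables broadcasts by dimension name rather than by axis position.

```python
import numpy as np
import xarray as xr

var = xr.Variable(("x", "y"), np.arange(6.0).reshape(2, 3))

# Reduce over the dimension called "x" instead of passing axis=0.
col_mean = var.mean(dim="x")
print(col_mean.dims)

# Two 1-D variables with different dimension names broadcast
# against each other into a 2-D result.
a = xr.Variable(("x",), np.array([1.0, 2.0]))
b = xr.Variable(("y",), np.array([10.0, 20.0, 30.0]))
total = a + b
print(total.dims, total.shape)
```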
61+
62+
However, unlike ``Dataset`` and ``DataArray``, the basic ``Variable`` does not
63+
include coordinate labels along each axis.
64+
65+
:py:class:`~xarray.Variable` is public API, but because of its incomplete support for labeled
66+
data, it is mostly intended for advanced uses, such as in xarray itself, for
67+
writing new backends, or when creating custom indexes.
68+
You can access the variable objects that correspond to xarray objects via the (readonly)
69+
:py:attr:`Dataset.variables <xarray.Dataset.variables>` and
70+
:py:attr:`DataArray.variable <xarray.DataArray.variable>` attributes.
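
For instance, the following sketch (with a made-up ``temperature`` array) reaches the underlying variables through those two attributes:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20]},
    name="temperature",
)

# The data variable wrapped by the DataArray.
print(type(da.variable))

# A Dataset exposes every variable (data and coordinate) in one mapping.
ds = da.to_dataset()
for name, variable in ds.variables.items():
    print(name, type(variable).__name__, variable.dims)
```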
71+
72+
73+
.. _internal design.dataarray:
74+
75+
DataArray Objects
76+
~~~~~~~~~~~~~~~~~
77+
78+
The simplest data structure used by most users is :py:class:`~xarray.DataArray`.
79+
A :py:class:`~xarray.DataArray` is a composite object consisting of multiple
80+
:py:class:`~xarray.Variable` objects which store related data.
81+
82+
A single :py:class:`~xarray.Variable` is referred to as the "data variable", and is stored under the :py:attr:`~xarray.DataArray.variable` attribute.
83+
A :py:class:`~xarray.DataArray` inherits all of the properties of this data variable, i.e. ``dims``, ``data``, ``attrs`` and ``encoding``,
84+
all of which are implemented by forwarding on to the underlying ``Variable`` object.
85+
86+
In addition, a :py:class:`~xarray.DataArray` holds further ``Variable`` objects in a dict under the private ``_coords`` attribute,
87+
each of which is referred to as a "Coordinate Variable". These coordinate variable objects are only allowed to have ``dims`` that are a subset of the data variable's ``dims``,
88+
and each dim has a specific length. This means that the full size of the dataarray can be represented by a dictionary mapping dimension names to integer sizes (exposed as :py:attr:`~xarray.DataArray.sizes`).
89+
The underlying data variable has this exact same size, and the attached coordinate variables have sizes which are some subset of the size of the data variable.
90+
Another way of saying this is that all coordinate variables must be "alignable" with the data variable.
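
The constraint can be seen in a small sketch. The coordinate names ``label`` and ``z`` are invented for the example; the second construction is expected to fail because ``z`` is not a dimension of the data variable.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20], "label": ("x", ["a", "b"])},
)

# Mapping from dimension name to length for the whole object.
print(dict(da.sizes))

# Every coordinate only uses dimensions of the data variable.
for name, coord in da.coords.items():
    print(name, coord.dims)

# A coordinate along a dimension the data does not have is rejected.
try:
    xr.DataArray(
        np.zeros((2, 3)),
        dims=("x", "y"),
        coords={"z": ("z", [1, 2])},
    )
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```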
91+
92+
When a coordinate is accessed by the user (e.g. via the dict-like :py:meth:`~xarray.DataArray.__getitem__` syntax),
93+
then a new ``DataArray`` is constructed by finding all coordinate variables that have compatible dimensions and re-attaching them before the result is returned.
94+
This is why most users never see the ``Variable`` class underlying each coordinate variable: it is always promoted to a ``DataArray`` before returning.
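
A short sketch of this promotion, using invented coordinate names: selecting the ``x`` coordinate returns a ``DataArray`` that carries along only the coordinates whose dimensions fit.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={
        "x": [10, 20],
        "y": [1, 2, 3],
        "label": ("x", ["a", "b"]),  # extra coordinate along "x"
    },
)

x = da["x"]

print(type(x).__name__)  # the coordinate comes back wrapped
print(sorted(x.coords))  # "y" is absent: its dimension is not on x
print(type(x.variable))  # the plain Variable is still underneath
```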
95+
96+
Lookups are performed by special :py:class:`~xarray.indexes.Index` objects, which are stored in a dict under the private ``_indexes`` attribute.
97+
Indexes must be associated with one or more coordinates, and essentially act by translating a query given in physical coordinate space
98+
(typically via the :py:meth:`~xarray.DataArray.sel` method) into a set of integer indices in array index space that can be used to index the underlying n-dimensional array-like ``data``.
99+
Indexing in array index space (typically performed via the :py:meth:`~xarray.DataArray.isel` method) does not require consulting an ``Index`` object.
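
The difference between the two kinds of lookup can be sketched as follows, with made-up coordinate values. The second array has no coordinates and therefore no index, yet positional indexing still works.

```python
import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0], dims="x", coords={"x": [10, 20, 30]})

# Label-based lookup: the index for "x" translates the label 20
# into integer position 1.
by_label = da.sel(x=20)

# Position-based lookup: no index is needed.
by_position = da.isel(x=1)

print(by_label.item(), by_position.item())

# Without coordinates there is no index to consult.
no_coords = xr.DataArray([1.0, 2.0, 3.0], dims="x")
print("x" in no_coords.indexes)
print(no_coords.isel(x=1).item())
```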
100+
101+
Finally, a :py:class:`~xarray.DataArray` defines a :py:attr:`~xarray.DataArray.name` attribute, which refers to its data
102+
variable but is stored on the wrapping ``DataArray`` class.
103+
The ``name`` attribute is primarily used when one or more :py:class:`~xarray.DataArray` objects are promoted into a :py:class:`~xarray.Dataset`
104+
(e.g. via :py:meth:`~xarray.DataArray.to_dataset`).
105+
Note that the underlying :py:class:`~xarray.Variable` objects are all unnamed, so they can always be referred to uniquely via a
106+
dict-like mapping.
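
As an illustration of the role of ``name`` (the names ``temperature`` and ``pressure`` are arbitrary): a named array becomes a data variable under that name, while an unnamed array needs an explicit name before it can be promoted.

```python
import xarray as xr

named = xr.DataArray([1, 2, 3], dims="x", name="temperature")
ds = named.to_dataset()
print(list(ds.data_vars))

unnamed = xr.DataArray([1, 2, 3], dims="x")
print(unnamed.name)

# Promotion fails without a name to use as the dict key...
try:
    unnamed.to_dataset()
    failed = False
except ValueError:
    failed = True
print(failed)

# ...but succeeds once a name is supplied.
ds2 = unnamed.to_dataset(name="pressure")
print(list(ds2.data_vars))
```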
107+
108+
.. _internal design.dataset:
109+
110+
Dataset Objects
111+
~~~~~~~~~~~~~~~
112+
113+
The :py:class:`~xarray.Dataset` class is a generalization of the :py:class:`~xarray.DataArray` class that can hold multiple data variables.
114+
Internally all data variables and coordinate variables are stored under a single ``variables`` dict, and coordinates are
115+
specified by storing their names in a private ``_coord_names`` set.
116+
117+
The dataset's ``dims`` are the set of all dims present across any variable, but (similar to in dataarrays) coordinate
118+
variables cannot have a dimension that is not present on any data variable.
119+
120+
When a data variable or coordinate variable is accessed, a new ``DataArray`` is again constructed from all compatible
121+
coordinates before returning.
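
The layout described in this section can be observed through public attributes. The sketch below uses invented variable names ``a`` and ``b``; it does not touch the private attributes themselves.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.zeros((2, 3))),
        "b": ("x", np.ones(2)),
    },
    coords={"x": [10, 20], "y": [1, 2, 3]},
)

# Data variables and coordinate variables share one mapping...
print(sorted(ds.variables))
# ...and the coordinate names decide which is which.
print(sorted(ds.coords), sorted(ds.data_vars))

# Dimension sizes are collected across all variables.
print(dict(ds.sizes))

# Selecting "b" builds a DataArray with only the compatible coordinates.
b = ds["b"]
print(type(b).__name__, sorted(b.coords))
```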
122+
123+
.. _internal design.subclassing:
124+
125+
.. note::
126+
127+
The way that selecting a variable from a ``DataArray`` or ``Dataset`` actually involves internally wrapping the
128+
``Variable`` object back up into a ``DataArray``/``Dataset`` is the primary reason :ref:`we recommend against subclassing <internals.accessors.composition>`
129+
Xarray objects. The main problem it creates is that we currently cannot easily guarantee that for example selecting
130+
a coordinate variable from your ``SubclassedDataArray`` would return an instance of ``SubclassedDataArray`` instead
131+
of just an :py:class:`xarray.DataArray`. See `GH issue <https://github.com/pydata/xarray/issues/3980>`_ for more details.
132+
133+
.. _internal design.lazy indexing:
134+
135+
Lazy Indexing Classes
136+
---------------------
137+
138+
TODO

doc/internals/variable-objects.rst

-31
This file was deleted.
