
Commit 7084df0

WIP: Optional indexes (no more default coordinates given by range(n)) (#1017)
* Indexes are now optional
* add issue link on optional-indexes to what's new
* Fix test failure on windows
* use shared dimension summary in formatting.py
* missing coordinates appear in the repr
* Mark missing coords with "o" in the repr
1 parent 1615a0f commit 7084df0

30 files changed: +1072 -721 lines changed

doc/api.rst

+2
Original file line number | Diff line number | Diff line change
@@ -47,6 +47,7 @@ Attributes
4747
Dataset.coords
4848
Dataset.attrs
4949
Dataset.indexes
50+
Dataset.get_index
5051

5152
Dictionary interface
5253
--------------------
@@ -196,6 +197,7 @@ Attributes
196197
DataArray.attrs
197198
DataArray.encoding
198199
DataArray.indexes
200+
DataArray.get_index
199201

200202
**ndarray attributes**:
201203
:py:attr:`~DataArray.ndim`

doc/computation.rst

+11-6
@@ -196,7 +196,9 @@ This means, for example, that you always subtract an array from its transpose:
196196
You can explicitly broadcast xarray data structures by using the
197197
:py:func:`~xarray.broadcast` function:
198198

199-
a2, b2 = xr.broadcast(a, b2)
199+
.. ipython:: python
200+
201+
a2, b2 = xr.broadcast(a, b)
200202
a2
201203
b2
202204
@@ -215,15 +217,18 @@ operations. The default result of a binary operation is by the *intersection*
215217

216218
.. ipython:: python
217219
218-
arr + arr[:1]
220+
arr = xr.DataArray(np.arange(3), [('x', range(3))])
221+
arr + arr[:-1]
219222
220-
If the result would be empty, an error is raised instead:
223+
If coordinate values for a dimension are missing on either argument, all
224+
matching dimensions must have the same size:
221225

222-
.. ipython::
226+
.. ipython:: python
223227
224228
@verbatim
225-
In [1]: arr[:2] + arr[2:]
226-
ValueError: no overlapping labels for some dimensions: ['x']
229+
In [1]: arr + xr.DataArray([1, 2], dims='x')
230+
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension size(s) {2} than the size of the aligned dimension labels: 3
231+
227232
228233
However, one can explicitly change this default automatic alignment type ("inner")
229234
via :py:func:`~xarray.set_options()` in context manager:

doc/data-structures.rst

+35-20
@@ -67,18 +67,33 @@ in with default values:
6767
6868
xr.DataArray(data)
6969
70-
As you can see, dimensions and coordinate arrays corresponding to each
71-
dimension are always present. This behavior is similar to pandas, which fills
72-
in index values in the same way.
70+
As you can see, dimension names are always present in the xarray data model: if
71+
you do not provide them, defaults of the form ``dim_N`` will be created.
72+
73+
.. note::
74+
75+
Prior to xarray v0.9, coordinates corresponding to dimensions were *also*
76+
always present in xarray: xarray would create default coordinates of the form
77+
``range(dim_size)`` if coordinates were not supplied explicitly. This is no
78+
longer the case.
7379

7480
Coordinates can take the following forms:
7581

76-
- A list of ``(dim, ticks[, attrs])`` pairs with length equal to the number of dimensions
77-
- A dictionary of ``{coord_name: coord}`` where the values are each a scalar value,
78-
a 1D array or a tuple. Tuples are be in the same form as the above, and
79-
multiple dimensions can be supplied with the form ``(dims, data[, attrs])``.
80-
Supplying as a tuple allows other coordinates than those corresponding to
81-
dimensions (more on these later).
82+
- A list of values with length equal to the number of dimensions, providing
83+
coordinate labels for each dimension. Each value must be of one of the
84+
following forms:
85+
86+
* A :py:class:`~xarray.DataArray` or :py:class:`~xarray.Variable`
87+
* A tuple of the form ``(dims, data[, attrs])``, which is converted into
88+
arguments for :py:class:`~xarray.Variable`
89+
* A pandas object or scalar value, which is converted into a ``DataArray``
90+
* A 1D array or list, which is interpreted as values for a one dimensional
91+
coordinate variable along the same dimension as its name
92+
93+
- A dictionary of ``{coord_name: coord}`` where values are of the same form
94+
as the list. Supplying coordinates as a dictionary allows other coordinates
95+
than those corresponding to dimensions (more on these later). If you supply
96+
``coords`` as a dictionary, you must explicitly provide ``dims``.
8297

8398
As a list of tuples:
8499

@@ -128,7 +143,7 @@ Let's take a look at the important properties on our array:
128143
foo.attrs
129144
print(foo.name)
130145
131-
You can even modify ``values`` inplace:
146+
You can modify ``values`` inplace:
132147

133148
.. ipython:: python
134149
@@ -228,15 +243,19 @@ Creating a Dataset
228243
To make an :py:class:`~xarray.Dataset` from scratch, supply dictionaries for any
229244
variables (``data_vars``), coordinates (``coords``) and attributes (``attrs``).
230245

231-
``data_vars`` are supplied as a dictionary with each key as the name of the variable and each
246+
- ``data_vars`` should be a dictionary with each key as the name of the variable and each
232247
value as one of:
233248

234-
- A :py:class:`~xarray.DataArray`
235-
- A tuple of the form ``(dims, data[, attrs])``
236-
- A pandas object
249+
* A :py:class:`~xarray.DataArray` or :py:class:`~xarray.Variable`
250+
* A tuple of the form ``(dims, data[, attrs])``, which is converted into
251+
arguments for :py:class:`~xarray.Variable`
252+
* A pandas object, which is converted into a ``DataArray``
253+
* A 1D array or list, which is interpreted as values for a one dimensional
254+
coordinate variable along the same dimension as its name
255+
256+
- ``coords`` should be a dictionary of the same form as ``data_vars``.
237257

238-
``coords`` are supplied as dictionary of ``{coord_name: coord}`` where the values are scalar values,
239-
arrays or tuples in the form of ``(dims, data[, attrs])``.
258+
- ``attrs`` should be a dictionary.
240259

241260
Let's create some fake data for the example we show above:
242261

@@ -257,10 +276,6 @@ Let's create some fake data for the example we show above:
257276
'reference_time': pd.Timestamp('2014-09-05')})
258277
ds
259278
260-
Notice that we did not explicitly include coordinates for the "x" or "y"
261-
dimensions, so they were filled in array of ascending integers of the proper
262-
length.
263-
264279
Here we pass :py:class:`xarray.DataArray` objects or a pandas object as values
265280
in the dictionary:
266281

doc/examples/quick-overview.rst

+37-13
@@ -23,7 +23,7 @@ array or list, with optional *dimensions* and *coordinates*:
2323
.. ipython:: python
2424
2525
xr.DataArray(np.random.randn(2, 3))
26-
data = xr.DataArray(np.random.randn(2, 3), [('x', ['a', 'b']), ('y', [-2, 0, 2])])
26+
data = xr.DataArray(np.random.randn(2, 3), coords={'x': ['a', 'b']}, dims=('x', 'y'))
2727
data
2828
2929
If you supply a pandas :py:class:`~pandas.Series` or
@@ -121,31 +121,55 @@ xarray supports grouped operations using a very similar API to pandas:
121121
data.groupby(labels).mean('y')
122122
data.groupby(labels).apply(lambda x: x - x.min())
123123
124-
Convert to pandas
125-
-----------------
124+
pandas
125+
------
126126

127-
A key feature of xarray is robust conversion to and from pandas objects:
127+
Xarray objects can be easily converted to and from pandas objects:
128128

129129
.. ipython:: python
130130
131-
data.to_series()
132-
data.to_pandas()
131+
series = data.to_series()
132+
series
133133
134-
Datasets and NetCDF
135-
-------------------
134+
# convert back
135+
series.to_xarray()
136136
137-
:py:class:`xarray.Dataset` is a dict-like container of ``DataArray`` objects that share
138-
index labels and dimensions. It looks a lot like a netCDF file:
137+
Datasets
138+
--------
139+
140+
:py:class:`xarray.Dataset` is a dict-like container of aligned ``DataArray``
141+
objects. You can think of it as a multi-dimensional generalization of the
142+
:py:class:`pandas.DataFrame`:
139143

140144
.. ipython:: python
141145
142-
ds = data.to_dataset(name='foo')
146+
ds = xr.Dataset({'foo': data, 'bar': ('x', [1, 2]), 'baz': np.pi})
143147
ds
144148
149+
Use dictionary indexing to pull out ``Dataset`` variables as ``DataArray``
150+
objects:
151+
152+
.. ipython:: python
153+
154+
ds['foo']
155+
156+
Variables in datasets can have different ``dtype`` and even different
157+
dimensions, but all dimensions are assumed to refer to points in the same shared
158+
coordinate system.
159+
145160
You can do almost everything you can do with ``DataArray`` objects with
146-
``Dataset`` objects if you prefer to work with multiple variables at once.
161+
``Dataset`` objects (including indexing and arithmetic) if you prefer to work
162+
with multiple variables at once.
163+
164+
NetCDF
165+
------
166+
167+
NetCDF is the recommended binary serialization format for xarray objects. Users
168+
from the geosciences will recognize that the :py:class:`~xarray.Dataset` data
169+
model looks very similar to a netCDF file (which, in fact, inspired it).
147170

148-
Datasets also let you easily read and write netCDF files:
171+
You can directly read and write xarray objects to disk using :py:meth:`~xarray.Dataset.to_netcdf`, :py:func:`~xarray.open_dataset` and
172+
:py:func:`~xarray.open_dataarray`:
149173

150174
.. ipython:: python
151175

doc/indexing.rst

+33-1
@@ -221,7 +221,7 @@ enabling nearest neighbor (inexact) lookups by use of the methods ``'pad'``,
221221

222222
.. ipython:: python
223223
224-
data = xr.DataArray([1, 2, 3], dims='x')
224+
data = xr.DataArray([1, 2, 3], [('x', [0, 1, 2])])
225225
data.sel(x=[1.1, 1.9], method='nearest')
226226
data.sel(x=0.1, method='backfill')
227227
data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
@@ -478,6 +478,30 @@ Both ``reindex_like`` and ``align`` work interchangeably between
478478
# this is a no-op, because there are no shared dimension names
479479
ds.reindex_like(other)
480480
481+
.. _indexing.missing_coordinates:
482+
483+
Missing coordinate labels
484+
-------------------------
485+
486+
Coordinate labels for each dimension are optional (as of xarray v0.9). Label
487+
based indexing with ``.sel`` and ``.loc`` uses standard positional,
488+
integer-based indexing as a fallback for dimensions without a coordinate label:
489+
490+
.. ipython:: python
491+
492+
array = xr.DataArray([1, 2, 3], dims='x')
493+
array.sel(x=[0, -1])
494+
495+
Alignment between xarray objects where one or both do not have coordinate labels
496+
succeeds only if all dimensions of the same name have the same length.
497+
Otherwise, it raises an informative error:
498+
499+
.. ipython::
500+
:verbatim:
501+
502+
In [62]: xr.align(array, array[:2])
503+
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension sizes: {2, 3}
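The size rule described above can be modeled in plain Python. This is a minimal sketch under stated assumptions, not xarray's implementation: `check_unlabeled_sizes` and its parameters are hypothetical names introduced for illustration, and the error text only approximates the messages shown in the docs.

```python
def check_unlabeled_sizes(dim, unlabeled_sizes, labeled_size=None):
    """Hypothetical model of the alignment size check (not xarray's code).

    dim: name of the dimension being aligned
    unlabeled_sizes: sizes along ``dim`` of the arguments without labels
    labeled_size: number of aligned labels along ``dim``, or None when no
        argument carries labels along ``dim``
    """
    sizes = set(unlabeled_sizes)
    if labeled_size is None:
        # No labels anywhere: every argument must agree on a single size.
        if len(sizes) > 1:
            raise ValueError(
                "arguments without labels along dimension %r cannot be "
                "aligned because they have different dimension sizes: %r"
                % (dim, sorted(sizes)))
    else:
        # Some argument has labels: unlabeled arguments must match the
        # length of the aligned labels.
        mismatched = sizes - {labeled_size}
        if mismatched:
            raise ValueError(
                "arguments without labels along dimension %r have sizes %r, "
                "which differ from the size of the aligned labels: %d"
                % (dim, sorted(mismatched), labeled_size))


check_unlabeled_sizes('x', [3, 3])  # same size, passes silently

try:
    check_unlabeled_sizes('x', [3, 2])  # sizes differ and no labels exist
except ValueError as err:
    print(err)
```

With labels present, the reference size is the length of the labels after alignment, which is why the sketch takes it as a separate argument.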
504+
481505
Underlying Indexes
482506
------------------
483507

@@ -491,3 +515,11 @@ through the :py:attr:`~xarray.DataArray.indexes` attribute.
491515
arr.indexes
492516
arr.indexes['time']
493517
518+
Use :py:meth:`~xarray.DataArray.get_index` to get an index for a dimension,
519+
falling back to a default :py:class:`pandas.RangeIndex` if it has no coordinate
520+
labels:
521+
522+
.. ipython:: python
523+
524+
array
525+
array.get_index('x')
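The fallback behavior of ``get_index`` can be pictured with a small dependency-free model. This is only a sketch: the function below is a hypothetical stand-in rather than xarray's method, the `indexes` and `sizes` mappings are illustrative, and a built-in `range` is assumed in place of the `pandas.RangeIndex` that the docs describe.

```python
def get_index(indexes, sizes, dim):
    """Hypothetical stand-in for the fallback described above."""
    try:
        # A dimension with coordinate labels returns its stored index.
        return indexes[dim]
    except KeyError:
        # A dimension without labels gets a default 0..n-1 index. The docs
        # describe a pandas.RangeIndex; a plain range is used here instead.
        return range(sizes[dim])


indexes = {'time': ['2000-01-01', '2000-01-02']}  # only 'time' has labels
sizes = {'time': 2, 'x': 3}                       # 'x' has no labels

print(get_index(indexes, sizes, 'time'))
print(get_index(indexes, sizes, 'x'))
```

The point of the design is that callers who need an index for every dimension no longer have to check whether the dimension appears in ``indexes`` first.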

doc/whats-new.rst

+31-2
@@ -21,10 +21,32 @@ v0.9.0 (unreleased)
2121
Breaking changes
2222
~~~~~~~~~~~~~~~~
2323

24+
- Index coordinates for each dimension are now optional, and no longer created
25+
by default :issue:`1017`. This has a number of implications:
26+
27+
- :py:func:`~align` and :py:meth:`~Dataset.reindex` can now raise an error if
28+
dimension labels are missing and dimensions have different sizes.
29+
- Because pandas does not support missing indexes, methods such as
30+
``to_dataframe``/``from_dataframe`` and ``stack``/``unstack`` no longer
31+
roundtrip faithfully on all inputs. Use :py:meth:`~Dataset.reset_index` to
32+
remove undesired indexes.
33+
- ``Dataset.__delitem__`` and :py:meth:`~Dataset.drop` no longer delete/drop
34+
variables that have dimensions matching a deleted/dropped variable.
35+
- ``DataArray.coords.__delitem__`` is now allowed on variables matching
36+
dimension names.
37+
- ``.sel`` and ``.loc`` now handle indexing along a dimension without
38+
coordinate labels by doing integer based indexing. See
39+
:ref:`indexing.missing_coordinates` for an example.
40+
- :py:attr:`~Dataset.indexes` is no longer guaranteed to include all
41+
dimension names as keys. The new method :py:meth:`~Dataset.get_index` has
42+
been added to always return an index for a dimension, falling back to
43+
a default ``RangeIndex`` if necessary.
44+
2445
- The default behavior of ``merge`` is now ``compat='no_conflicts'``, so some
2546
merges will now succeed in cases that previously raised
2647
``xarray.MergeError``. Set ``compat='broadcast_equals'`` to restore the
27-
previous default.
48+
previous default. See :ref:`combining.no_conflicts` for more details.
49+
2850
- Reading :py:attr:`~DataArray.values` no longer always caches values in a NumPy
2951
array :issue:`1128`. Caching of ``.values`` on variables read from netCDF
3052
files on disk is still the default when :py:func:`open_dataset` is called with
@@ -150,6 +172,13 @@ Bug fixes
150172
should be computed or not.
151173
By `Fabien Maussion <https://github.com/fmaussion>`_.
152174

175+
- Grouping over a dimension with non-unique values with ``groupby`` gives
176+
correct groups.
177+
By `Stephan Hoyer <https://github.com/shoyer>`_.
178+
179+
- Fixed accessing coordinate variables with non-string names from ``.coords``.
180+
By `Stephan Hoyer <https://github.com/shoyer>`_.
181+
153182
- :py:meth:`~xarray.DataArray.rename` now simultaneously renames the array and
154183
any coordinate with the same name, when supplied via a :py:class:`dict`
155184
(:issue:`1116`).
@@ -1280,7 +1309,7 @@ Enhancements
12801309

12811310
.. ipython:: python
12821311
1283-
data = xray.DataArray([1, 2, 3], dims='x')
1312+
data = xray.DataArray([1, 2, 3], [('x', range(3))])
12841313
data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
12851314
12861315
This will be especially useful once pandas 0.16 is released, at which point

xarray/backends/common.py

-25
@@ -33,25 +33,6 @@ def _decode_variable_name(name):
3333
return name
3434

3535

36-
def is_trivial_index(var):
37-
"""
38-
Determines if in index is 'trivial' meaning that it is
39-
equivalent to np.arange(). This is determined by
40-
checking if there are any attributes or encodings,
41-
if ndims is one, dtype is int and finally by comparing
42-
the actual values to np.arange()
43-
"""
44-
# if either attributes or encodings are defined
45-
# the index is not trivial.
46-
if len(var.attrs) or len(var.encoding):
47-
return False
48-
# if the index is not a 1d integer array
49-
if var.ndim > 1 or not var.dtype.kind == 'i':
50-
return False
51-
arange = np.arange(var.size, dtype=var.dtype)
52-
return np.all(var.values == arange)
53-
54-
5536
def robust_getitem(array, key, catch=Exception, max_retries=6,
5637
initial_delay=500):
5738
"""
@@ -203,12 +184,6 @@ def store_dataset(self, dataset):
203184

204185
def store(self, variables, attributes, check_encoding_set=frozenset()):
205186
self.set_attributes(attributes)
206-
neccesary_dims = [v.dims for v in variables.values()]
207-
neccesary_dims = set(itertools.chain(*neccesary_dims))
208-
# set all non-indexes and any index which is not trivial.
209-
variables = OrderedDict((k, v) for k, v in iteritems(variables)
210-
if not (k in neccesary_dims and
211-
is_trivial_index(v)))
212187
self.set_variables(variables, check_encoding_set)
213188

214189
def set_attributes(self, attributes):

xarray/conventions.py

+2-2
@@ -913,7 +913,7 @@ def decode_cf(obj, concat_characters=True, mask_and_scale=True,
913913
identify coordinates.
914914
drop_variables: string or iterable, optional
915915
A variable or list of variables to exclude from being parsed from the
916-
dataset.This may be useful to drop variables with problems or
916+
dataset. This may be useful to drop variables with problems or
917917
inconsistent values.
918918
919919
Returns
@@ -939,7 +939,7 @@ def decode_cf(obj, concat_characters=True, mask_and_scale=True,
939939
vars, attrs, concat_characters, mask_and_scale, decode_times,
940940
decode_coords, drop_variables=drop_variables)
941941
ds = Dataset(vars, attrs=attrs)
942-
ds = ds.set_coords(coord_names.union(extra_coords))
942+
ds = ds.set_coords(coord_names.union(extra_coords).intersection(vars))
943943
ds._file_obj = file_obj
944944
return ds
945945
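The effect of the added ``.intersection(vars)`` call can be seen with plain Python sets. This is a minimal sketch with made-up values: the names below mirror those in ``decode_cf`` but the data is hypothetical, and the motivation given in the comments (coordinate names that no longer have a matching variable) is an assumption, not something the diff states.

```python
# Hypothetical decoded variables; only the keys matter for this sketch.
variables = {'temperature': [280.0, 281.5], 'time': [0, 1]}

# Names flagged as coordinates. Assumption: some may have no variable,
# for example a dimension that no longer gets a default coordinate.
coord_names = {'time', 'x'}
extra_coords = {'y'}

candidates = coord_names.union(extra_coords)

# set.intersection accepts any iterable; iterating a dict yields its keys,
# so only names that actually exist as variables are kept.
kept = candidates.intersection(variables)

print(sorted(kept))  # ['time']
```

Filtering before ``set_coords`` means names without a backing variable are dropped rather than passed on.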
