Commit 7c9a2fe

Merge pull request #473 from shoyer/concat-rewrite

Rewrite of xray.concat

2 parents a505463 + e6f6cbd

15 files changed (+715, -508)

doc/combining.rst (+4, -3)

@@ -65,9 +65,10 @@ Of course, ``concat`` also works on ``Dataset`` objects:

     xray.concat([ds.sel(x='a'), ds.sel(x='b')], 'x')

 :py:func:`~xray.concat` has a number of options which provide deeper control
-over which variables and coordinates are concatenated and how it handles
-conflicting variables between datasets. However, these should rarely be
-necessary.
+over which variables are concatenated and how it handles conflicting variables
+between datasets. With the default parameters, xray will load some coordinate
+variables into memory to compare them between datasets. This may be prohibitively
+expensive if you are manipulating your dataset lazily using :ref:`dask`.

 .. _merge:


doc/whats-new.rst (+11, -0)

@@ -12,6 +12,16 @@ What's New
 v0.5.2 (unreleased)
 -------------------

+Backwards incompatible changes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- The optional arguments ``concat_over`` and ``mode`` in :py:func:`~xray.concat` have
+  been removed and replaced by ``data_vars`` and ``coords``. The new arguments are both
+  more easily understood and more robustly implemented, and allowed us to fix a bug
+  where ``concat`` accidentally loaded data into memory. If you set values for
+  these optional arguments manually, you will need to update your code. The default
+  behavior should be unchanged.
+
 Enhancements
 ~~~~~~~~~~~~

@@ -47,6 +57,7 @@ Bug fixes
   supplying chunks as a single integer.
 - Fixed a bug in serializing scalar datetime variable to netCDF.
 - Fixed a bug that could occur in serialization of 0-dimensional integer arrays.
+- Fixed a bug where concatenating DataArrays was not always lazy (:issue:`464`).

 v0.5.1 (15 June 2015)
 ---------------------

xray/__init__.py (+2, -1)

@@ -1,4 +1,5 @@
-from .core.alignment import align, broadcast_arrays, concat, auto_combine
+from .core.alignment import align, broadcast_arrays
+from .core.combine import concat, auto_combine
 from .core.variable import Variable, Coordinate
 from .core.dataset import Dataset
 from .core.dataarray import DataArray

xray/backends/api.py (+1, -1)

@@ -5,7 +5,7 @@

 from .. import backends, conventions
 from .common import ArrayWriter
-from ..core.alignment import auto_combine
+from ..core.combine import auto_combine
 from ..core.utils import close_on_error, is_remote_uri
 from ..core.pycompat import basestring, OrderedDict, range
1111

xray/core/alignment.py (+4, -138)
Original file line numberDiff line numberDiff line change
@@ -6,9 +6,9 @@

 from . import ops, utils
 from .common import _maybe_promote
-from .pycompat import iteritems, OrderedDict, reduce
+from .pycompat import iteritems, OrderedDict
 from .utils import is_full_slice
-from .variable import as_variable, Variable, Coordinate, broadcast_variables
+from .variable import Variable, Coordinate, broadcast_variables


 def _get_joiner(join):
@@ -218,142 +218,6 @@ def var_indexers(var, indexers):
     return reindexed


-def concat(objs, dim='concat_dim', indexers=None, mode='different',
-           concat_over=None, compat='equals'):
-    """Concatenate xray objects along a new or existing dimension.
-
-    Parameters
-    ----------
-    objs : sequence of Dataset and DataArray objects
-        xray objects to concatenate together. Each object is expected to
-        consist of variables and coordinates with matching shapes except for
-        along the concatenated dimension.
-    dim : str or DataArray or Index, optional
-        Name of the dimension to concatenate along. This can either be a new
-        dimension name, in which case it is added along axis=0, or an existing
-        dimension name, in which case the location of the dimension is
-        unchanged. If dimension is provided as a DataArray or Index, its name
-        is used as the dimension to concatenate along and the values are added
-        as a coordinate.
-    indexers : None or iterable of indexers, optional
-        Iterable of indexers of the same length as datasets which
-        specifies how to assign variables from each dataset along the given
-        dimension. If not supplied, indexers is inferred from the length of
-        each variable along the dimension, and the variables are stacked in
-        the given order.
-    mode : {'minimal', 'different', 'all'}, optional
-        Decides which variables are concatenated. Choices are 'minimal'
-        in which only variables in which dimension already appears are
-        included, 'different' in which all variables which are not equal
-        (ignoring attributes) across all datasets are concatenated (as well
-        as all for which dimension already appears), and 'all' for which all
-        variables are concatenated.
-    concat_over : None or str or iterable of str, optional
-        Names of additional variables to concatenate, in which the provided
-        parameter ``dim`` does not already appear as a dimension. The default
-        value includes all data variables.
-    compat : {'equals', 'identical'}, optional
-        String indicating how to compare non-concatenated variables and
-        dataset global attributes for potential conflicts. 'equals' means
-        that all variable values and dimensions must be the same;
-        'identical' means that variable attributes and global attributes
-        must also be equal.
-
-    Returns
-    -------
-    concatenated : type of objs
-
-    See also
-    --------
-    auto_combine
-    """
-    # TODO: add join and ignore_index arguments copied from pandas.concat
-    # TODO: support concatenating scalar coordinates even if the concatenated
-    # dimension already exists
-    try:
-        first_obj, objs = utils.peek_at(objs)
-    except StopIteration:
-        raise ValueError('must supply at least one object to concatenate')
-    cls = type(first_obj)
-    return cls._concat(objs, dim, indexers, mode, concat_over, compat)
-
-
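The removed ``concat`` relies on ``utils.peek_at`` to look at the first object without consuming the input iterator, then dispatches to that class's ``_concat``. A minimal sketch of such a peek helper, as an illustration rather than xray's actual implementation:

```python
from itertools import chain

def peek_at(iterable):
    """Return the first element of an iterable, plus an iterator that
    yields all of its elements, including the first one."""
    gen = iter(iterable)
    first = next(gen)  # StopIteration here signals an empty input
    return first, chain([first], gen)

first, objs = peek_at(iter([10, 20, 30]))
# first is 10, and objs still yields all three elements
```

This is why ``concat`` can accept any iterable of xray objects, not just a list, while still raising a clear error when the input is empty.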
-def _auto_concat(datasets, dim=None):
-    if len(datasets) == 1:
-        return datasets[0]
-    else:
-        if dim is None:
-            ds0 = datasets[0]
-            ds1 = datasets[1]
-            concat_dims = set(ds0.dims)
-            if ds0.dims != ds1.dims:
-                dim_tuples = set(ds0.dims.items()) - set(ds1.dims.items())
-                concat_dims = set(i for i, _ in dim_tuples)
-            if len(concat_dims) > 1:
-                concat_dims = set(d for d in concat_dims
-                                  if not ds0[d].equals(ds1[d]))
-            if len(concat_dims) > 1:
-                raise ValueError('too many different dimensions to '
-                                 'concatenate: %s' % concat_dims)
-            elif len(concat_dims) == 0:
-                raise ValueError('cannot infer dimension to concatenate: '
-                                 'supply the ``concat_dim`` argument '
-                                 'explicitly')
-            dim, = concat_dims
-        return concat(datasets, dim=dim)
-
-
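The dimension-inference heuristic in the removed ``_auto_concat`` compares the ``dims`` mappings (dimension name to size) of the first two datasets: a dimension whose size differs between them is a candidate concatenation dimension. The core of that set arithmetic, sketched with plain dicts standing in for ``Dataset.dims``:

```python
# Plain dicts standing in for the .dims mapping (name -> size) of the
# first two datasets being combined (hypothetical sizes).
ds0_dims = {'time': 10, 'station': 5}
ds1_dims = {'time': 12, 'station': 5}

# (name, size) pairs present in ds0 but not in ds1 identify dimensions
# whose sizes differ; those are the candidate concatenation dimensions.
dim_tuples = set(ds0_dims.items()) - set(ds1_dims.items())
concat_dims = set(name for name, _ in dim_tuples)
# concat_dims is {'time'}, so 'time' is the inferred dimension
```

If more than one candidate survives (or none does), ``_auto_concat`` raises a ``ValueError`` asking the caller to supply ``concat_dim`` explicitly.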
-def auto_combine(datasets, concat_dim=None):
-    """Attempt to auto-magically combine the given datasets into one.
-
-    This method attempts to combine a list of datasets into a single entity by
-    inspecting metadata and using a combination of concat and merge.
-
-    It does not concatenate along more than one dimension or align or sort data
-    under any circumstances. It will fail in complex cases, for which you
-    should use ``concat`` and ``merge`` explicitly.
-
-    When ``auto_combine`` may succeed:
-
-    * You have N years of data and M data variables. Each combination of a
-      distinct time period and set of data variables is saved as its own
-      dataset.
-
-    Examples of when ``auto_combine`` fails:
-
-    * In the above scenario, one file is missing, containing the data for one
-      year for one variable.
-    * In the most recent year, there is an additional data variable.
-    * Your data includes "time" and "station" dimensions, and each year's data
-      has a different set of stations.
-
-    Parameters
-    ----------
-    datasets : sequence of xray.Dataset
-        Dataset objects to merge.
-    concat_dim : str or DataArray or Index, optional
-        Dimension along which to concatenate variables, as used by
-        :py:func:`xray.concat`. You only need to provide this argument if the
-        dimension along which you want to concatenate is not a dimension in
-        the original datasets, e.g., if you want to stack a collection of
-        2D arrays along a third dimension.
-
-    Returns
-    -------
-    combined : xray.Dataset
-
-    See also
-    --------
-    concat
-    Dataset.merge
-    """
-    from toolz import itertoolz
-    grouped = itertoolz.groupby(lambda ds: tuple(sorted(ds.data_vars)),
-                                datasets).values()
-    concatenated = [_auto_concat(ds, dim=concat_dim) for ds in grouped]
-    merged = reduce(lambda ds, other: ds.merge(other), concatenated)
-    return merged
-
-
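The removed ``auto_combine`` first groups datasets that share the same set of data variable names (via ``toolz.itertoolz.groupby``), concatenates within each group, and then merges the groups together. The grouping step can be sketched with the standard library, using sets of variable names as hypothetical stand-ins for datasets:

```python
from collections import defaultdict

# Hypothetical stand-ins: each "dataset" is reduced to the set of names
# of its data variables.
datasets = [{'temp', 'precip'}, {'temp', 'precip'}, {'station_info'}]

# Group datasets by their sorted tuple of variable names, mirroring
# toolz.itertoolz.groupby(lambda ds: tuple(sorted(ds.data_vars)), datasets)
grouped = defaultdict(list)
for ds in datasets:
    grouped[tuple(sorted(ds))].append(ds)

# Each group would then be concatenated with _auto_concat, and the
# per-group results merged into a single dataset.
```

Grouping by the sorted tuple of names is what makes the result independent of the order in which variables appear in each dataset.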
 def broadcast_arrays(*args):
     """Explicitly broadcast any number of DataArrays against one another.

@@ -376,6 +240,8 @@ def broadcast_arrays(*args):
     ValueError
         If indexes on the different arrays are not aligned.
     """
+    # TODO: fixme for coordinate arrays
+
     from .dataarray import DataArray

     all_indexes = _get_all_indexes(args)
