
append() method using dynamically resized arrays #1398


Closed
shoyer opened this issue May 4, 2017 · 2 comments

Comments

@shoyer
Member

shoyer commented May 4, 2017

This has come up a few times recently, e.g., on StackOverflow.

We could implement efficient append()/extend() methods with something like the following strategy:

  1. Write a ResizableArray duck-array class that stores its data in a contiguous numpy array and allows appending along its first (outermost) dimension; the underlying buffer is resized whenever an exponentially growing size threshold is exceeded, following the standard approach for dynamically resized arrays (TensorFlow's streaming_concat is a good example of the right logic). This class would expose a duck-array interface suitable for use in xarray (e.g., similar to xarray.core.indexing.CopyOnWriteArray), along with .append() and .extend() methods that handle numpy arrays. See the sketch after this list.

  2. Write a ResizableVariable subclass of Variable that stores its data in a ResizableArray. It works like a Variable but adds append and extend methods.

  3. Write a ResizableDataset subclass of Dataset that uses ResizableVariable instead of Variable objects and has an additional argument for specifying the "resizable" dimension. All variables put in a ResizableDataset that use the resizable dimension must have it as their first dimension, and you cannot make an index along this dimension, because pandas.Index cannot be efficiently resized. Operations other than append/extend on a ResizableDataset return a normal Dataset.
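
For concreteness, here is a minimal, hypothetical sketch of the ResizableArray idea from step 1: a numpy-backed buffer that grows geometrically along its first axis, so repeated appends take amortized O(1) time per row. Only the class name and the append/extend methods come from the proposal above; the constructor arguments and helper names are illustrative assumptions, and nothing like this exists in xarray itself.

import numpy as np

class ResizableArray:
    """Numpy-backed buffer that grows geometrically along its first axis."""

    def __init__(self, shape_rest=(), dtype=float, initial_capacity=4):
        # shape_rest is the fixed shape of every dimension after the first.
        self._shape_rest = tuple(shape_rest)
        self._buffer = np.empty((initial_capacity,) + self._shape_rest, dtype=dtype)
        self._size = 0  # number of valid rows along the first axis

    @property
    def shape(self):
        return (self._size,) + self._shape_rest

    @property
    def dtype(self):
        return self._buffer.dtype

    def _ensure_capacity(self, needed):
        capacity = self._buffer.shape[0]
        if needed <= capacity:
            return
        # Double the capacity until it fits; the O(n) copy happens only
        # O(log n) times, so appends are amortized O(1) per row.
        while capacity < needed:
            capacity *= 2
        new_buffer = np.empty((capacity,) + self._shape_rest, dtype=self._buffer.dtype)
        new_buffer[:self._size] = self._buffer[:self._size]
        self._buffer = new_buffer

    def append(self, row):
        self.extend(np.asarray(row)[np.newaxis, ...])

    def extend(self, rows):
        rows = np.asarray(rows)
        if rows.shape[1:] != self._shape_rest:
            raise ValueError('trailing dimensions do not match')
        self._ensure_capacity(self._size + rows.shape[0])
        self._buffer[self._size:self._size + rows.shape[0]] = rows
        self._size += rows.shape[0]

    def __array__(self, dtype=None):
        # Expose only the filled portion as an ordinary numpy array.
        out = self._buffer[:self._size]
        return out.astype(dtype) if dtype is not None else out

arr = ResizableArray(shape_rest=(3,))
for i in range(10):
    arr.append([i, i + 1, i + 2])
print(np.asarray(arr).shape)  # (10, 3)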

The end user API could look something like this:

ds = xarray.ResizableDataset(dim='example')
for example_ds in ...:
   ds.append(example_ds)

One consideration for whether this is actually worth doing is whether there are common workflows that make use of the intermediate accumulated datasets (maybe trailing-window operations?).

@fmaussion
Member

Would this also somehow be related to unlimited dims and appending to an existing netCDF file?

Something like:

ds = xarray.ResizableDataset(dim='example')
ds.to_file('file.nc')
for example_ds in ...:
   ds.append(example_ds)
   ds.flush()  # the data exists on disk but is no longer in memory

@shoyer
Member Author

shoyer commented May 5, 2017

Writing up this proposal was actually enough for me to realize that this is probably not a good idea. There just aren't many use cases for this:

  • Either you don't care about speed, and you can happily call xarray.concat in a loop.
  • Or you care about performance, which means you recognize that any operation that takes linear time on an xarray.Dataset is too slow. The only operations that really make sense are trailing-window operations, which you might as well do by calling something like xarray.concat(list_of_datasets[-10:], dim='example') to construct the trailing-window dataset from scratch (a short sketch of that pattern follows this list).
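
As a hedged illustration of that trailing-window alternative (the 'example' dimension name, the variable name, and the window size of 10 are invented for the example): keep only the most recent per-step datasets in a bounded deque and rebuild the window with xarray.concat, so the cost scales with the window size rather than the total number of steps.

from collections import deque

import numpy as np
import xarray as xr

window = deque(maxlen=10)  # holds only the 10 most recent per-step datasets
trailing = None
for step in range(100):
    example_ds = xr.Dataset({'value': ('example', np.array([float(step)]))})
    window.append(example_ds)  # the oldest dataset is dropped automatically
    # Rebuild the trailing-window dataset from scratch; O(window size), not O(total steps).
    trailing = xr.concat(list(window), dim='example')
print(trailing.sizes['example'])  # 10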

Would this also somehow be related to unlimited dims and appending to an existing netCDF file?

Yes, it's sort of a similar idea. But honestly, I think this use case would be better served by a keyword argument on to_netcdf, e.g., ds.to_netcdf('file.nc', extend='example') would extend file.nc along the "example" dimension, or create a new file with "example" as an unlimited dimension.
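
That extend= keyword does not exist; purely as a hedged sketch of the underlying mechanism, this is roughly what appending along an unlimited dimension looks like when using the netCDF4 library directly (the file, dimension, and variable names are made up for the example):

import netCDF4
import numpy as np

# Create the file once, with 'example' as an unlimited dimension.
with netCDF4.Dataset('file.nc', 'w') as nc:
    nc.createDimension('example', None)  # None marks the dimension as unlimited
    nc.createVariable('value', 'f8', ('example',))

# Each later append writes past the current end of the unlimited dimension,
# which grows the variable (and the file) in place.
new_data = np.arange(5.0)
with netCDF4.Dataset('file.nc', 'a') as nc:
    var = nc.variables['value']
    start = var.shape[0]
    var[start:start + new_data.shape[0]] = new_data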
