
append() method using dynamically resized arrays #1398


Closed
shoyer opened this issue May 4, 2017 · 2 comments

Comments

@shoyer
Member

shoyer commented May 4, 2017

This has come up a few times recently, e.g., on StackOverflow.

We could implement efficient append()/extend() methods with something like the following strategy:

  1. Write a ResizableArray duck-array class that stores its data in a contiguous numpy array and allows appending along its first (outermost) dimension; the underlying buffer is resized whenever an exponentially growing size threshold is exceeded, following the standard approach for dynamically resized arrays (TensorFlow's streaming_concat is a good example of the right logic). This class would expose a duck-array interface suitable for use in xarray (e.g., similar to xarray.core.indexing.CopyOnWriteArray), along with .append() and .extend() methods that handle numpy arrays. See the sketch after this list.

  2. Write a ResizableVariable subclass of Variable that stores its data in a ResizableArray. It works like a Variable but adds append and extend methods.

  3. Write a ResizableDataset subclass of Dataset that uses ResizableVariable instead of Variable objects and has an additional argument for specifying the "resizable" dimension. All variables put in a ResizableDataset that use the resizable dimension must have it as their first dimension, and you cannot make an index along this dimension, because pandas.Index cannot be efficiently resized. Operations other than append/extend on a ResizableDataset return a normal Dataset.
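
For concreteness, here is a minimal, hypothetical sketch of the ResizableArray idea from step 1: a numpy-backed buffer that grows geometrically along its first axis, so repeated appends take amortized O(1) time per row. Only the class name and the append/extend methods come from the proposal above; the constructor arguments and helper names are illustrative assumptions, and nothing like this exists in xarray itself.

import numpy as np

class ResizableArray:
    """Numpy-backed buffer that grows geometrically along its first axis."""

    def __init__(self, shape_rest=(), dtype=float, initial_capacity=4):
        # shape_rest is the fixed shape of every dimension after the first.
        self._shape_rest = tuple(shape_rest)
        self._buffer = np.empty((initial_capacity,) + self._shape_rest, dtype=dtype)
        self._size = 0  # number of valid rows along the first axis

    @property
    def shape(self):
        return (self._size,) + self._shape_rest

    @property
    def dtype(self):
        return self._buffer.dtype

    def _ensure_capacity(self, needed):
        capacity = self._buffer.shape[0]
        if needed <= capacity:
            return
        # Double the capacity until it fits; the O(n) copy happens only
        # O(log n) times, so appends are amortized O(1) per row.
        while capacity < needed:
            capacity *= 2
        new_buffer = np.empty((capacity,) + self._shape_rest, dtype=self._buffer.dtype)
        new_buffer[:self._size] = self._buffer[:self._size]
        self._buffer = new_buffer

    def append(self, row):
        self.extend(np.asarray(row)[np.newaxis, ...])

    def extend(self, rows):
        rows = np.asarray(rows)
        if rows.shape[1:] != self._shape_rest:
            raise ValueError('trailing dimensions do not match')
        self._ensure_capacity(self._size + rows.shape[0])
        self._buffer[self._size:self._size + rows.shape[0]] = rows
        self._size += rows.shape[0]

    def __array__(self, dtype=None):
        # Expose only the filled portion as an ordinary numpy array.
        out = self._buffer[:self._size]
        return out.astype(dtype) if dtype is not None else out

arr = ResizableArray(shape_rest=(3,))
for i in range(10):
    arr.append([i, i + 1, i + 2])
print(np.asarray(arr).shape)  # (10, 3)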

The end user API could look something like this:

ds = xarray.ResizableDataset(dim='example')
for example_ds in ...:
   ds.append(example_ds)

One consideration for whether this is actually worth doing is whether there are common workflows that make use of the intermediate accumulated datasets (maybe trailing-window operations?).

@fmaussion
Member

Would this also somehow be related to unlimited dims and appending to an existing netCDF file?

Something like:

ds = xarray.ResizableDataset(dim='example')
ds.to_file('file.nc')
for example_ds in ...:
   ds.append(example_ds)
   ds.flush()  # the data exists on disk but is no longer in memory

@shoyer
Member Author

shoyer commented May 5, 2017

Writing up this proposal was actually enough for me to realize that this is probably not a good idea. There just aren't many use cases for this:

  • Either you don't care about speed, and you can happily call xarray.concat in a loop.
  • Or you care about performance, which means you recognize that any operation that takes linear time on an xarray.Dataset is too slow. The only operations that really make sense are trailing-window operations, which you might as well do by calling something like xarray.concat(list_of_datasets[-10:], dim='example') to construct the trailing-window dataset from scratch (a short sketch of that pattern follows this list).
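
As a hedged illustration of that trailing-window alternative (the 'example' dimension name, the variable name, and the window size of 10 are invented for the example): keep only the most recent per-step datasets in a bounded deque and rebuild the window with xarray.concat, so the cost scales with the window size rather than the total number of steps.

from collections import deque

import numpy as np
import xarray as xr

window = deque(maxlen=10)  # holds only the 10 most recent per-step datasets
trailing = None
for step in range(100):
    example_ds = xr.Dataset({'value': ('example', np.array([float(step)]))})
    window.append(example_ds)  # the oldest dataset is dropped automatically
    # Rebuild the trailing-window dataset from scratch; O(window size), not O(total steps).
    trailing = xr.concat(list(window), dim='example')
print(trailing.sizes['example'])  # 10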

Would this also somehow be related to unlimited dims and appending to an existing netCDF file?

Yes, it's sort of a similar idea. But honestly, I think this use case would be better served by a keyword argument on to_netcdf, e.g., ds.to_netcdf('file.nc', extend='example') would extend file.nc along the "example" dimension, or create a new file with "example" as an unlimited dimension.
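
That extend= keyword does not exist; purely as a hedged sketch of the underlying mechanism, this is roughly what appending along an unlimited dimension looks like when using the netCDF4 library directly (the file, dimension, and variable names are made up for the example):

import netCDF4
import numpy as np

# Create the file once, with 'example' as an unlimited dimension.
with netCDF4.Dataset('file.nc', 'w') as nc:
    nc.createDimension('example', None)  # None marks the dimension as unlimited
    nc.createVariable('value', 'f8', ('example',))

# Each later append writes past the current end of the unlimited dimension,
# which grows the variable (and the file) in place.
new_data = np.arange(5.0)
with netCDF4.Dataset('file.nc', 'a') as nc:
    var = nc.variables['value']
    start = var.shape[0]
    var[start:start + new_data.shape[0]] = new_data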
