This has come up a few times recently, e.g., on StackOverflow.

We could implement efficient append()/extend() methods with something like the following strategy:
Write a ResizableArray duck array class that stores its data in a contiguous numpy array, supports appending along its first (outermost) dimension, and is resized whenever an exponentially growing capacity threshold is exceeded, using the standard strategy for dynamically resized arrays. TensorFlow's streaming_concat is a good example of the right logic. This class would expose a duck array interface suitable for use in xarray (e.g., similar to xarray.core.indexing.CopyOnWriteArray), along with .append and .extend methods that accept numpy arrays (see the sketch after this list).
Write a ResizableVariable subclass of Variable that stores its data in a ResizableArray. It works like a Variable but adds append and extend methods.
Write a ResizableDataset subclass of Dataset that uses ResizableVariable instead of Variable objects and takes an additional argument specifying the "resizable" dimension. Every variable in a ResizableDataset that uses the resizable dimension must have it as its first dimension, and no index can be created along it, because a pandas.Index cannot be resized efficiently. Operations other than append/extend on a ResizableDataset return a normal Dataset.
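The end user API could look something like this (a rough sketch only; ResizableDataset, its dim argument, and stream_of_datasets are hypothetical names from this proposal, not an existing xarray API):

```python
import xarray

# Hypothetical API from this proposal -- none of these names exist in xarray today.
ds = xarray.ResizableDataset(dim='example')
for example_ds in stream_of_datasets():  # stand-in for whatever produces the data
    ds.append(example_ds)                # amortized O(1) copying per append
snapshot = ds.mean(dim='example')        # non-append operations return a plain Dataset
```

The core mechanism is the exponentially resized buffer. Below is a minimal sketch of that logic using plain numpy; the class and method names are illustrative, and a real implementation would also need the duck array interface xarray expects (along the lines of CopyOnWriteArray):

```python
import numpy as np


class ResizableArray:
    """Duck array that grows along its first dimension.

    The buffer's capacity doubles whenever an append/extend would overflow,
    so the total copying cost over n appended rows is amortized O(n) --
    the standard dynamic-array strategy.
    """

    def __init__(self, template, initial_capacity=16):
        template = np.asarray(template)
        self._size = 0
        self._buffer = np.empty(
            (max(initial_capacity, template.shape[0]),) + template.shape[1:],
            dtype=template.dtype,
        )
        self.extend(template)

    @property
    def shape(self):
        return (self._size,) + self._buffer.shape[1:]

    @property
    def dtype(self):
        return self._buffer.dtype

    def __array__(self, dtype=None):
        # Expose only the filled portion of the underlying buffer.
        filled = self._buffer[:self._size]
        return filled if dtype is None else filled.astype(dtype)

    def extend(self, values):
        values = np.asarray(values, dtype=self._buffer.dtype)
        new_size = self._size + values.shape[0]
        if new_size > self._buffer.shape[0]:
            # Grow geometrically, then copy the existing rows exactly once.
            new_capacity = max(new_size, 2 * self._buffer.shape[0])
            grown = np.empty(
                (new_capacity,) + self._buffer.shape[1:], dtype=self._buffer.dtype
            )
            grown[:self._size] = self._buffer[:self._size]
            self._buffer = grown
        self._buffer[self._size:new_size] = values
        self._size = new_size

    def append(self, value):
        # Append a single slice along the first dimension.
        self.extend(np.asarray(value, dtype=self._buffer.dtype)[np.newaxis, ...])
```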
One consideration for whether this is actually worth doing is whether there are common workflows that make use of intermediate accumulated datasets (maybe trailing window operations?).
Would this also somehow be related to unlimited dims and appending to an existing netCDF file?
Something like:
ds = xarray.ResizableDataset(dim='example')
ds.to_file('file.nc')
for example_ds in ...:
    ds.append(example_ds)
    ds.flush()  # the data exists on disk but not in memory any more
Writing up this proposal was actually enough for me to realize that this is probably not a good idea. There just aren't many use cases for this:
Either you don't care about speed, in which case you can happily call xarray.concat in a loop.
Or you care about performance, which means you recognize that any operation that takes linear time on an xarray.Dataset is too slow. The only operations that really make sense then are trailing window operations, which you might as well do by calling something like xarray.concat(list_of_datasets[-10:], dim='example') to construct the trailing window dataset from scratch.
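For illustration, such a trailing window can be maintained with a small deque and rebuilt with xarray.concat at each step; the function and argument names here are made up for the example:

```python
import collections

import xarray


def trailing_windows(datasets, window_size=10, dim="example"):
    """Yield a Dataset concatenating the last `window_size` inputs."""
    window = collections.deque(maxlen=window_size)
    for ds in datasets:
        window.append(ds)
        # Rebuilding the window from scratch is O(window_size) per step,
        # which is usually negligible next to whatever computation then
        # runs on the window itself.
        yield xarray.concat(list(window), dim=dim)
```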
Would this also somehow be related to unlimited dims and appending to an existing netCDF file?
Yes, it's sort of a similar idea. But honestly, I think this use case would be better served by a keyword argument on to_netcdf, e.g., ds.to_netcdf('file.nc', extend='example'), which would extend file.nc along the "example" dimension, or create a new file with "example" as an unlimited dimension.
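With that (currently hypothetical) keyword, the streaming workflow from the comment above would reduce to something like the following; stream_of_datasets is again a stand-in name:

```python
# 'extend' is the keyword proposed above -- it does not exist in xarray today.
for example_ds in stream_of_datasets():
    # The first call would create file.nc with "example" as an unlimited
    # dimension; subsequent calls would append along it.
    example_ds.to_netcdf('file.nc', extend='example')
```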