Description
Here are two toy datasets designed to represent sections of a dataset that has variables living on a staggered grid. This type of dataset is common in fluid modelling (handling staggered grids is why xGCM exists).
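import xarray as xr

# First section: 3 cell centers and the 4 cell edges that bound them
ds1 = xr.Dataset(
    data_vars={
        'a': ('x_center', [1, 2, 3]),
        'b': ('x_outer', [0.5, 1.5, 2.5, 3.5]),
    },
)

# Second section: 3 more cell centers, but only 3 new cell edges
ds2 = xr.Dataset(
    data_vars={
        'a': ('x_center', [4, 5, 6]),
        'b': ('x_outer', [4.5, 5.5, 6.5]),
    },
)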
I have netCDF output files from an ocean model (UCLA-ROMS) that have this basic structure.
Combining these types of datasets seems like a bit of a pain to do with kerchunk at the moment.
To concatenate along the x direction, I actually need to concatenate `a` along `x_center`, and `b` along `x_outer`. So presumably I have to call `MultiZarrToZarr` once for each variable (or group of variables) that needs to be concatenated along a common dimension. My real dataset is split along multiple dimensions and has multiple staggered grid locations for each dimension, meaning I have to call `MultiZarrToZarr` something like 6 times.
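To make the bookkeeping concrete, here is a rough pure-Python sketch of the "group by dimension, concatenate each group, then merge" pattern, applied to plain-dict stand-ins for the toy datasets above. The function name `combine_staggered` and the dict layout are hypothetical, made up for illustration; this is not kerchunk's or xarray's API, and it only handles one dimension per variable.

```python
from collections import defaultdict

# Plain-dict stand-ins for ds1 and ds2 (an assumption for illustration):
# variable name -> (dimension name, values).
ds1 = {"a": ("x_center", [1, 2, 3]), "b": ("x_outer", [0.5, 1.5, 2.5, 3.5])}
ds2 = {"a": ("x_center", [4, 5, 6]), "b": ("x_outer", [4.5, 5.5, 6.5])}


def combine_staggered(datasets):
    """Group variables by dimension, concatenate each group, then merge.

    Hypothetical helper: assumes every dataset holds the same variables
    and that the datasets are already given in concatenation order.
    """
    # Step 1: group variable names by the dimension they live on.
    groups = defaultdict(list)
    for name, (dim, _values) in datasets[0].items():
        groups[dim].append(name)

    # Steps 2 and 3: concatenate each group along its own dimension,
    # then collect (merge) every result into a single mapping.
    combined = {}
    for dim, names in groups.items():
        for name in names:
            values = []
            for ds in datasets:
                values.extend(ds[name][1])
            combined[name] = (dim, values)
    return combined


combined = combine_staggered([ds1, ds2])
print(combined["a"])
print(combined["b"])
```

In the kerchunk workflow I'm describing, each entry in `groups` would correspond to one `MultiZarrToZarr` call, and the final mapping to the merge step.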
This problem is analogous to what happens inside `xarray.combine_by_coords`, which automatically splits datasets up into sets consisting of the same variable from each dataset, concatenates each set separately (along multiple dimensions in general), then merges the results.
Is that approach (call `MultiZarrToZarr` multiple times then call `kerchunk.combine.merge_vars`) the recommended way to handle this case currently? Could we imagine some improvement to the `kerchunk.combine` API that might make this easier?