Problem
Kerchunk user code currently passes around an obscure multiply-nested "reference dict" object. This is hard to read, interrogate, validate, or reason about.
Suggestion
Instead, create a new `VirtualZarrStore` dataclass that contains all the information currently stored in the reference dict, but in a more structured form. This would then be the principal object passed between user calls to the kerchunk API.
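To make the suggestion concrete, here is a minimal sketch, assuming the version 1 reference layout (`{"version": 1, "refs": {...}}`) and only the three-element `[path, offset, length]` form of chunk references. The names `ChunkRef`, `from_reference_dict`, `metadata` and `chunks` are hypothetical and not part of the kerchunk API, and the URL and byte ranges are made up.

```python
import json
from dataclasses import dataclass, field
from typing import Any

# A small reference dict of the kind currently passed around (illustrative values).
reference_dict = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "air/.zarray": json.dumps({
            "shape": [4, 2], "chunks": [2, 2], "dtype": "<f4",
            "compressor": None, "fill_value": None, "filters": None,
            "order": "C", "zarr_format": 2,
        }),
        "air/.zattrs": '{"units": "K"}',
        "air/0.0": ["s3://bucket/file.nc", 1024, 16],
        "air/1.0": ["s3://bucket/file.nc", 1040, 16],
    },
}


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical: where the bytes of one chunk live."""
    path: str
    offset: int
    length: int


@dataclass
class VirtualZarrStore:
    """Hypothetical structured equivalent of the reference dict."""
    metadata: dict[str, dict[str, Any]] = field(default_factory=dict)
    chunks: dict[str, ChunkRef] = field(default_factory=dict)

    @classmethod
    def from_reference_dict(cls, d: dict) -> "VirtualZarrStore":
        store = cls()
        for key, value in d["refs"].items():
            name = key.rsplit("/", 1)[-1]
            if name.startswith(".z"):
                # Zarr metadata documents (.zgroup, .zarray, .zattrs) are JSON.
                store.metadata[key] = json.loads(value) if isinstance(value, str) else value
            else:
                # Assumption: every chunk is a [path, offset, length] triple.
                path, offset, length = value
                store.chunks[key] = ChunkRef(path, offset, length)
        return store


store = VirtualZarrStore.from_reference_dict(reference_dict)
print(store.metadata["air/.zarray"]["shape"])
print(store.chunks["air/1.0"])
```

With this shape, questions like "what is the shape of `air`?" or "where does chunk `1.0` live?" become attribute lookups rather than string parsing inside a nested dict.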
Advantages
- Easier to read and interrogate than multiply-nested dicts
- Allows direct validation
- Serializes in obvious ways (via `.to_json`, `.to_parquet`, `.to_dict`, or similar)
- Easier to write tests, by using fixtures to generate `VirtualZarrStore` objects
- Concentrates concerns over changes/enhancements to the Zarr spec in one class
- A v2->v3 converter could act directly on these objects
- Possibly easier to understand whenever anyone reimplements kerchunk in other languages?
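The validation, serialization and testing points above could look roughly like the following sketch. All names here (`ChunkRef`, `VirtualZarrStore`, `to_dict`, `to_json`, `make_store`) are hypothetical, and the validation rule is only an example of the kind of check a dataclass makes easy. Having `to_dict` emit the existing version 1 reference layout is one assumed way of staying compatible with current consumers.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical: location of one chunk, validated on construction."""
    path: str
    offset: int
    length: int

    def __post_init__(self):
        if self.offset < 0 or self.length < 0:
            raise ValueError(
                f"invalid byte range: offset={self.offset}, length={self.length}"
            )


@dataclass
class VirtualZarrStore:
    """Hypothetical structured container for metadata and chunk locations."""
    metadata: dict = field(default_factory=dict)
    chunks: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        # Emit the existing reference-dict layout (assumed version 1).
        refs = {key: json.dumps(doc) for key, doc in self.metadata.items()}
        refs.update(
            {key: [c.path, c.offset, c.length] for key, c in self.chunks.items()}
        )
        return {"version": 1, "refs": refs}

    def to_json(self) -> str:
        return json.dumps(self.to_dict())


def make_store() -> VirtualZarrStore:
    """Plays the role of a test fixture; the values are made up."""
    return VirtualZarrStore(
        metadata={".zgroup": {"zarr_format": 2}},
        chunks={"air/0.0": ChunkRef("s3://bucket/file.nc", 1024, 16)},
    )


store = make_store()
print(store.to_json())
```

In a real test suite `make_store` would presumably be a `pytest` fixture, and a bad byte range would fail at construction time instead of surfacing later as a confusing read error.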
Implementation ideas
- Implementation could subclass Zarr Object Model (ZOM) classes (where `.to_json` is analogous to the ZOM's `.serialize`), which would then be solidified as the recommended abstract representation once ZEP006 is accepted
- Can't use a bare ZOM class, because we need to add some extra attributes for byte ranges etc. However, information on where to find chunks is essentially a "Chunk Manifest", a generalizable idea that @jhamman has also been working on (for a nascent ZEP007??)
- Attributes of this dataclass need to always be serializable, so the `VirtualZarrStore` should basically be a JSON schema (see #373)
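As a rough sketch of the "Chunk Manifest" idea, the chunk locations for one array could be kept as a plain JSON-serializable mapping, separate from the array metadata. The entry layout (`path`, `offset`, `length`) and the `validate_manifest` helper are assumptions made for illustration; they are not taken from any accepted ZEP or from the kerchunk API.

```python
import json
from typing import TypedDict


class ManifestEntry(TypedDict):
    """Hypothetical entry: the byte range holding one chunk."""
    path: str
    offset: int
    length: int


def validate_manifest(manifest: dict) -> None:
    """Check that every entry has exactly the expected keys and types."""
    for key, entry in manifest.items():
        if set(entry) != {"path", "offset", "length"}:
            raise ValueError(f"{key}: unexpected fields {sorted(entry)}")
        if not isinstance(entry["path"], str):
            raise ValueError(f"{key}: path must be a string")
        for name in ("offset", "length"):
            value = entry[name]
            # type(...) is int also rejects bools, which are a subclass of int.
            if type(value) is not int or value < 0:
                raise ValueError(f"{key}: {name} must be a non-negative integer")


# Manifest for a single array, keyed by chunk index (illustrative values).
manifest: dict[str, ManifestEntry] = {
    "0.0": {"path": "s3://bucket/file.nc", "offset": 1024, "length": 16},
    "1.0": {"path": "s3://bucket/file.nc", "offset": 1040, "length": 16},
}

validate_manifest(manifest)

# Only strings and integers are involved, so the manifest survives a JSON round trip.
round_tripped = json.loads(json.dumps(manifest))
print(round_tripped == manifest)
```

Because the manifest holds only strings and integers, it stays trivially serializable, which fits the requirement that every attribute of the dataclass be expressible in a JSON schema.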
Questions
- Is it possible to do this in a broadly backwards-compatible manner?