Dataclass for "VirtualZarrStore" #375

Open
@TomNicholas

Problem

Kerchunk user code currently passes around an obscure multiply-nested "reference dict" object. This is hard to read, interrogate, validate, or reason about.

Suggestion

Instead, create a new VirtualZarrStore dataclass that contains all the information currently stored in the reference dict, but in a more structured manner. This would then be the principal object that gets passed around between user calls to the kerchunk API.
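
To make the suggestion concrete, here is a rough sketch of what such a dataclass might look like. Everything in it is hypothetical: the names `ChunkRef`, `metadata`, `chunks` and `from_reference_dict` are placeholders, not an existing kerchunk API. It also assumes the simplest form of the reference dict, where string values are JSON-encoded Zarr metadata and list values are `[path, offset, length]` triples; inlined chunk data and whole-file references are ignored.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChunkRef:
    """Location of one chunk's bytes inside another file (hypothetical name)."""
    path: str
    offset: int
    length: int


@dataclass
class VirtualZarrStore:
    """Hypothetical structured equivalent of a kerchunk reference dict."""
    # Parsed Zarr metadata documents, keyed by store key (e.g. "air/.zarray")
    metadata: dict[str, dict] = field(default_factory=dict)
    # Byte-range references, keyed by chunk key (e.g. "air/0.0")
    chunks: dict[str, ChunkRef] = field(default_factory=dict)

    @classmethod
    def from_reference_dict(cls, refs: dict) -> "VirtualZarrStore":
        store = cls()
        for key, value in refs["refs"].items():
            if isinstance(value, str):
                # Assumption: every string value is JSON-encoded metadata
                store.metadata[key] = json.loads(value)
            else:
                # Assumption: every list value is [path, offset, length]
                path, offset, length = value
                store.chunks[key] = ChunkRef(path, offset, length)
        return store


# Made-up reference dict, for illustration only
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "air/.zarray": '{"shape": [4, 4], "chunks": [2, 2], "dtype": "<f4"}',
        "air/0.0": ["s3://bucket/air.nc", 1024, 16],
    },
}
store = VirtualZarrStore.from_reference_dict(refs)
print(store.chunks["air/0.0"].offset)
```

With this shape, a question like "where does chunk `air/0.0` live?" becomes an attribute lookup rather than dict-and-list indexing.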

Advantages

  • Easier to read and interrogate than multiply-nested dicts
  • Allows direct validation
  • Serializes in obvious ways (via .to_json, .to_parquet, .to_dict, or similar)
  • Easier to write tests, by using fixtures to generate VirtualZarrStore objects
  • Concentrates concerns over changes/enhancements to Zarr Spec in one class
  • A v2->v3 converter could act directly on these objects
  • Possibly easier to understand whenever anyone reimplements kerchunk in other languages?
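
As a minimal illustration of the validation and serialization points, assuming a hypothetical `ChunkRef` dataclass that holds one byte-range reference (the class and its method names are invented for this sketch, not part of kerchunk):

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkRef:
    """One byte-range reference (hypothetical name)."""
    path: str
    offset: int
    length: int

    def __post_init__(self):
        # Validation happens once, at construction time
        if not self.path:
            raise ValueError("path must be non-empty")
        if self.offset < 0 or self.length < 0:
            raise ValueError("offset and length must be non-negative")

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "ChunkRef":
        return cls(**json.loads(text))


# Round trip through JSON preserves equality
ref = ChunkRef("s3://bucket/air.nc", 1024, 16)
assert ChunkRef.from_json(ref.to_json()) == ref

# A malformed reference is rejected where it is created
try:
    ChunkRef("s3://bucket/air.nc", -1, 16)
except ValueError as err:
    print(f"rejected: {err}")
```

A test fixture could then hand out ready-made, already-validated objects instead of hand-written nested dicts.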

Implementation ideas

  • Implementation could subclass Zarr Object Model classes (where .to_json is analogous to the ZOM's .serialize), which then would be solidified as the recommended abstract representation once ZEP006 is accepted
  • Can't use a bare ZOM class because we need to add some extra attributes for byte ranges etc. However, the information on where to find chunks is essentially a "Chunk Manifest", a generalizable idea that @jhamman has also been working on (for a nascent ZEP007??)
  • Attributes of this dataclass always need to be serializable, so the VirtualZarrStore should basically be a JSON schema (see #373)
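
One possible shape for the subclassing idea, sketched under heavy assumptions: `ArrayMetadata` below is a made-up stand-in for whatever array class the Zarr Object Model ends up providing, and the manifest layout (chunk key mapped to a `(path, offset, length)` tuple) is only one guess at what a chunk manifest might contain.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ArrayMetadata:
    """Hypothetical stand-in for a Zarr Object Model array class."""
    shape: tuple[int, ...]
    chunks: tuple[int, ...]
    dtype: str


@dataclass
class VirtualArray(ArrayMetadata):
    """Array metadata plus a chunk manifest (hypothetical layout)."""
    # chunk key -> (path, offset, length)
    manifest: dict[str, tuple[str, int, int]] = field(default_factory=dict)

    def to_json(self) -> str:
        # json.dumps raises TypeError if any attribute is not JSON-serializable,
        # which gives a cheap check on the "always serializable" requirement
        return json.dumps(asdict(self))


arr = VirtualArray(
    shape=(4, 4),
    chunks=(2, 2),
    dtype="<f4",
    manifest={"0.0": ("s3://bucket/air.nc", 1024, 16)},
)
print(arr.to_json())
```

Keeping the manifest as a single extra attribute means the rest of the class stays identical to the plain metadata class, which might make it easier to swap in a real ZOM base class later.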

Questions

  • Is it possible to do this in a broadly backwards-compatible manner?
