zarr-developers · d-v-b · Feb 21, 2025 · Feb 24, 2025 · Feb 24, 2025 · Feb 26, 2025
diff --git a/changes/2874.feature.rst b/changes/2874.feature.rst
@@ -0,0 +1,9 @@
+Adds zarr-specific data type classes. This replaces the internal use of numpy data types for zarr
+v2 and a fixed set of string enums for zarr v3. This change is largely internal, but it does
+change the type of the ``dtype`` and ``data_type`` fields on the ``ArrayV2Metadata`` and
+``ArrayV3Metadata`` classes. It also changes the JSON metadata representation of the
+variable-length string data type, but the old metadata representation can still be
+used when reading arrays. The logic for automatically choosing the chunk encoding for a given data
+type has also changed, and this necessitated changes to the ``config`` API.
+
+For more on this new feature, see the `documentation </user-guide/data_types.html>`_
diff --git a/docs/user-guide/arrays.rst b/docs/user-guide/arrays.rst
@@ -182,7 +182,7 @@ which can be used to print useful diagnostics, e.g.::
    >>> z.info
    Type               : Array
    Zarr format        : 3
-   Data type          : DataType.int32
+   Data type          : Int32(endianness='little')
    Fill value         : 0
    Shape              : (10000, 10000)
    Chunk shape        : (1000, 1000)
@@ -200,7 +200,7 @@ prints additional diagnostics, e.g.::
    >>> z.info_complete()
    Type               : Array
    Zarr format        : 3
-   Data type          : DataType.int32
+   Data type          : Int32(endianness='little')
    Fill value         : 0
    Shape              : (10000, 10000)
    Chunk shape        : (1000, 1000)
@@ -248,7 +248,7 @@ built-in delta filter::
 The default compressor can be changed by setting the value of the using Zarr's
 :ref:`user-guide-config`, e.g.::
 
-   >>> with zarr.config.set({'array.v2_default_compressor.numeric': {'id': 'blosc'}}):
+   >>> with zarr.config.set({'array.v2_default_compressor.default': {'id': 'blosc'}}):
    ...     z = zarr.create_array(store={}, shape=(100000000,), chunks=(1000000,), dtype='int32', zarr_format=2)
    >>> z.filters
    ()
@@ -288,7 +288,7 @@ Here is an example using a delta filter with the Blosc compressor::
    >>> z.info
    Type               : Array
    Zarr format        : 3
-   Data type          : DataType.int32
+   Data type          : Int32(endianness='little')
    Fill value         : 0
    Shape              : (10000, 10000)
    Chunk shape        : (1000, 1000)
@@ -603,7 +603,7 @@ Sharded arrays can be created by providing the ``shards`` parameter to :func:`za
   >>> a.info_complete()
   Type               : Array
   Zarr format        : 3
-  Data type          : DataType.uint8
+  Data type          : UInt8()
   Fill value         : 0
   Shape              : (10000, 10000)
   Shard shape        : (1000, 1000)
@@ -612,10 +612,10 @@ Sharded arrays can be created by providing the ``shards`` parameter to :func:`za
   Read-only          : False
   Store type         : LocalStore
   Filters            : ()
-  Serializer         : BytesCodec(endian=<Endian.little: 'little'>)
+  Serializer         : BytesCodec(endian=None)
   Compressors        : (ZstdCodec(level=0, checksum=False),)
   No. bytes          : 100000000 (95.4M)
-  No. bytes stored   : 3981552
+  No. bytes stored   : 3981473
   Storage ratio      : 25.1
   Shards Initialized : 100
 

diff --git a/docs/user-guide/config.rst b/docs/user-guide/config.rst
@@ -43,39 +43,30 @@ This is the current default configuration::
 
    >>> zarr.config.pprint()
    {'array': {'order': 'C',
-              'v2_default_compressor': {'bytes': {'checksum': False,
-                                                  'id': 'zstd',
-                                                  'level': 0},
-                                        'numeric': {'checksum': False,
-                                                    'id': 'zstd',
-                                                    'level': 0},
-                                        'string': {'checksum': False,
+            'v2_default_compressor': {'default': {'checksum': False,
                                                    'id': 'zstd',
-                                                   'level': 0}},
-              'v2_default_filters': {'bytes': [{'id': 'vlen-bytes'}],
-                                     'numeric': None,
-                                     'raw': None,
-                                     'string': [{'id': 'vlen-utf8'}]},
-              'v3_default_compressors': {'bytes': [{'configuration': {'checksum': False,
-                                                                      'level': 0},
-                                                    'name': 'zstd'}],
-                                         'numeric': [{'configuration': {'checksum': False,
+                                                   'level': 0},
+                                       'variable-length-string': {'checksum': False,
+                                                                  'id': 'zstd',
+                                                                  'level': 0}},
+            'v2_default_filters': {'default': None,
+                                    'variable-length-string': [{'id': 'vlen-utf8'}]},
+            'v3_default_compressors': {'default': [{'configuration': {'checksum': False,
                                                                         'level': 0},
                                                       'name': 'zstd'}],
-                                         'string': [{'configuration': {'checksum': False,
-                                                                       'level': 0},
-                                                     'name': 'zstd'}]},
-              'v3_default_filters': {'bytes': [], 'numeric': [], 'string': []},
-              'v3_default_serializer': {'bytes': {'name': 'vlen-bytes'},
-                                        'numeric': {'configuration': {'endian': 'little'},
-                                                    'name': 'bytes'},
-                                        'string': {'name': 'vlen-utf8'}},
-              'write_empty_chunks': False},
-    'async': {'concurrency': 10, 'timeout': None},
-    'buffer': 'zarr.core.buffer.cpu.Buffer',
-    'codec_pipeline': {'batch_size': 1,
-                       'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'},
-    'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec',
+                                       'variable-length-string': [{'configuration': {'checksum': False,
+                                                                                       'level': 0},
+                                                                     'name': 'zstd'}]},
+            'v3_default_filters': {'default': [], 'variable-length-string': []},
+            'v3_default_serializer': {'default': {'configuration': {'endian': 'little'},
+                                                   'name': 'bytes'},
+                                       'variable-length-string': {'name': 'vlen-utf8'}},
+            'write_empty_chunks': False},
+   'async': {'concurrency': 10, 'timeout': None},
+   'buffer': 'zarr.core.buffer.cpu.Buffer',
+   'codec_pipeline': {'batch_size': 1,
+                     'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'},
+   'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec',
                'bytes': 'zarr.codecs.bytes.BytesCodec',
                'crc32c': 'zarr.codecs.crc32c_.Crc32cCodec',
                'endian': 'zarr.codecs.bytes.BytesCodec',
@@ -85,7 +76,7 @@ This is the current default configuration::
                'vlen-bytes': 'zarr.codecs.vlen_utf8.VLenBytesCodec',
                'vlen-utf8': 'zarr.codecs.vlen_utf8.VLenUTF8Codec',
                'zstd': 'zarr.codecs.zstd.ZstdCodec'},
-    'default_zarr_format': 3,
-    'json_indent': 2,
-    'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer',
-    'threading': {'max_workers': None}}
+   'default_zarr_format': 3,
+   'json_indent': 2,
+   'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer',
+   'threading': {'max_workers': None}}
diff --git a/docs/user-guide/consolidated_metadata.rst b/docs/user-guide/consolidated_metadata.rst
@@ -47,7 +47,7 @@ that can be used.:
    >>> from pprint import pprint
    >>> pprint(dict(sorted(consolidated_metadata.items())))
    {'a': ArrayV3Metadata(shape=(1,),
-                          data_type=<DataType.float64: 'float64'>,
+                          data_type=Float64(endianness='little'),
                           chunk_grid=RegularChunkGrid(chunk_shape=(1,)),
                           chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
                                                                      separator='/'),
@@ -60,7 +60,7 @@ that can be used.:
                           node_type='array',
                           storage_transformers=()),
      'b': ArrayV3Metadata(shape=(2, 2),
-                          data_type=<DataType.float64: 'float64'>,
+                          data_type=Float64(endianness='little'),
                           chunk_grid=RegularChunkGrid(chunk_shape=(2, 2)),
                           chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
                                                                      separator='/'),
@@ -73,7 +73,7 @@ that can be used.:
                           node_type='array',
                           storage_transformers=()),
      'c': ArrayV3Metadata(shape=(3, 3, 3),
-                          data_type=<DataType.float64: 'float64'>,
+                          data_type=Float64(endianness='little'),
                           chunk_grid=RegularChunkGrid(chunk_shape=(3, 3, 3)),
                           chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
                                                                      separator='/'),

diff --git a/docs/user-guide/data_types.rst b/docs/user-guide/data_types.rst
@@ -0,0 +1,172 @@
+Data types
+==========
+
+Zarr's data type model
+----------------------
+
+Every Zarr array has a "data type", which defines the meaning and physical layout of the
+array's elements. As Zarr Python is tightly integrated with `NumPy <https://numpy.org/doc/stable/>`_,
+it's easy to create arrays with NumPy data types:
+
+.. code-block:: python
+
+  >>> import zarr
+  >>> import numpy as np
+  >>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
+  >>> z
+  <Array memory:... shape=(10,) dtype=uint8>
+
+Unlike NumPy arrays, Zarr arrays are designed to be accessed by Zarr
+implementations in different programming languages. This means Zarr data types must be interpreted
+correctly when clients read an array. Each Zarr data type defines procedures for
+encoding and decoding both the data type itself and scalars of that data type to and from
+Zarr array metadata. These serialization procedures depend on the Zarr format.
+
+Data types in Zarr version 2
+-----------------------------
+
+Version 2 of the Zarr format defined its data types relative to
+`NumPy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_,
+and added a few non-NumPy data types as well. Thus the JSON identifier for a NumPy-compatible data
+type is just the NumPy ``str`` attribute of that data type:
+
+.. code-block:: python
+
+  >>> import zarr
+  >>> import numpy as np
+  >>> import json
+  >>>
+  >>> store = {}
+  >>> np_dtype = np.dtype('int64')
+  >>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
+  >>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
+  >>> dtype_meta
+  '<i8'
+  >>> assert dtype_meta == np_dtype.str
+
+.. note::
+   The ``<`` character in the data type metadata encodes the
+   `endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_,
+   or "byte order", of the data type. Following NumPy's example,
+   in Zarr version 2 each data type has an endianness where applicable.
+   However, Zarr version 3 data types do not store endianness information.
+
+In addition to defining a representation of the data type itself (which in the example above was
+just a simple string ``"<i8"``), Zarr also
+defines a metadata representation for scalars associated with each data type. This is necessary
+because Zarr arrays have a ``JSON``-serializable ``fill_value`` attribute that defines a scalar value to use when reading
+uninitialized chunks of a Zarr array.
+Integer and float scalars are stored as ``JSON`` numbers, except for special floats like ``NaN``,
+positive infinity, and negative infinity, which are stored as strings.
+
+More broadly, each Zarr data type defines its own rules for how scalars of that type are stored in
+``JSON``.
+
+
+Data types in Zarr version 3
+-----------------------------
+
+Zarr V3 brings several key changes to how data types are represented:
+
+- Zarr V3 identifies the basic data types as strings like ``"int8"``, ``"int16"``, etc.
+
+  By contrast, Zarr V2 uses the NumPy character code representation for data types.
+  For example, in Zarr V2 ``int8`` is represented as ``"|i1"``.
+- A Zarr V3 data type does not have endianness. This is a departure from Zarr V2, where multi-byte
+  data types are defined with endianness information. Instead, Zarr V3 requires that endianness,
+  where applicable, is specified in the ``codecs`` attribute of array metadata.
+- While some Zarr V3 data types are identified by strings, others can be identified by a ``JSON``
+  object. For example, consider this specification of a ``datetime`` data type:
+
+  .. code-block:: json
+
+    {
+      "name": "numpy.datetime64",
+      "configuration": {
+          "unit": "s",
+          "scale_factor": 10
+        }
+    }
+
+
+  Zarr V2 generally uses structured string representations to convey the same information. The
+  data type given in the previous example would be represented as the string ``">M8[10s]"`` in
+  Zarr V2. This is more compact, but can be harder to parse.
+
+For more about data types in Zarr V3, see the
+`V3 specification <https://zarr-specs.readthedocs.io/en/latest/v3/data-types/index.html>`_.
+
+Data types in Zarr Python
+-------------------------
+
+The two Zarr formats that Zarr Python supports specify data types in different ways:
+data types in Zarr version 2 are encoded as NumPy-compatible strings, while data types in Zarr version
+3 are encoded as either strings or ``JSON`` objects. Unlike Zarr V2 data types, Zarr V3 data types
+do not carry any endianness information.
+
+To abstract over these syntactical and semantic differences, Zarr Python uses a class called
+`ZDType <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_ to provide Zarr V2 and Zarr V3 compatibility
+routines for "native" data types. In this context, a "native" data type is a Python class,
+typically defined in another library, that models an array's data type. For example, ``np.uint8`` is a native
+data type defined in NumPy, which Zarr Python wraps with a ``ZDType`` instance called
+`UInt8 <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_.
+
+Each data type supported by Zarr Python is modeled by a ``ZDType`` subclass, which provides an
+API for the following operations:
+
+- Wrapping / unwrapping a native data type
+- Encoding / decoding a data type to / from Zarr V2 and Zarr V3 array metadata
+- Encoding / decoding a scalar value to / from Zarr V2 and Zarr V3 array metadata
+
+
+Example Usage
+~~~~~~~~~~~~~
+
+Create a ``ZDType`` from a native data type:
+
+.. code-block:: python
+
+  >>> from zarr.core.dtype import Int8
+  >>> import numpy as np
+  >>> int8 = Int8.from_native_dtype(np.dtype('int8'))
+
+Convert back to native data type:
+
+.. code-block:: python
+
+  >>> native_dtype = int8.to_native_dtype()
+  >>> assert native_dtype == np.dtype('int8')
+
+Get the default scalar value for the data type:
+
+.. code-block:: python
+
+  >>> default_value = int8.default_scalar()
+  >>> assert default_value == np.int8(0)
+
+
+Serialize to JSON for Zarr V2 and V3:
+
+.. code-block:: python
+
+  >>> json_v2 = int8.to_json(zarr_format=2)
+  >>> json_v2
+  '|i1'
+  >>> json_v3 = int8.to_json(zarr_format=3)
+  >>> json_v3
+  'int8'
+
+Serialize a scalar value to JSON:
+
+.. code-block:: python
+
+  >>> json_value = int8.to_json_scalar(42, zarr_format=3)
+  >>> json_value
+  42
+
+Deserialize a scalar value from JSON:
+
+.. code-block:: python
+
+  >>> scalar_value = int8.from_json_scalar(42, zarr_format=3)
+  >>> assert scalar_value == np.int8(42)
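
The scalar-encoding rule described in `data_types.rst` above (floats stored as JSON numbers, with `NaN` and the two infinities stored as strings) can be sketched with the standard library alone. This is a minimal illustration, not Zarr Python's implementation: the helper names `encode_float_fill_value` and `decode_float_fill_value` are hypothetical, and the string spellings `"NaN"`, `"Infinity"` and `"-Infinity"` are assumed from the Zarr format specifications rather than taken from this diff.

```python
import json
import math
from typing import Union

# Hypothetical helpers for illustration only; they are not part of Zarr Python.
# Assumed spellings for special floats: "NaN", "Infinity", "-Infinity".
_SPECIAL_FLOATS = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}


def encode_float_fill_value(value: float) -> Union[float, str]:
    """Return a JSON-safe representation of a float fill value."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    # Ordinary floats are stored as plain JSON numbers.
    return value


def decode_float_fill_value(data: Union[float, int, str]) -> float:
    """Invert encode_float_fill_value for a value read from JSON metadata."""
    if isinstance(data, str):
        # Raises KeyError for any string that is not a known special float.
        return _SPECIAL_FLOATS[data]
    return float(data)


# Round-trip a NaN fill value through a JSON document.
encoded = json.dumps({"fill_value": encode_float_fill_value(float("nan"))})
decoded = decode_float_fill_value(json.loads(encoded)["fill_value"])
```

Encoding the special values as strings keeps the metadata document valid JSON, which has no literal for `NaN` or infinity. The `to_json_scalar` and `from_json_scalar` methods shown in the diff play this role for each `ZDType`.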
diff --git a/docs/user-guide/groups.rst b/docs/user-guide/groups.rst
@@ -128,7 +128,7 @@ property. E.g.::
    >>> bar.info_complete()
    Type               : Array
    Zarr format        : 3
-   Data type          : DataType.int64
+   Data type          : Int64(endianness='little')
    Fill value         : 0
    Shape              : (1000000,)
    Chunk shape        : (100000,)
@@ -145,7 +145,7 @@ property. E.g.::
    >>> baz.info
    Type               : Array
    Zarr format        : 3
-   Data type          : DataType.float32
+   Data type          : Float32(endianness='little')
    Fill value         : 0.0
    Shape              : (1000, 1000)
    Chunk shape        : (100, 100)

diff --git a/docs/user-guide/index.rst b/docs/user-guide/index.rst
@@ -8,6 +8,7 @@ User guide
 
     installation
     arrays
+    data_types
     groups
     attributes
     storage