refactor v3 data types #2874
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
Adds zarr-specific data type classes. This replaces the internal use of numpy data types for zarr | ||
v2 and a fixed set of string enums for zarr v3. This change is largely internal, but it does | ||
change the type of the ``dtype`` and ``data_type`` fields on the ``ArrayV2Metadata`` and | ||
``ArrayV3Metadata`` classes. It also changes the JSON metadata representation of the | ||
variable-length string data type, but the old metadata representation can still be | ||
used when reading arrays. The logic for automatically choosing the chunk encoding for a given data | ||
type has also changed, and this necessitated changes to the ``config`` API. | ||
|
||
For more on this new feature, see the `documentation </user-guide/data_types.html>`_ |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -43,39 +43,30 @@ This is the current default configuration:: | |
|
||
>>> zarr.config.pprint() | ||
{'array': {'order': 'C', | ||
'v2_default_compressor': {'bytes': {'checksum': False, | ||
If I manually set the config to this old default value (which I could do in the current v3 branch), does it work properly after this PR? I guess the bigger question here is, are there any breaking changes to what is/isn't allowed in the config with this PR?

no, the config in this PR has undergone breaking changes compared to

Okay, in that case the release notes definitely need expanding a lot to explain what the breaking changes are.

My two cents on breaking changes is we should definitely deprecate where possible, because v3 was already a big breaking change that users (well, at least me 😄 ) are struggling to get used to, so to have more breaking changes without deprecations and migration paths would not be great.

Agreed, we just need to sketch out how to do deprecations and migrations in our (terrible, IMO) config API

"terrible" is an exaggeration -- our config API works today, but it has some flaws that make me think it should be overhauled
I'm not sure how many of these things can be addressed within the scope of donfig itself?
'id': 'zstd', | ||
'level': 0}, | ||
'numeric': {'checksum': False, | ||
'id': 'zstd', | ||
'level': 0}, | ||
'string': {'checksum': False, | ||
'v2_default_compressor': {'default': {'checksum': False, | ||
'id': 'zstd', | ||
'level': 0}}, | ||
'v2_default_filters': {'bytes': [{'id': 'vlen-bytes'}], | ||
'numeric': None, | ||
'raw': None, | ||
'string': [{'id': 'vlen-utf8'}]}, | ||
'v3_default_compressors': {'bytes': [{'configuration': {'checksum': False, | ||
'level': 0}, | ||
'name': 'zstd'}], | ||
'numeric': [{'configuration': {'checksum': False, | ||
'level': 0}, | ||
'variable-length-string': {'checksum': False, | ||
'id': 'zstd', | ||
'level': 0}}, | ||
'v2_default_filters': {'default': None, | ||
'variable-length-string': [{'id': 'vlen-utf8'}]}, | ||
'v3_default_compressors': {'default': [{'configuration': {'checksum': False, | ||
'level': 0}, | ||
'name': 'zstd'}], | ||
'string': [{'configuration': {'checksum': False, | ||
'level': 0}, | ||
'name': 'zstd'}]}, | ||
'v3_default_filters': {'bytes': [], 'numeric': [], 'string': []}, | ||
'v3_default_serializer': {'bytes': {'name': 'vlen-bytes'}, | ||
'numeric': {'configuration': {'endian': 'little'}, | ||
'name': 'bytes'}, | ||
'string': {'name': 'vlen-utf8'}}, | ||
'write_empty_chunks': False}, | ||
'async': {'concurrency': 10, 'timeout': None}, | ||
'buffer': 'zarr.core.buffer.cpu.Buffer', | ||
'codec_pipeline': {'batch_size': 1, | ||
'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'}, | ||
'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec', | ||
'variable-length-string': [{'configuration': {'checksum': False, | ||
'level': 0}, | ||
'name': 'zstd'}]}, | ||
'v3_default_filters': {'default': [], 'variable-length-string': []}, | ||
'v3_default_serializer': {'default': {'configuration': {'endian': 'little'}, | ||
'name': 'bytes'}, | ||
'variable-length-string': {'name': 'vlen-utf8'}}, | ||
'write_empty_chunks': False}, | ||
'async': {'concurrency': 10, 'timeout': None}, | ||
'buffer': 'zarr.core.buffer.cpu.Buffer', | ||
'codec_pipeline': {'batch_size': 1, | ||
'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'}, | ||
'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec', | ||
'bytes': 'zarr.codecs.bytes.BytesCodec', | ||
'crc32c': 'zarr.codecs.crc32c_.Crc32cCodec', | ||
'endian': 'zarr.codecs.bytes.BytesCodec', | ||
|
@@ -85,7 +76,7 @@ This is the current default configuration:: | |
'vlen-bytes': 'zarr.codecs.vlen_utf8.VLenBytesCodec', | ||
'vlen-utf8': 'zarr.codecs.vlen_utf8.VLenUTF8Codec', | ||
'zstd': 'zarr.codecs.zstd.ZstdCodec'}, | ||
'default_zarr_format': 3, | ||
'json_indent': 2, | ||
'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer', | ||
'threading': {'max_workers': None}} | ||
'default_zarr_format': 3, | ||
'json_indent': 2, | ||
'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer', | ||
'threading': {'max_workers': None}} |
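The new configuration keys each default section by ``default`` plus specific data type names such as ``variable-length-string``. The following is a minimal sketch of how such a section could be consulted, assuming a plain dictionary lookup with a fallback to ``default``. The helper ``pick_default`` and the lookup key ``"int64"`` are hypothetical and are not part of zarr's API; only the dictionary contents are copied from the configuration shown above.

```python
# Hypothetical helper for illustration only; zarr's real lookup logic
# is not shown in this diff and may differ.
def pick_default(section, dtype_name):
    """Return the entry for dtype_name, falling back to the 'default' entry."""
    return section.get(dtype_name, section["default"])


# Values copied from the 'v3_default_serializer' section of the new config.
v3_default_serializer = {
    "default": {"name": "bytes", "configuration": {"endian": "little"}},
    "variable-length-string": {"name": "vlen-utf8"},
}

# "int64" has no dedicated entry, so the 'default' entry is used.
int_serializer = pick_default(v3_default_serializer, "int64")
# The variable-length string type has its own entry.
str_serializer = pick_default(v3_default_serializer, "variable-length-string")

print(int_serializer["name"])
print(str_serializer["name"])
```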
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,172 @@ | ||
Data types | ||
This file is a super useful read. I'm wondering what to do with it though. Were you thinking it would go under the

No strong opinion from me. IMO our docs right now are not the most logically organized, so I anticipate some churn there in any case.
========== | ||
|
||
Zarr's data type model | ||
---------------------- | ||
|
||
Every Zarr array has a "data type", which defines the meaning and physical layout of the | ||
array's elements. As Zarr Python is tightly integrated with `NumPy <https://numpy.org/doc/stable/>`_, | ||
it's easy to create arrays with NumPy data types: | ||
|
||
.. code-block:: python | ||
|
||
>>> import zarr | ||
>>> import numpy as np | ||
>>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8')) | ||
>>> z | ||
<Array memory:... shape=(10,) dtype=uint8> | ||
|
||
Unlike NumPy arrays, Zarr arrays are designed to be accessed by Zarr | ||
implementations in different programming languages. This means Zarr data types must be interpreted | ||
correctly when clients read an array. Each Zarr data type defines procedures for | ||
encoding and decoding both the data type itself and scalars of that data type to and from Zarr | ||
array metadata. These serialization procedures depend on the Zarr format. | ||
|
||
Data types in Zarr version 2 | ||
----------------------------- | ||
|
||
Version 2 of the Zarr format defined its data types relative to | ||
`NumPy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_, | ||
and added a few non-NumPy data types as well. Thus the JSON identifier for a NumPy-compatible data | ||
type is just the NumPy ``str`` attribute of that data type: | ||
|
||
.. code-block:: python | ||
|
||
>>> import zarr | ||
>>> import numpy as np | ||
>>> import json | ||
>>> | ||
>>> store = {} | ||
>>> np_dtype = np.dtype('int64') | ||
>>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2) | ||
>>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"] | ||
>>> dtype_meta | ||
'<i8' | ||
>>> assert dtype_meta == np_dtype.str | ||
|
||
.. note:: | ||
The ``<`` character in the data type metadata encodes the | ||
`endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_, | ||
or "byte order", of the data type. Following NumPy's example, | ||
in Zarr version 2 each data type has an endianness where applicable. | ||
However, Zarr version 3 data types do not store endianness information. | ||
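To make the role of the byte-order character concrete, here is a small standard-library sketch. The helper ``parse_v2_dtype`` is hypothetical (it is not a zarr or NumPy function) and only handles the simple "byte order + kind + item size" form such as ``"<i8"``. Python's ``struct`` module happens to use the same ``<`` and ``>`` prefixes, so it can show how the same integer is laid out under each byte order.

```python
import struct


def parse_v2_dtype(dtype_str):
    """Split a simple Zarr V2 / NumPy-style dtype string into its parts.

    Hypothetical helper for illustration; it does not handle structured,
    datetime, or string data types.
    """
    byte_order = dtype_str[0]   # '<' little-endian, '>' big-endian, '|' not applicable
    kind = dtype_str[1]         # e.g. 'i' signed integer, 'u' unsigned, 'f' float
    size = int(dtype_str[2:])   # item size in bytes
    return byte_order, kind, size


byte_order, kind, size = parse_v2_dtype("<i8")

# 'q' is struct's format code for a signed 8-byte integer.
little = struct.pack("<q", 1)  # least significant byte first
big = struct.pack(">q", 1)     # most significant byte first

print(byte_order, kind, size)
print(little.hex(), big.hex())
```

The two byte strings contain the same bytes in opposite order, which is why a reader must know the byte order before decoding a multi-byte value.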
|
||
In addition to defining a representation of the data type itself (which in the example above was | ||
just a simple string ``"<i8"``), Zarr also | ||
defines a metadata representation for scalars associated with each data type. This is necessary | ||
because Zarr arrays have a ``JSON``-serializable ``fill_value`` attribute that defines a scalar value to use when reading | ||
uninitialized chunks of a Zarr array. | ||
Integer and float scalars are stored as ``JSON`` numbers, except for special floats like ``NaN``, | ||
positive infinity, and negative infinity, which are stored as strings. | ||
|
||
More broadly, each Zarr data type defines its own rules for how scalars of that type are stored in | ||
``JSON``. | ||
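As an illustration of the float rule described above, the following standard-library sketch encodes and decodes a float ``fill_value``. The function names are hypothetical and this is not zarr's implementation; the string spellings ``"NaN"``, ``"Infinity"``, and ``"-Infinity"`` are the ones given in the Zarr V2 specification.

```python
import json
import math

# Mapping from the JSON string spellings to Python float literals.
_SPECIAL_FLOATS = {"NaN": "nan", "Infinity": "inf", "-Infinity": "-inf"}


def float_to_json(value):
    """Encode a float as a JSON number, or as a string for special values."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


def float_from_json(data):
    """Decode a JSON number or special-value string back to a Python float."""
    if isinstance(data, str):
        return float(_SPECIAL_FLOATS[data])
    return float(data)


# Plain JSON has no NaN literal, so the string form keeps the document valid.
encoded = json.dumps({"fill_value": float_to_json(float("nan"))})
decoded = float_from_json(json.loads(encoded)["fill_value"])

print(encoded)
```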
|
||
|
||
Data types in Zarr version 3 | ||
----------------------------- | ||
|
||
Zarr V3 brings several key changes to how data types are represented: | ||
d-v-b marked this conversation as resolved.
|
||
|
||
- Zarr V3 identifies the basic data types as strings like ``"int8"``, ``"int16"``, etc. | ||
|
||
By contrast, Zarr V2 uses the NumPy character code representation for data types: | ||
In Zarr V2, ``int8`` is represented as ``"|i1"``. | ||
- A Zarr V3 data type does not have endianness. This is a departure from Zarr V2, where multi-byte | ||
data types are defined with endianness information. Instead, Zarr V3 requires that endianness, | ||
where applicable, is specified in the ``codecs`` attribute of array metadata. | ||
- While some Zarr V3 data types are identified by strings, others can be identified by a ``JSON`` | ||
object. For example, consider this specification of a ``datetime`` data type: | ||
|
||
.. code-block:: json | ||
|
||
{ | ||
"name": "numpy.datetime64", | ||
"configuration": { | ||
"unit": "s", | ||
"scale_factor": 10 | ||
} | ||
} | ||
|
||
|
||
Zarr V2 generally uses structured string representations to convey the same information. The | ||
data type given in the previous example would be represented as the string ``">M8[10s]"`` in | ||
Zarr V2. This is more compact, but can be harder to parse. | ||
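The first two differences in the list above can be sketched with a small translation from a Zarr V2 data type string to a Zarr V3 name plus a separate endianness setting. This is a simplified illustration, not zarr's implementation: ``v2_to_v3`` is a hypothetical helper that only covers integer, unsigned integer, and float types, and the metadata fragment assumes the endianness is carried by a ``bytes`` codec with an ``endian`` option, as in the default serializer configuration shown earlier on this page.

```python
# Hypothetical lookup tables for illustration only.
KIND_NAMES = {"i": "int", "u": "uint", "f": "float"}
ENDIAN_NAMES = {"<": "little", ">": "big", "|": None}


def v2_to_v3(dtype_str):
    """Translate e.g. '<i8' into ('int64', 'little').

    The V3 name carries no byte order; the byte order is returned separately.
    """
    byte_order, kind, size = dtype_str[0], dtype_str[1], int(dtype_str[2:])
    name = f"{KIND_NAMES[kind]}{size * 8}"  # item size in bytes -> bits
    return name, ENDIAN_NAMES[byte_order]


name, endian = v2_to_v3("<i8")

# In V3 metadata the data type and the byte order live in different fields.
metadata_fragment = {
    "data_type": name,
    "codecs": [{"name": "bytes", "configuration": {"endian": endian}}],
}

print(metadata_fragment)
```

Single-byte types such as ``"|i1"`` map to a name with no endianness at all, which matches the note that endianness is specified only where applicable.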
|
||
For more about data types in Zarr V3, see the | ||
`V3 specification <https://zarr-specs.readthedocs.io/en/latest/v3/data-types/index.html>`_. | ||
|
||
Data types in Zarr Python | ||
------------------------- | ||
|
||
The two Zarr formats that Zarr Python supports specify data types in different ways: | ||
data types in Zarr version 2 are encoded as NumPy-compatible strings, while data types in Zarr version | ||
3 are encoded as either strings or ``JSON`` objects. | ||
Unlike Zarr V2 data types, Zarr V3 data types have no associated endianness information. | ||
|
||
To abstract over these syntactical and semantic differences, Zarr Python uses a class called | ||
`ZDType <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_ to provide Zarr V2 and Zarr V3 compatibility | ||
routines for "native" data types. In this context, a "native" data type is a Python class, | ||
typically defined in another library, that models an array's data type. For example, ``np.uint8`` is a native | ||
data type defined in NumPy, which Zarr Python wraps with a ``ZDType`` subclass called | ||
`UInt8 <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_. | ||
|
||
Each data type supported by Zarr Python is modeled by a ``ZDType`` subclass, which provides an | ||
API for the following operations: | ||
|
||
- Wrapping / unwrapping a native data type. | ||
- Encoding / decoding a data type to / from Zarr V2 and Zarr V3 array metadata. | ||
- Encoding / decoding a scalar value to / from Zarr V2 and Zarr V3 array metadata. | ||
|
||
|
||
Example Usage | ||
~~~~~~~~~~~~~ | ||
|
||
Create a ``ZDType`` from a native data type: | ||
|
||
.. code-block:: python | ||
|
||
>>> from zarr.core.dtype import Int8 | ||
>>> import numpy as np | ||
>>> int8 = Int8.from_native_dtype(np.dtype('int8')) | ||
|
||
Convert back to native data type: | ||
|
||
.. code-block:: python | ||
|
||
>>> native_dtype = int8.to_native_dtype() | ||
>>> assert native_dtype == np.dtype('int8') | ||
|
||
Get the default scalar value for the data type: | ||
|
||
.. code-block:: python | ||
|
||
>>> default_value = int8.default_scalar() | ||
>>> assert default_value == np.int8(0) | ||
|
||
|
||
Serialize to JSON for Zarr V2 and V3: | ||
|
||
.. code-block:: python | ||
|
||
>>> json_v2 = int8.to_json(zarr_format=2) | ||
>>> json_v2 | ||
'|i1' | ||
>>> json_v3 = int8.to_json(zarr_format=3) | ||
>>> json_v3 | ||
'int8' | ||
|
||
Serialize a scalar value to JSON: | ||
|
||
.. code-block:: python | ||
|
||
>>> json_value = int8.to_json_scalar(42, zarr_format=3) | ||
>>> json_value | ||
42 | ||
|
||
Deserialize a scalar value from JSON: | ||
|
||
.. code-block:: python | ||
|
||
>>> scalar_value = int8.from_json_scalar(42, zarr_format=3) | ||
>>> assert scalar_value == np.int8(42) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,6 +8,7 @@ User guide | |
|
||
installation | ||
arrays | ||
data_types | ||
groups | ||
attributes | ||
storage | ||
|