Commit 33e4ca5

resolves #27 by auto switching blosc behaviour
1 parent 8177101 commit 33e4ca5

9 files changed

+1104
-1950
lines changed

docs/api/storage.rst

Lines changed: 4 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -2,11 +2,10 @@ Storage (``zarr.storage``)
22
==========================
33
.. module:: zarr.storage
44

5-
This module contains a single :class:`DirectoryStore` class providing a
6-
``MutableMapping`` interface to a directory on the file system.
7-
8-
Note that any object implementing the ``MutableMapping`` interface can be used
9-
as a Zarr array store.
5+
This module contains a single :class:`DirectoryStore` class providing
6+
a ``MutableMapping`` interface to a directory on the file
7+
system. However, note that any object implementing the
8+
``MutableMapping`` interface can be used as a Zarr array store.
109

1110
.. autofunction:: init_store
1211

docs/index.rst

Lines changed: 12 additions & 21 deletions
@@ -1,7 +1,5 @@
11
.. zarr documentation master file, created by
22
sphinx-quickstart on Mon May 2 21:40:09 2016.
3-
You can adapt this file completely to your liking, but it should at least
4-
contain the root `toctree` directive.
53
64
Zarr
75
====
@@ -14,26 +12,16 @@ chunked, compressed, N-dimensional arrays.
1412
* Download: https://pypi.python.org/pypi/zarr
1513
* Release notes: https://github.com/alimanfoo/zarr/releases
1614

17-
Motivation
15+
Highlights
1816
----------
1917

20-
Zarr is motivated by the desire to work interactively with
21-
multi-dimensional scientific datasets too large to fit into memory on
22-
commodity desktop or laptop computers. Interactive data analysis
23-
requires fast array storage, because an interactive session may
24-
involve creation and manipulation of many intermediate data
25-
structures. Faster storage provides more freedom to explore a rich and
26-
complex dataset in a variety of different ways. The Blosc compression
27-
library provides extremely fast multi-threaded compression and
28-
decompression, and so a primary motivation for Zarr was to bring
29-
together Blosc with multi-dimensional arrays in a convenient way.
30-
31-
A second motivation is to provide array storage that is convenient and
32-
well-suited to use in parallel computations. This means supporting
33-
concurrent data access from multiple threads or processes, without
34-
unnecessary locking or exclusion, to maximise the possibility for work
35-
to be carried out in parallel.
36-
18+
* Create N-dimensional arrays with any NumPy dtype.
19+
* Chunk arrays along any dimension.
20+
* Compress chunks using the fast Blosc_ meta-compressor or alternatively using zlib, BZ2 or LZMA.
21+
* Store arrays in memory, on disk, inside a Zip file, on S3, ... pretty much anywhere you like.
22+
* Read an array concurrently from multiple threads or processes.
23+
* Write to an array concurrently from multiple threads or processes.
24+
3725
Status
3826
------
3927

@@ -79,7 +67,8 @@ Acknowledgments
7967
Zarr bundles the `c-blosc <https://github.com/Blosc/c-blosc>`_
8068
library and uses it as the default compressor.
8169

82-
Zarr is inspired by and borrows code from `bcolz <http://bcolz.blosc.org/>`_.
70+
Zarr is inspired by `HDF5 <https://www.hdfgroup.org/HDF5/>`_, `h5py
71+
<http://www.h5py.org/>`_ and `bcolz <http://bcolz.blosc.org/>`_.
8372

8473
Development of this package is supported by the
8574
`MRC Centre for Genomics and Global Health <http://www.cggh.org>`_.
@@ -90,3 +79,5 @@ Indices and tables
9079
* :ref:`genindex`
9180
* :ref:`modindex`
9281
* :ref:`search`
82+
83+
.. _Blosc: http://www.blosc.org/

docs/spec/v1.rst

Lines changed: 102 additions & 89 deletions
@@ -1,38 +1,42 @@
11
Zarr storage specification version 1
22
====================================
33

4-
This document provides a technical specification of the format used for
5-
storing a Zarr array. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
6-
"SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
7-
this document are to be interpreted as described in
8-
`RFC 2119 <https://www.ietf.org/rfc/rfc2119.txt>`_.
4+
This document provides a technical specification of the format used
5+
for storing a Zarr array. The key words "MUST", "MUST NOT",
6+
"REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT",
7+
"RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
8+
interpreted as described in `RFC 2119
9+
<https://www.ietf.org/rfc/rfc2119.txt>`_.
910

1011
Storage
1112
-------
1213

13-
A Zarr array can be stored in any storage system that provides a key/value
14-
interface, where a key is an ASCII string and a value is an arbitrary
15-
sequence of bytes, and the supported operations are read (get the sequence
16-
of bytes associated with a given key), write (set the sequence of bytes
17-
associated with a given key) and delete (remove a key/value pair).
14+
A Zarr array can be stored in any storage system that provides a
15+
key/value interface, where a key is an ASCII string and a value is an
16+
arbitrary sequence of bytes, and the supported operations are read
17+
(get the sequence of bytes associated with a given key), write (set
18+
the sequence of bytes associated with a given key) and delete (remove
19+
a key/value pair).
1820

19-
For example, a directory in a file system can provide this interface, where
20-
keys are file names, values are file contents, and files can be read, written
21-
or deleted. Similarly, an S3 bucket can provide this interface, where
22-
keys are resource names, values are resource contents, and resources can be
23-
read, written or deleted via HTTP.
21+
For example, a directory in a file system can provide this interface,
22+
where keys are file names, values are file contents, and files can be
23+
read, written or deleted. Equally, an S3 bucket can provide this
24+
interface, where keys are resource names, values are resource
25+
contents, and resources can be read, written or deleted via HTTP.
2426

25-
Below an "array store" refers to any system implementing this interface.
27+
Below an "array store" refers to any system implementing this
28+
interface.
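
As an illustration of the interface described above (not part of this commit), the three required operations map directly onto Python's ``MutableMapping`` protocol. The helper name ``roundtrip`` and the key and value used here are hypothetical, chosen only for the sketch.

```python
from collections.abc import MutableMapping


def roundtrip(store: MutableMapping, key: str, value: bytes) -> bytes:
    """Exercise the three operations an array store must support."""
    store[key] = value      # write: set the bytes associated with a key
    result = store[key]     # read: get the bytes associated with a key
    del store[key]          # delete: remove the key/value pair
    return result


# A plain dict already satisfies the interface, so it can stand in for a store.
store = {}
data = roundtrip(store, '0.0', b'\x01\x02\x03')
```

A directory-backed or S3-backed store would expose the same three operations, with file or resource names as keys.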
2629

2730
Metadata
2831
--------
2932

30-
Each array requires essential configuration metadata to be stored, enabling
31-
correct interpretation of the stored data. This metadata is encoded using
32-
JSON and stored as the value of the 'meta' key within an array store.
33+
Each array requires essential configuration metadata to be stored,
34+
enabling correct interpretation of the stored data. This metadata is
35+
encoded using JSON and stored as the value of the 'meta' key within an
36+
array store.
3337

34-
The metadata resource is a JSON object. The following keys MUST be present
35-
within the object:
38+
The metadata resource is a JSON object. The following keys MUST be
39+
present within the object:
3640

3741
zarr_format
3842
An integer defining the version of the storage specification to which the
@@ -59,15 +63,15 @@ order
5963
array. 'C' means row-major order, i.e., the last dimension varies fastest;
6064
'F' means column-major order, i.e., the first dimension varies fastest.
6165

62-
Other keys MAY be present within the metadata object however they MUST NOT
63-
alter the interpretation of the required fields defined above.
66+
Other keys MAY be present within the metadata object however they MUST
67+
NOT alter the interpretation of the required fields defined above.
6468

65-
For example, the JSON object below defines a 2-dimensional array of 64-bit
66-
little-endian floating point numbers with 10000 rows and 10000 columns,
67-
divided into chunks of 1000 rows and 1000 columns (so there will be 100
68-
chunks in total arranged in a 10 by 10 grid). Within each chunk the data
69-
are laid out in C contiguous order, and each chunk is compressed using the
70-
Blosc compression library::
69+
For example, the JSON object below defines a 2-dimensional array of
70+
64-bit little-endian floating point numbers with 10000 rows and 10000
71+
columns, divided into chunks of 1000 rows and 1000 columns (so there
72+
will be 100 chunks in total arranged in a 10 by 10 grid). Within each
73+
chunk the data are laid out in C contiguous order, and each chunk is
74+
compressed using the Blosc compression library::
7175

7276
{
7377
"chunks": [
@@ -94,33 +98,36 @@ Data type encoding
9498
~~~~~~~~~~~~~~~~~~
9599

96100
Simple data types are encoded within the array metadata resource as a
97-
string, following the `NumPy array protocol type string (typestr) format
101+
string, following the `NumPy array protocol type string (typestr)
102+
format
98103
<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html>`_. The
99-
format consists of 3 parts: a character describing the byteorder of the
100-
data (``<``: little-endian, ``>``: big-endian, ``|``: not-relevant), a
101-
character code giving the basic type of the array, and an integer providing
102-
the number of bytes the type uses. The byte order MUST be specified. E.g.,
103-
``"<f8"``, ``">i4"``, ``"|b1"`` and ``"|S12"`` are valid data types.
104-
105-
Structure data types (i.e., with multiple named fields) are encoded as a
106-
list of two-element lists, following `NumPy array protocol type descriptions
107-
(descr) <http://docs.scipy.org/doc/numpy/reference/arrays.interface.html#>`_.
108-
For example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]``
109-
defines a data type composed of three single-byte unsigned integers labelled
110-
'r', 'g' and 'b'.
104+
format consists of 3 parts: a character describing the byteorder of
105+
the data (``<``: little-endian, ``>``: big-endian, ``|``:
106+
not-relevant), a character code giving the basic type of the array,
107+
and an integer providing the number of bytes the type uses. The byte
108+
order MUST be specified. E.g., ``"<f8"``, ``">i4"``, ``"|b1"`` and
109+
``"|S12"`` are valid data types.
110+
111+
Structure data types (i.e., with multiple named fields) are encoded as
112+
a list of two-element lists, following `NumPy array protocol type
113+
descriptions (descr)
114+
<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html#>`_.
115+
For example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b",
116+
"|u1"]]`` defines a data type composed of three single-byte unsigned
117+
integers labelled 'r', 'g' and 'b'.
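
The two encodings above can be decoded with a few lines of standard-library Python. This is a minimal sketch, assuming only the three-part typestr form described in this section; the function names ``parse_typestr`` and ``parse_dtype`` are hypothetical and are not part of the Zarr API.

```python
import json
import re

# One byte-order character, one type-code character, then the byte count.
_TYPESTR = re.compile(r'([<>|])([A-Za-z])(\d+)')


def parse_typestr(typestr):
    """Split a typestr such as '<f8' into (byteorder, kind, size)."""
    match = _TYPESTR.fullmatch(typestr)
    if match is None:
        # Covers a missing byte order, which the text says MUST be specified.
        raise ValueError('invalid typestr: %r' % (typestr,))
    byteorder, kind, size = match.groups()
    return byteorder, kind, int(size)


def parse_dtype(encoded):
    """Decode a simple type (string) or structured type (list of pairs)."""
    if isinstance(encoded, str):
        return parse_typestr(encoded)
    return [(name, parse_typestr(typestr)) for name, typestr in encoded]


simple = parse_dtype(json.loads('"<f8"'))
rgb = parse_dtype(json.loads('[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]'))
```

Here ``simple`` is ``('<', 'f', 8)`` and ``rgb`` is a list of three ``(name, ('|', 'u', 1))`` pairs.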
111118

112119
Chunks
113120
------
114121

115-
Each chunk of the array is compressed by passing the raw bytes for the chunk
116-
through the primary compression library to obtain a new sequence of bytes
117-
comprising the compressed chunk data. No header is added to the compressed
118-
bytes or any other modification made. The internal structure of the
119-
compressed bytes will depend on which primary compressor was used. For
120-
example, the
121-
`Blosc compressor <https://github.com/Blosc/c-blosc/blob/master/README_HEADER.rst>`_
122-
produces a sequence of bytes that begins with a 16-byte header followed by
123-
compressed data.
122+
Each chunk of the array is compressed by passing the raw bytes for the
123+
chunk through the primary compression library to obtain a new sequence
124+
of bytes comprising the compressed chunk data. No header is added to
125+
the compressed bytes or any other modification made. The internal
126+
structure of the compressed bytes will depend on which primary
127+
compressor was used. For example, the `Blosc compressor
128+
<https://github.com/Blosc/c-blosc/blob/master/README_HEADER.rst>`_
129+
produces a sequence of bytes that begins with a 16-byte header
130+
followed by compressed data.
124131

125132
The compressed sequence of bytes for each chunk is stored under a key
126133
formed from the index of the chunk within the grid of chunks
@@ -133,28 +140,30 @@ data for rows 0-1000 and columns 0-1000 and is stored under the key
133140
'0.0'; the chunk with indices (2, 4) provides data for rows 2000-3000
134141
and columns 4000-5000 and is stored under the key '2.4'; etc.
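
The key and extent arithmetic in these examples can be sketched as follows. The helper names are hypothetical, and the '.'-joined key format is inferred from the '0.0' and '2.4' examples above.

```python
def chunk_key(indices):
    """Form the store key for a chunk from its grid indices."""
    return '.'.join(str(i) for i in indices)


def chunk_extent(indices, chunks):
    """Return the half-open (start, stop) range covered along each dimension."""
    return tuple((i * c, (i + 1) * c) for i, c in zip(indices, chunks))


key = chunk_key((2, 4))
extent = chunk_extent((2, 4), (1000, 1000))
```

With chunks of 1000 by 1000, the chunk at grid position (2, 4) maps to the key '2.4' and covers rows 2000-3000 and columns 4000-5000, matching the text.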
135142

136-
There is no need for all chunks to be present within an array store. If a
137-
chunk is not present then it is considered to be in an uninitialized state.
138-
An uninitialized chunk MUST be treated as if it was uniformly filled with the
139-
value of the 'fill_value' field in the array metadata. If the 'fill_value'
140-
field is ``null`` then the contents of the chunk are undefined.
143+
There is no need for all chunks to be present within an array
144+
store. If a chunk is not present then it is considered to be in an
145+
uninitialized state. An uninitialized chunk MUST be treated as if it
146+
was uniformly filled with the value of the 'fill_value' field in the
147+
array metadata. If the 'fill_value' field is ``null`` then the
148+
contents of the chunk are undefined.
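
A reader can apply this rule by falling back to the fill value whenever a chunk key is absent. The sketch below is illustrative only: it assumes zlib as the primary compressor and a '<i4' data type (as in the example later in this document), and the function name ``read_chunk`` is hypothetical.

```python
import struct
import zlib


def read_chunk(store, key, n_items, fill_value):
    """Return the items of one chunk of little-endian 32-bit integers."""
    try:
        compressed = store[key]
    except KeyError:
        # Uninitialized chunk: behave as if uniformly filled with fill_value.
        # With fill_value None (JSON null) the contents are undefined; using
        # 0 is an arbitrary choice made for this sketch.
        return [0 if fill_value is None else fill_value] * n_items
    # No extra header is stored, so the value is passed to zlib unchanged.
    raw = zlib.decompress(compressed)
    return list(struct.unpack('<%di' % n_items, raw))


store = {'0.0': zlib.compress(struct.pack('<4i', 1, 1, 1, 1), 1)}
present = read_chunk(store, '0.0', 4, fill_value=42)
missing = read_chunk(store, '0.1', 4, fill_value=42)
```

The chunk stored under '0.0' decodes to its four stored values, while the absent '0.1' chunk reads as four copies of the fill value.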
141149

142-
Note that all chunks in an array have the same shape. If the length of any
143-
array dimension is not exactly divisible by the length of the corresponding
144-
chunk dimension then some chunks will overhang the edge of the array. The
145-
contents of any chunk region falling outside the array are undefined.
150+
Note that all chunks in an array have the same shape. If the length of
151+
any array dimension is not exactly divisible by the length of the
152+
corresponding chunk dimension then some chunks will overhang the edge
153+
of the array. The contents of any chunk region falling outside the
154+
array are undefined.
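
The size of the chunk grid, and how far the final chunks overhang the array, follows from ceiling division. A minimal sketch, with hypothetical helper names:

```python
def chunk_grid_shape(shape, chunks):
    """Number of chunks along each dimension, rounding up."""
    # -(-n // c) is ceiling division using integer arithmetic only.
    return tuple(-(-n // c) for n, c in zip(shape, chunks))


def overhang(shape, chunks):
    """Items by which the last chunk extends past the array edge, per dimension."""
    grid = chunk_grid_shape(shape, chunks)
    return tuple(g * c - n for g, c, n in zip(grid, chunks, shape))


grid = chunk_grid_shape((10000, 10000), (1000, 1000))
ragged_grid = chunk_grid_shape((25,), (10,))
ragged_overhang = overhang((25,), (10,))
```

For the 10000 by 10000 array used earlier the grid is 10 by 10 with no overhang. A dimension of length 25 with chunk length 10 needs 3 chunks, and the last one extends 5 items past the edge.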
146155

147156
Attributes
148157
----------
149158

150-
Each array can also be associated with custom attributes, which are simple
151-
key/value items with application-specific meaning. Custom attributes are
152-
encoded as a JSON object and stored under the 'attrs' key within an array
153-
store. Even if the attributes are empty, the 'attrs' key MUST be present
154-
within an array store.
159+
Each array can also be associated with custom attributes, which are
160+
simple key/value items with application-specific meaning. Custom
161+
attributes are encoded as a JSON object and stored under the 'attrs'
162+
key within an array store. Even if the attributes are empty, the
163+
'attrs' key MUST be present within an array store.
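
One way to satisfy this requirement is to write an empty JSON object when the store is first set up. The sketch below uses a plain dict as the store; the helper name ``encode_attrs`` is hypothetical, and the ``indent`` and ``sort_keys`` settings are assumptions chosen to resemble the formatting shown in the example later in this document.

```python
import json


def encode_attrs(attrs):
    """Encode custom attributes as the bytes stored under the 'attrs' key."""
    return json.dumps(attrs, indent=4, sort_keys=True).encode('ascii')


store = {}
store['attrs'] = encode_attrs({})   # present even when there are no attributes
updated = encode_attrs({'foo': 42, 'bar': 'apples'})
```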
155164

156-
For example, the JSON object below encodes three attributes named 'foo', 'bar'
157-
and 'baz'::
165+
For example, the JSON object below encodes three attributes named
166+
'foo', 'bar' and 'baz'::
158167

159168
{
160169
"foo": 42,
@@ -165,13 +174,16 @@ and 'baz'::
165174
Example
166175
-------
167176

168-
Below is an example of storing a Zarr array within a directory called
169-
'example.zarr' on the local file system::
177+
Below is an example of storing a Zarr array, using a directory on the
178+
local file system as storage.
179+
180+
Initialize the store::
170181

171182
>>> import zarr
172-
>>> z = zarr.open('example.zarr', mode='w', shape=(20, 20),
173-
... chunks=(10, 10), dtype='i4', fill_value=42,
174-
... compression='zlib', compression_opts=1)
183+
>>> store = zarr.DirectoryStore('example.zarr')
184+
>>> zarr.init_store(store, shape=(20, 20), chunks=(10, 10),
185+
... dtype='i4', fill_value=42, compression='zlib',
186+
... compression_opts=1, overwrite=True)
175187

176188
No chunks are initialized yet, so only the 'meta' and 'attrs' keys are
177189
present::
@@ -205,25 +217,9 @@ Inspect the array attributes::
205217
>>> print(open('example.zarr/attrs').read())
206218
{}
207219

208-
Modify the array attributes::
209-
210-
>>> z.attrs['foo'] = 42
211-
>>> z.attrs['bar'] = 'apples'
212-
>>> z.attrs['baz'] = [1, 2, 3, 4]
213-
>>> print(open('example.zarr/attrs').read())
214-
{
215-
"bar": "apples",
216-
"baz": [
217-
1,
218-
2,
219-
3,
220-
4
221-
],
222-
"foo": 42
223-
}
224-
225220
Set some data::
226221

222+
>>> z = zarr.Array(store)
227223
>>> z[0:10, 0:10] = 1
228224
>>> sorted(os.listdir('example.zarr'))
229225
['0.0', 'attrs', 'meta']
@@ -247,3 +243,20 @@ Manually decompress a single chunk for illustration::
247243
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
248244
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
249245
1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
246+
247+
Modify the array attributes::
248+
249+
>>> z.attrs['foo'] = 42
250+
>>> z.attrs['bar'] = 'apples'
251+
>>> z.attrs['baz'] = [1, 2, 3, 4]
252+
>>> print(open('example.zarr/attrs').read())
253+
{
254+
"bar": "apples",
255+
"baz": [
256+
1,
257+
2,
258+
3,
259+
4
260+
],
261+
"foo": 42
262+
}
