encoding of boolean dtype in zarr #2937

rabernat · 2019-05-03T03:53:27Z

I want to store an array with 1364688000 boolean values in zarr. I will have to read this array many times, so I am trying to do it as efficiently as possible.

I have noticed that, if we try to write boolean data to zarr from xarray, zarr stores it as int8 (`i1`). ~~This means we are using 8x more memory than we actually need.~~
In researching this, I actually learned that numpy bools use a full byte of memory 😲!
However, we could still improve performance (albeit very marginally) by skipping the unnecessary dtype encoding that happens here.
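As a quick check of the point about memory, here is a minimal sketch using only NumPy. The array is a small stand-in for the real one, and the names `mask`, `packed`, and `restored` are illustrative, not part of xarray or zarr.

```python
import numpy as np

# Small stand-in for the 1364688000-element boolean array.
mask = np.zeros(1_000, dtype=bool)
mask[::3] = True

# NumPy spends one full byte per boolean element, the same as int8.
print(mask.itemsize, mask.nbytes, mask.astype("i1").nbytes)

# Packing 8 booleans per byte is possible, but the result is a uint8
# array that has to be unpacked before it can be used as a mask again.
packed = np.packbits(mask)
restored = np.unpackbits(packed, count=mask.size).astype(bool)
print(packed.nbytes, np.array_equal(restored, mask))
```

So casting bool to int8 costs a copy and an extra attribute, but no additional bytes per element.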

Example

import xarray as xr
import zarr
for dtype in ['f8', 'i4', 'bool']:
    ds = xr.DataArray([1, 0]).astype(dtype).to_dataset(name='foo')
    store = {}
    ds.to_zarr(store)
    za = zarr.open(store)['foo']
    print(dtype, za.dtype, za.attrs.get('dtype'))

gives

f8 float64 None
i4 int32 None
bool int8 bool

So it seems like, during serialization of bool data, xarray is converting the data to int8 and then adding a {'dtype': 'bool'} to the attributes as encoding. When the data is read back, this gets decoded and the data is coerced back to bool.
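The round trip described above can be sketched with plain NumPy and a dict of attributes. This is an illustration of the idea only: `encode_bool` and `decode_bool` are hypothetical helpers, not xarray functions.

```python
import numpy as np


def encode_bool(data, attrs):
    """Hypothetical encode step: cast bool to int8 and record the dtype."""
    if data.dtype == bool and "dtype" not in attrs:
        attrs = {**attrs, "dtype": "bool"}
        data = data.astype("i1")
    return data, attrs


def decode_bool(data, attrs):
    """Hypothetical decode step: restore bool if the attribute says so."""
    if attrs.get("dtype") == "bool":
        attrs = {k: v for k, v in attrs.items() if k != "dtype"}
        data = data.astype(bool)
    return data, attrs


stored, stored_attrs = encode_bool(np.array([True, False]), {})
print(stored.dtype, stored_attrs)      # what ends up in the store
loaded, loaded_attrs = decode_bool(stored, stored_attrs)
print(loaded.dtype, loaded_attrs)      # what the user gets back
```

Both directions make a full copy of the data, which is the overhead this issue is about.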

Problem description

Since zarr is fully capable of storing bool data directly, we should not need to encode the data as int8.

I think this happens in encode_cf_variable:

xarray/xarray/conventions.py

Line 236 in 612d390

var = maybe_encode_bools(var)

which calls maybe_encode_bools:

xarray/xarray/conventions.py

Lines 105 to 112 in 612d390

    
def maybe_encode_bools(var):
    if ((var.dtype == np.bool) and
            ('dtype' not in var.encoding) and ('dtype' not in var.attrs)):
        dims, data, attrs, encoding = _var_as_tuple(var)
        attrs['dtype'] = 'bool'
        data = data.astype(dtype='i1', copy=True)
        var = Variable(dims, data, attrs, encoding)
    return var

So maybe we make the boolean encoding optional?

Output of `xr.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 09:07:38)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.17.1.el7.centos.plus.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.8.18
libnetcdf: 4.4.1.1

xarray: 0.12.1
pandas: 0.20.3
numpy: 1.13.3
scipy: 1.1.0
netCDF4: 1.3.0
pydap: None
h5netcdf: 0.5.0
h5py: 2.7.1
Nio: None
zarr: 2.3.1
cftime: None
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 0.19.0+3.g064ebb1
distributed: 1.21.8
matplotlib: 3.0.3
cartopy: 0.16.0
seaborn: 0.8.1
setuptools: 36.6.0
pip: 9.0.1
conda: None
pytest: 3.2.1
IPython: 6.2.1
sphinx: None


shoyer · 2019-05-06T15:08:41Z

So maybe we make the boolean encoding optional?

Sounds good to me! We clearly don't need to be doing this with data stored in zarr.

joshmoore · 2020-07-01T11:16:40Z

@rabernat: I don't suppose you've found a workaround for this?

rabernat · 2020-07-01T19:14:35Z

My approach here was to use compression and filters to minimize the on-disk storage. Here is my .zarray for the dataset in question:

{
    "chunks": [
        1,
        13,
        4320,
        4320
    ],
    "compressor": {
        "check": -1,
        "filters": [
            {
                "dist": 1,
                "id": 3
            },
            {
                "id": 33,
                "preset": 1
            }
        ],
        "format": 3,
        "id": "lzma",
        "preset": null
    },
    "dtype": "|i1",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [
        90,
        13,
        4320,
        4320
    ],
    "zarr_format": 2
}

This does not solve the in-memory problem, but that's a numpy issue.
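For readers unfamiliar with the compressor settings in the .zarray above: the numeric ids match constants in Python's standard `lzma` module (format 3 is `FORMAT_RAW`, filter 3 is `FILTER_DELTA`, filter 33 is `FILTER_LZMA2`). The sketch below applies the same filter chain to a made-up int8 mask using only the standard library. The sample bytes are an assumption for illustration, and the real compression ratio depends entirely on the data.

```python
import lzma

# Same filter chain as in the .zarray: delta (dist=1), then LZMA2 (preset=1).
filters = [
    {"id": lzma.FILTER_DELTA, "dist": 1},
    {"id": lzma.FILTER_LZMA2, "preset": 1},
]

# Made-up int8 mask: one byte per value, only 0s and 1s, in repeating runs.
raw = bytes([0, 0, 0, 1, 1, 1, 1, 0] * 10_000)

compressed = lzma.compress(raw, format=lzma.FORMAT_RAW, filters=filters)
restored = lzma.decompress(compressed, format=lzma.FORMAT_RAW, filters=filters)

print(len(raw), len(compressed), restored == raw)
```

On highly repetitive 0/1 data like this, the compressed size is a small fraction of the raw size, which is why the on-disk cost of int8 versus a packed representation matters much less than the in-memory cost.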

jhamman · 2025-03-20T06:23:09Z

With Zarr v3 (and Zarr-Python 3), we are now able to write native boolean types:

In [11]: import xarray as xr
    ...: import zarr
    ...: for dtype in ['f8', 'i4', 'bool']:
    ...:     ds = xr.DataArray([1, 0]).astype(dtype).to_dataset(name='foo')
    ...:     store = {}
    ...:     ds.to_zarr(store, consolidated=False)
    ...:     za = zarr.open(store)['foo']
    ...:     print(dtype, za.dtype, za.attrs.get('dtype'))
    ...: 
f8 float64 None
i4 int32 None
bool bool None

joshmoore mentioned this issue Jun 28, 2020

Introduce 6D mask storage ome/omero-cli-zarr#9

Merged

amatsukawa mentioned this issue Jan 19, 2021

Reading and writing a zarr dataset multiple times casts bools to int8 #4826

Closed

dcherian added the topic-zarr Related to zarr storage library label Apr 9, 2022

jhamman closed this as completed Mar 20, 2025
