Commit c4fbcea

Read small integers as float32, not float64
AKA the "I just wasted 4.6 TB of memory" patch.
1 parent f3deb2f commit c4fbcea

2 files changed, +19 -5 lines changed

doc/whats-new.rst

+4 -3
@@ -50,6 +50,9 @@ Enhancements
 - :py:func:`~plot.line()` learned to draw multiple lines if provided with a
   2D variable.
   By `Deepak Cherian <https://github.com/dcherian>`_.
+- Reduce memory usage when decoding a variable with a scale_factor, by
+  converting 8-bit and 16-bit integers to float32 instead of float64.
+  By `Zac Hatfield-Dodds <https://github.com/Zac-HD>`_.
 
 .. _Zarr: http://zarr.readthedocs.io/
 
@@ -66,11 +69,9 @@ Bug fixes
 - Fixed encoding of multi-dimensional coordinates in
   :py:meth:`~Dataset.to_netcdf` (:issue:`1763`).
   By `Mike Neish <https://github.com/neishm>`_.
-
 - Bug fix in open_dataset(engine='pydap') (:issue:`1775`)
   By `Keisuke Fujii <https://github.com/fujiisoup>`_.
-
-- Bug fix in vectorized assignment (:issue:`1743`, `1744`).
+- Bug fix in vectorized assignment (:issue:`1743`, :issue:`1744`).
   Now item assignment to :py:meth:`~DataArray.__setitem__` checks
 - Bug fix in vectorized assignment (:issue:`1743`, :issue:`1744`).
   Now item assignment to :py:meth:`DataArray.__setitem__` checks
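
Note on the enhancement entry above: a rough, illustrative check (not part of the commit) of why 8-bit and 16-bit integers can be widened to float32 without loss, and what that saves. The variable names are made up for this sketch. It assumes only NumPy, and it covers the integer-to-float conversion alone, not the rounding introduced by multiplying with a scale_factor.

```python
import numpy as np

# Every value an int16 can hold (65536 of them).
info = np.iinfo(np.int16)
all_int16 = np.arange(info.min, info.max + 1).astype(np.int16)

as_f32 = all_int16.astype(np.float32)
as_f64 = all_int16.astype(np.float64)

# float32 has a 24-bit significand, so every 16-bit integer converts exactly.
int16_is_exact = np.array_equal(as_f32.astype(np.float64), as_f64)

# float32 stores 4 bytes per element, float64 stores 8.
memory_ratio = as_f64.nbytes / as_f32.nbytes

# Counter-example for wider integers: 2**24 + 1 needs 25 significant bits
# and is rounded when stored as float32.
int32_is_exact = int(np.float32(2 ** 24 + 1)) == 2 ** 24 + 1

print(int16_is_exact, memory_ratio, int32_is_exact)
```

The last check suggests why the change is limited to integer types of at most two bytes: 32-bit integers can exceed what a float32 significand holds exactly.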

xarray/coding/variables.py

+15 -2
@@ -212,11 +212,24 @@ class CFScaleOffsetCoder(VariableCoder):
         decode_values = encoded_values * scale_factor + add_offset
     """
 
+    @staticmethod
+    def _choose_float_dtype(data, attributes):
+        # We default to en/decoding as float64, but for small integer types
+        # and no offset it's usually safe to use float32, which saves a lot
+        # of memory for eg. TB-scale satellite imagery collections.
+        if data.dtype.itemsize <= 2 and \
+                np.issubdtype(data.dtype, np.integer) and \
+                'add_offset' not in attributes and \
+                2 ** -23 < float(attributes.get('scale_factor', 1)) < 2 ** 8:
+            return np.float32
+        return np.float64
+
     def encode(self, variable, name=None):
         dims, data, attrs, encoding = unpack_for_encoding(variable)
 
         if 'scale_factor' in encoding or 'add_offset' in encoding:
-            data = data.astype(dtype=np.float64, copy=True)
+            dtype = self._choose_float_dtype(data, encoding)
+            data = data.astype(dtype=dtype, copy=True)
         if 'add_offset' in encoding:
             data -= pop_to(encoding, attrs, 'add_offset', name=name)
         if 'scale_factor' in encoding:
@@ -230,7 +243,7 @@ def decode(self, variable, name=None):
         if 'scale_factor' in attrs or 'add_offset' in attrs:
             scale_factor = pop_to(attrs, encoding, 'scale_factor', name=name)
             add_offset = pop_to(attrs, encoding, 'add_offset', name=name)
-            dtype = np.float64
+            dtype = self._choose_float_dtype(data, attrs)
             transform = partial(_scale_offset_decoding,
                                 scale_factor=scale_factor,
                                 add_offset=add_offset,
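
For readers who want to try the selection rule outside xarray, here is a minimal standalone sketch of the condition added in `_choose_float_dtype`. The function name `choose_float_dtype` and the fact that it takes a dtype rather than a data array are assumptions made for illustration. The four conditions are copied from the diff; everything else is hypothetical.

```python
import numpy as np


def choose_float_dtype(dtype, attributes):
    """Hypothetical standalone restatement of the check in the diff above."""
    dtype = np.dtype(dtype)
    if (dtype.itemsize <= 2
            and np.issubdtype(dtype, np.integer)
            and 'add_offset' not in attributes
            and 2 ** -23 < float(attributes.get('scale_factor', 1)) < 2 ** 8):
        return np.float32
    # Fall back to the previous behaviour in every other case.
    return np.float64


cases = [
    (np.int16, {'scale_factor': 0.01}),                     # small int, modest scale
    (np.uint8, {}),                                         # scale_factor defaults to 1
    (np.int16, {'scale_factor': 0.01, 'add_offset': 5.0}),  # offset present
    (np.int32, {'scale_factor': 0.01}),                     # integer too wide
    (np.int16, {'scale_factor': 1e-9}),                     # scale below 2 ** -23
    (np.float16, {'scale_factor': 0.01}),                   # not an integer type
]

for dtype, attrs in cases:
    print(np.dtype(dtype).name, attrs, '->', choose_float_dtype(dtype, attrs).__name__)
```

Only the first two cases select float32. An offset, a wider integer type, an out-of-range scale factor, or a non-integer type each falls back to float64.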
