Commit f659398

Read small integers as float32, not float64
AKA the "I just wasted 4.6 TB of memory" patch.
1 parent f2ea7b6 · commit f659398

2 files changed: +22 −5 lines

doc/whats-new.rst

+4 −3
@@ -51,6 +51,9 @@ Enhancements
 - :py:func:`~plot.line()` learned to draw multiple lines if provided with a
   2D variable.
   By `Deepak Cherian <https://github.com/dcherian>`_.
+- Reduce memory usage when decoding a variable with a scale_factor, by
+  converting 8-bit and 16-bit integers to float32 instead of float64 (:pull:`1840`).
+  By `Zac Hatfield-Dodds <https://github.com/Zac-HD>`_.

 .. _Zarr: http://zarr.readthedocs.io/

@@ -67,11 +70,9 @@ Bug fixes
 - Fixed encoding of multi-dimensional coordinates in
   :py:meth:`~Dataset.to_netcdf` (:issue:`1763`).
   By `Mike Neish <https://github.com/neishm>`_.
-
 - Bug fix in open_dataset(engine='pydap') (:issue:`1775`)
   By `Keisuke Fujii <https://github.com/fujiisoup>`_.
-
-- Bug fix in vectorized assignment (:issue:`1743`, `1744`).
+- Bug fix in vectorized assignment (:issue:`1743`, :issue:`1744`).
   Now item assignment to :py:meth:`~DataArray.__setitem__` checks
 - Bug fix in vectorized assignment (:issue:`1743`, :issue:`1744`).
   Now item assignment to :py:meth:`DataArray.__setitem__` checks
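
The new entry above describes behaviour users see through the normal CF-decoding path. As a rough illustration (mine, not part of the commit), an 8- or 16-bit integer variable with a scale_factor attribute now decodes to float32, halving the memory of the decoded values:

    import numpy as np
    import xarray as xr

    # One int16 variable carrying a CF scale_factor attribute, as it would
    # look when read from disk with decoding switched off.
    raw = xr.Dataset(
        {"t": ("x", np.arange(10, dtype="int16"), {"scale_factor": 0.01})}
    )

    decoded = xr.decode_cf(raw)
    print(decoded["t"].dtype)   # float32 after this patch (previously float64)
    print(decoded["t"].nbytes)  # 40 bytes for 10 values, instead of 80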

xarray/coding/variables.py

+18 −2
@@ -212,11 +212,27 @@ class CFScaleOffsetCoder(VariableCoder):
         decode_values = encoded_values * scale_factor + add_offset
     """

+    @staticmethod
+    def _choose_float_dtype(data, has_offset):
+        """Return a float dtype sufficient to losslessly represent `data`."""
+        # float32 can exactly represent all integers up to 24 bits
+        if data.dtype.itemsize <= 2 and np.issubdtype(data.dtype, np.integer):
+            # A scale factor is entirely safe (vanishing into the mantissa),
+            # but a large integer offset could lead to loss of precision.
+            # Sensitivity analysis can be tricky, so we just use a float64
+            # if there's any offset at all - better unoptimised than wrong!
+            if not has_offset:
+                return np.float32
+        # For all other types and circumstances, we just use float64.
+        # (safe because eg. complex numbers are not supported in NetCDF)
+        return np.float64
+
     def encode(self, variable, name=None):
         dims, data, attrs, encoding = unpack_for_encoding(variable)

         if 'scale_factor' in encoding or 'add_offset' in encoding:
-            data = data.astype(dtype=np.float64, copy=True)
+            dtype = self._choose_float_dtype(data, 'add_offset' in encoding)
+            data = data.astype(dtype=dtype, copy=True)
             if 'add_offset' in encoding:
                 data -= pop_to(encoding, attrs, 'add_offset', name=name)
             if 'scale_factor' in encoding:
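
The comments in the new _choose_float_dtype helper above rest on two claims: float32 has a 24-bit significand, so every 8- or 16-bit integer survives the cast exactly, while a large add_offset may not. A quick numpy check of both (mine, not part of the commit):

    import numpy as np

    # Every int16 value round-trips exactly through float32.
    ints = np.arange(-32768, 32768, dtype=np.int16)
    assert np.array_equal(ints.astype(np.float32).astype(np.int64), ints)

    # Beyond 2**24, float32 can no longer separate neighbouring integers,
    # which is why the coder keeps float64 whenever an add_offset is present.
    offset = np.float32(2 ** 25)
    assert offset + np.float32(1) == offset                # lost in float32
    assert np.float64(offset) + 1.0 != np.float64(offset)  # exact in float64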
@@ -230,7 +246,7 @@ def decode(self, variable, name=None):
         if 'scale_factor' in attrs or 'add_offset' in attrs:
             scale_factor = pop_to(attrs, encoding, 'scale_factor', name=name)
             add_offset = pop_to(attrs, encoding, 'add_offset', name=name)
-            dtype = np.float64
+            dtype = self._choose_float_dtype(data, 'add_offset' in attrs)
             transform = partial(_scale_offset_decoding,
                                 scale_factor=scale_factor,
                                 add_offset=add_offset,
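
The decode hunk does not transform the data eagerly; it binds the chosen dtype into a functools.partial so decoding can run lazily later. _scale_offset_decoding itself sits outside this hunk, so the following is only a sketch consistent with the keyword arguments bound here, not the actual implementation:

    from functools import partial

    import numpy as np

    # Hypothetical stand-in for the real _scale_offset_decoding: cast to the
    # chosen float dtype first, then apply the linear CF transform in place.
    def _scale_offset_decoding(data, scale_factor, add_offset, dtype):
        data = np.array(data, dtype=dtype, copy=True)
        if scale_factor is not None:
            data *= scale_factor
        if add_offset is not None:
            data += add_offset
        return data

    transform = partial(_scale_offset_decoding,
                        scale_factor=0.01, add_offset=None, dtype=np.float32)
    print(transform(np.arange(5, dtype=np.int16)).dtype)  # float32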
