Commit 109c194

Add section on compression to VCF docs
1 parent fadbaf8 commit 109c194

1 file changed

+41
-0
lines changed

docs/vcf.rst

@@ -181,6 +181,47 @@ cloud storage. You can access files stored on Amazon S3 or Google Cloud Storage
using ``s3://`` or ``gs://`` URLs. Setting credentials or other options is
typically achieved using environment variables for the underlying cloud store.

Compression
-----------

Zarr gives fine-grained control over how data is compressed. Each variable can use
a different `compression algorithm <https://zarr.readthedocs.io/en/stable/tutorial.html#compressors>`_
and its own list of `filters <https://zarr.readthedocs.io/en/stable/tutorial.html#filters>`_.

The :func:`sgkit.io.vcf.vcf_to_zarr` function tries to choose good compression defaults based on
each variable's dtype and the nature of the data being stored.

For example, ``variant_position`` (from the VCF ``POS`` field) is a monotonically increasing integer
(within a contig), so it benefits from delta encoding, which stores the differences between
successive values. These differences are smaller integers that compress better. The encoding is
specified by using the NumCodecs `Delta <https://numcodecs.readthedocs.io/en/stable/delta.html>`_
codec as a Zarr filter.

When converting from VCF, you can set the default compression algorithm for all variables
by passing ``compressor`` to :func:`sgkit.io.vcf.vcf_to_zarr`. There are trade-offs
between compression speed and size, which this `benchmark <http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html>`_
does a good job of exploring.

Sometimes you may want to override the compression for a particular variable. A good example
is VCF FORMAT fields that are floats. Floats don't compress well, and since there is a value for
every sample they can take up a lot of space. In many cases full float precision is not needed,
so it is a good idea to use a filter to transform the floats to ints, which take less space.

For example, the following code creates an encoding that can be passed to :func:`sgkit.io.vcf.vcf_to_zarr`
to store the VCF ``DS`` FORMAT field to 2 decimal places. (``DS`` is a dosage field with values between
0 and 2, so after scaling by 100 it fits into an unsigned 8-bit int.)::

    from numcodecs import FixedScaleOffset

    # Store each DS value as round(value * 100) in an unsigned 8-bit integer (0 to 200).
    encoding = {
        "call_DS": {
            "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
        },
    }

Note that this encoding won't work for floats that may be NaN, because NaN cannot be represented
as an integer. In that case, consider using
`Quantize <https://numcodecs.readthedocs.io/en/stable/quantize.html>`_ (with ``astype=np.float16``)
or `BitRound <https://numcodecs.readthedocs.io/en/stable/bitround.html>`_.

.. _vcf_low_level_operation:

Low-level operation
