@@ -181,6 +181,47 @@ cloud storage. You can access files stored on Amazon S3 or Google Cloud Storage
using ``s3://`` or ``gs://`` URLs. Setting credentials or other options is
typically achieved using environment variables for the underlying cloud store.

+ Compression
+ -----------
+
+ Zarr offers fine-grained control over how data is compressed. Each variable can use
+ a different `compression algorithm <https://zarr.readthedocs.io/en/stable/tutorial.html#compressors>`_,
+ and its own list of `filters <https://zarr.readthedocs.io/en/stable/tutorial.html#filters>`_.
+
+ The :func:`sgkit.io.vcf.vcf_to_zarr` function tries to choose good defaults for compression,
+ based on the variable's dtype and the nature of the data being stored.
+
+ For example, ``variant_position`` (from the VCF ``POS`` field) is a monotonically increasing integer
+ (within a contig), so it benefits from a delta encoding that stores the differences between successive
+ values, since these are smaller integers that compress better. This encoding is specified using the NumCodecs
+ `Delta <https://numcodecs.readthedocs.io/en/stable/delta.html>`_ codec as a Zarr filter.
+
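The effect of delta encoding can be illustrated with the standard library alone.
The following is a minimal sketch, not how NumCodecs implements ``Delta``: the
positions are synthetic, the helper names ``delta_encode``, ``delta_decode`` and
``compressed_size`` are hypothetical, and ``zlib`` stands in for whichever
compressor is actually configured.

```python
import struct
import zlib
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each difference from the previous value."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    return list(accumulate(deltas))


def compressed_size(ints):
    """Pack as little-endian int32 and return the zlib-compressed size in bytes."""
    raw = struct.pack(f"<{len(ints)}i", *ints)
    return len(zlib.compress(raw))


# Synthetic, strictly increasing positions with gaps between 1 and 100.
steps = [(i * 37) % 100 + 1 for i in range(10_000)]
positions = list(accumulate(steps, initial=1_000_000))

deltas = delta_encode(positions)
raw_size = compressed_size(positions)
delta_size = compressed_size(deltas)
print(f"raw: {raw_size} bytes, delta-encoded: {delta_size} bytes")
```

The deltas are small integers with mostly zero high bytes, which is why a
general-purpose compressor handles them better than the raw positions.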
+ When converting from VCF you can set the default compression algorithm for all variables
+ by passing ``compressor`` in the call to :func:`sgkit.io.vcf.vcf_to_zarr`. There are trade-offs
+ between compression speed and size, which this `benchmark <http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html>`_
+ explores.
+
+ Sometimes you may want to override the compression for a particular variable. A good example
+ is VCF FORMAT fields that are floats. Floats don't compress well, and since there is a value for
+ every sample they can take up a lot of space. In many cases full float precision is not needed,
+ so it is a good idea to use a filter that transforms the float to an int, which takes less space.
+
+ For example, the following code creates an encoding that can be passed to :func:`sgkit.io.vcf.vcf_to_zarr`
+ to store the VCF ``DS`` FORMAT field to 2 decimal places. (``DS`` is a dosage field between 0 and 2,
+ so scaled by 100 it ranges from 0 to 200 and fits into an unsigned 8-bit int.)::
+
+     from numcodecs import FixedScaleOffset
+
+     # Scale by 100 and store as uint8, so 0.00..2.00 maps to 0..200.
+     encoding = {
+         "call_DS": {
+             "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
+         },
+     }
+
+ Note that this encoding won't work for floats that may be NaN, since NaN cannot be represented
+ as an integer. Consider using
+ `Quantize <https://numcodecs.readthedocs.io/en/stable/quantize.html>`_ (with ``astype=np.float16``)
+ or `Bitround <https://numcodecs.readthedocs.io/en/stable/bitround.html>`_ in that case.
+
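To see why two decimal places of a dosage fit in one byte, and why NaN is a
problem, here is a plain-Python sketch of the scale-and-offset arithmetic. It is
an illustration only, not the NumCodecs implementation: ``encode_fixed`` and
``decode_fixed`` are hypothetical helpers, and the rounding details of the real
codec may differ.

```python
import math


def encode_fixed(value, offset=0.0, scale=100.0):
    """Map a float to an integer code: round((value - offset) * scale)."""
    code = round((value - offset) * scale)  # raises ValueError for NaN
    if not 0 <= code <= 255:
        raise ValueError(f"{value} does not fit in an unsigned 8-bit int")
    return code


def decode_fixed(code, offset=0.0, scale=100.0):
    """Invert encode_fixed, up to the precision kept by the scale."""
    return code / scale + offset


dosages = [0.0, 0.25, 1.0, 1.337, 2.0]
codes = [encode_fixed(d) for d in dosages]
decoded = [decode_fixed(c) for c in codes]

# Every code fits in one byte, and each value is recovered to 2 decimal places.
assert all(0 <= c <= 255 for c in codes)
assert all(abs(d - r) <= 0.005 + 1e-9 for d, r in zip(dosages, decoded))

# NaN has no integer representation, so the encoding step fails.
try:
    encode_fixed(math.nan)
except ValueError:
    nan_supported = False
else:
    nan_supported = True
```

A float-preserving filter avoids this failure because NaN stays a valid value
after the transformation.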
.. _vcf_low_level_operation:

Low-level operation