maximise lossless compression of vcf_to_zarr #925

Closed
elswob opened this issue Oct 6, 2022 · 19 comments

@elswob

elswob commented Oct 6, 2022

Moving and storing large VCF files is a problem. One option is to transform them into a more compact data format that contains enough data to fully restore the VCF at a later date (see https://github.com/pystatgen/sgkit/issues/924). The default vcf_to_zarr reduces the size of the data significantly; however, when extra fields are included, e.g. ["INFO/*", "FORMAT/*"], the resulting Zarr store can be larger than the original VCF.
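
For concreteness, this is roughly the kind of call being described (file paths are just placeholders):

from sgkit.io.vcf import vcf_to_zarr

# Convert a VCF to Zarr, pulling in all INFO and FORMAT fields
vcf_to_zarr(
    input="sample.vcf.gz",   # placeholder input VCF
    output="sample.zarr",    # placeholder output store
    fields=["INFO/*", "FORMAT/*"],
)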

@tomwhite
Collaborator

tomwhite commented Oct 6, 2022

Hi @elswob, I did some work on this last year and found that on 1000 genomes chr22 (GT sparsity = 3.7%) I could store the extra fields at 46% of the size of BCF (i.e. about half the size).

You can see the details in slide 10 of this presentation, and the code is here. Using the zstd compressor helped a lot I think.
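
For anyone following along, switching compressor is a matter of passing a numcodecs compressor to vcf_to_zarr; a minimal sketch using zstd via Blosc (paths are placeholders):

from numcodecs import Blosc
from sgkit.io.vcf import vcf_to_zarr

# zstd through Blosc, as used later in this thread
compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)
vcf_to_zarr(input="chr22.vcf.gz", output="chr22.zarr", compressor=compressor)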

@tomwhite
Collaborator

tomwhite commented Oct 6, 2022

BTW there's also a summary on https://github.com/tomwhite/ga4gh-variant-comparison

@jeromekelleher
Collaborator

Have you got an example where we don't do so well on compression @elswob? I'm guessing it's some particular INFO or FORMAT field that's not being dealt with well.

@elswob
Author

elswob commented Oct 6, 2022

Hi both, thanks for the comments. Have you tried genozip (https://genozip.readthedocs.io/index.html)? I'd be interested to see how that compares.

I can confirm similar numbers using sgkit for the 1000 genomes data used in the comparison. The file I'm seeing strange values with is a Beagle imputation output VCF for around 1 million variants and 100 samples based on 1000 genomes data, around 40 MB. If I run it with the params here I get a 138 MB Zarr directory.

@jeromekelleher
Collaborator

Hmm, I wonder if it's the genotype probability field that's taking up the space. Can you run du -sh on the Zarr directory please, to see what the size breakdown of the columns is?
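
(If du isn't convenient, the same breakdown can be computed in Python; a small sketch, assuming the store is at "output.zarr":)

from pathlib import Path

# Per-array on-disk size of a Zarr store, equivalent to `du -sh output.zarr/*`
store = Path("output.zarr")
for array_dir in sorted(p for p in store.iterdir() if p.is_dir()):
    nbytes = sum(f.stat().st_size for f in array_dir.rglob("*") if f.is_file())
    print(f"{nbytes / 1e6:8.1f} MB  {array_dir.name}")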

@elswob
Author

elswob commented Oct 6, 2022

128M	call_DS
1.2M	call_genotype
144K	call_genotype_mask
140K	call_genotype_phased
 12K	sample_id
3.5M	variant_AF
920K	variant_DR2
 76K	variant_IMP
2.7M	variant_allele
 76K	variant_contig
 76K	variant_id
 76K	variant_id_mask
1.2M	variant_position

@jeromekelleher
Collaborator

There we go - I bet we're storing call_DS as a 32 or 64 bit float which isn't compressing very well.

@tomwhite
Collaborator

tomwhite commented Oct 7, 2022

I bet we're storing call_DS as a 32 or 64 bit float which isn't compressing very well.

It will be a 32-bit float if it's a VCF float.

@elswob have you set max_alt_alleles in the call to vcf_to_zarr? We are storing fixed length arrays in Zarr, so the genotypes dimension can grow rapidly with max_alt_alleles, even if it's mainly filled with empty values. One thing I wondered about is using Zarr ragged arrays for cases like this. I'm not sure how well it plays with Xarray, or indeed how efficient it is when processing though. (Note there was some discussion about Zarr ragged arrays in #634, but here we want to store floats, so the problem discussed there about storing ragged arrays of strings doesn't arise.)

Could you share the Xarray Dataset repr so we can see the dimensions and dataset attributes (e.g. max_alt_alleles_seen)?

Another thought I had is trying out codecs like Quantize or Bitround. They are lossy though.
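
A sketch of what that might look like via the encoding argument (the field name and parameters are chosen only for illustration; note that numcodecs spells the class BitRound):

from numcodecs import BitRound, Quantize

# Lossy float filters: keep ~2 decimal digits, or ~8 mantissa bits
encoding = {
    "call_DS": {"filters": [Quantize(digits=2, dtype="f4")]},
    # alternatively: {"filters": [BitRound(keepbits=8)]}
}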

@elswob
Author

elswob commented Oct 7, 2022

@tomwhite I have not set max_alt_alleles in the call. Just tried with defaults, and the parameters in the example you provided.

[Screenshot: Xarray Dataset repr with default settings, showing the dimensions and dataset attributes, including max_alt_alleles_seen = 1]

@tomwhite
Collaborator

tomwhite commented Oct 7, 2022

@elswob thanks for the screenshot. The fact that max_alt_alleles_seen is 1 means you could set max_alt_alleles to 1 (for this dataset at least). Might be worth re-running to see what effect this has on the size of call_DS in this case. The number of genotypes will go from 10 to 3 if my maths is correct.
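
A quick check of that arithmetic, assuming diploid calls and that the previous run allowed 3 alt alleles (4 alleles in total):

from math import comb

# Diploid genotype count is C(n_alleles + 1, 2):
# 4 alleles (max_alt_alleles=3) -> 10 genotypes; 2 alleles (max_alt_alleles=1) -> 3
for max_alt in (3, 1):
    n_alleles = max_alt + 1
    print(max_alt, comb(n_alleles + 1, 2))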

@elswob
Author

elswob commented Oct 7, 2022

Thanks, that helps a bit, down to 75 MB using these params:

fields=["INFO/*", "FORMAT/*"], 
max_alt_alleles=1,

(I tried the zstd compressor but that resulted in larger output)

[Screenshot: Xarray Dataset repr after re-running with max_alt_alleles=1]

 65M	call_DS
628K	call_genotype
436K	call_genotype_mask
436K	call_genotype_phased
 12K	sample_id
3.1M	variant_AF
932K	variant_DR2
436K	variant_IMP
1.8M	variant_allele
436K	variant_contig
436K	variant_id
436K	variant_id_mask
1.4M	variant_position

@tomwhite
Collaborator

tomwhite commented Oct 7, 2022

Thanks. Uncompressed, the DS field would take 406 MB, so that's a 6X compression factor with Zarr. I'm still puzzled why gzipped VCF (I assume that's what you are comparing to) is smaller though. Can you share more info about sparsity, typical values of the DS field, etc? Or maybe post your script to generate the VCF from 1000 genomes chr22?
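
(Back-of-envelope for those figures, assuming one dosage value per call and using the approximate counts quoted earlier in the thread:)

# ~1 million variants x 100 samples x one 4-byte float per call
n_variants, n_samples = 1_000_000, 100
uncompressed_mb = n_variants * n_samples * 4 / 1e6
print(uncompressed_mb, uncompressed_mb / 65)  # ~400 MB uncompressed, ~6x vs the 65 MB on disk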

@jeromekelleher
Collaborator

I'm still puzzled why gzipped VCF is smaller though

I assume these are two-digit base-10 values, which compress poorly when converted to float32? This is a common issue with converting these types of per-genotype floating point values, which take up a lot of space and don't contain a great deal of information. I think they need to be treated with some specific filters to get good performance.

@tomwhite
Collaborator

I assume these are two-digit base-10 values, which compress poorly when converted to float32? This is a common issue with converting these types of per-genotype floating point values, which take up a lot of space and don't contain a great deal of information.

Yes, that makes sense.

I think they need to be treated with some specific filters to get good performance.

Do you think the numcodecs filters like Quantize or Bitround would be suitable, or did you have something else in mind?

I'd be interested in trying out different filters if I could get some representative data to try them on.

@jeromekelleher
Collaborator

Do you think the numcodecs filters like Quantize or Bitround would be suitable, or did you have something else in mind?

Yes, these are the types of things that should help, I would imagine. There's a relatively small number of discrete things being stored (on the order of 100 or 1,000), and filters like this should map them into the right integer range so they can be stored with one byte rather than four.
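
A toy demonstration of the idea, comparing a compressed float32 array of two-decimal dosages with the same values scaled into a single byte first (the array contents are synthetic):

import numpy as np
from numcodecs import Blosc, FixedScaleOffset

# Synthetic two-decimal dosages in [0, 2]
rng = np.random.default_rng(0)
dosages = np.round(rng.uniform(0, 2, size=1_000_000), 2).astype("f4")

compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)
raw = compressor.encode(dosages)
scaled = compressor.encode(
    FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1").encode(dosages)
)
print(len(raw), len(scaled))  # compressed sizes in bytes; the uint8 form is considerably smaller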

I had a quick look around, but can't find any imputed datasets lying around. Any thoughts on where we'd find something representative @elswob?

@tomwhite
Collaborator

I now have a representative VCF that is 17MB in size (compressed).

Running vcf_to_zarr on it with the default settings:

vcf_to_zarr(input=vcf_file, fields=["INFO/*", "FORMAT/*"], output=output, max_alt_alleles=1)

produced Zarr files totalling 37MB in size, as @elswob reported.

The following reduced the size to 14MB:

from numcodecs import BZ2, Blosc, Delta, FixedScaleOffset

from sgkit.io.vcf import vcf_to_zarr

compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)
encoding = {
    "variant_position": {
        # POS is (mostly) monotonically increasing, so store differences
        "filters": [Delta(dtype="i4", astype="i4")],
        "compressor": compressor,
    },
    "variant_AF": {
        # keep 4 decimal places, stored as uint16
        "filters": [FixedScaleOffset(offset=0, scale=10000, dtype="f4", astype="u2")],
        "compressor": compressor,
    },
    "call_DS": {
        # keep 2 decimal places, stored as uint8
        "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
        "compressor": compressor,
    },
    "variant_DR2": {
        # keep 2 decimal places, stored as uint8
        "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
        "compressor": compressor,
    },
}
vcf_to_zarr(
    input=vcf_file,
    fields=["INFO/*", "FORMAT/*"],
    output=output,
    chunk_length=500_000,
    max_alt_alleles=1,
    compressor=compressor,
    encoding=encoding,
)

There are three things here that helped:

  1. Use FixedScaleOffset filters on float fields, so only a few decimal places are stored. (I didn't try Quantize or Bitround, but the effect would probably be similar.)
  2. Use Delta on POS, which is a monotonically increasing value (except at chromosome boundaries), so it benefits from having this filter to store differences.
  3. Increase the chunk size in the variants dimension from 10,000 to 500,000. This depends a lot on the number of samples, which in this case was very small (~10).

I also tried using bzip2 compression:

compressor = BZ2(9)

With this (and reducing the number of decimal places for AF to 2), I got the size down to 8.3MB, albeit with a much slower decompression speed. This is about 25% larger than genozip (which also uses bzip2).
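
A sketch of those two changes relative to the snippet above (the exact settings weren't posted):

from numcodecs import BZ2, FixedScaleOffset

# bzip2 at its highest level, substituted for Blosc/zstd when rebuilding the
# encoding above and when passing compressor= to vcf_to_zarr
compressor = BZ2(9)

# variant_AF entry with 2 decimal places (uint8) instead of 4 (uint16)
encoding["variant_AF"] = {
    "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
    "compressor": compressor,
}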

I think there are some things we can incorporate into the defaults (e.g. POS Delta filter), and others we should document (e.g. storing float fields).

@tomwhite
Collaborator

@ravwojdyla pointed out that #80 is related.

@tomwhite
Collaborator

With #943 I was able to get the Zarr size down to about 16% larger than genozip on the test file (using bzip2).

@tomwhite
Copy link
Collaborator

tomwhite commented May 7, 2025

Closing this, since it is now covered by bio2zarr and the paper.

@tomwhite tomwhite closed this as completed May 7, 2025