maximise lossless compression of vcf_to_zarr #925

Closed
elswob opened this issue Oct 6, 2022 · 19 comments

@elswob

elswob commented Oct 6, 2022

Moving and storing large VCF files is a problem. One option is to transform them into a more compact data format that contains enough data to fully restore the VCF at a later date (see https://github.com/pystatgen/sgkit/issues/924). The default vcf_to_zarr reduces the size of the data significantly; however, when extra fields are included, e.g. ["INFO/*", "FORMAT/*"], the resulting Zarr store can be larger than the original VCF.
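
For concreteness, this is roughly the kind of call being described (file paths are just placeholders):

from sgkit.io.vcf import vcf_to_zarr

# Convert a VCF to Zarr, pulling in all INFO and FORMAT fields
vcf_to_zarr(
    input="sample.vcf.gz",   # placeholder input VCF
    output="sample.zarr",    # placeholder output store
    fields=["INFO/*", "FORMAT/*"],
)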

@tomwhite
Collaborator

tomwhite commented Oct 6, 2022

Hi @elswob, I did some work on this last year and found that on 1000 genomes chr22 (GT sparsity = 3.7%) I could store the extra fields at 46% of the size of BCF (i.e. about half the size).

You can see the details in slide 10 of this presentation, and the code is here. Using the zstd compressor helped a lot I think.
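
For anyone following along, switching compressor is a matter of passing a numcodecs compressor to vcf_to_zarr; a minimal sketch using zstd via Blosc (paths are placeholders):

from numcodecs import Blosc
from sgkit.io.vcf import vcf_to_zarr

# zstd through Blosc, as used later in this thread
compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)
vcf_to_zarr(input="chr22.vcf.gz", output="chr22.zarr", compressor=compressor)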

@tomwhite
Collaborator

tomwhite commented Oct 6, 2022

BTW there's also a summary on https://github.com/tomwhite/ga4gh-variant-comparison

@jeromekelleher
Collaborator

Have you got an example where we don't do so well on compression @elswob? I'm guessing it's some particular INFO or FORMAT field that's not being dealt with well.

@elswob
Author

elswob commented Oct 6, 2022

Hi both, thanks for the comments. Have you tried genozip (https://genozip.readthedocs.io/index.html)? I'd be interested to see how that compares.

I can confirm similar numbers using sgkit for the 1000 genomes data used in the comparison. The file I'm seeing strange values with is a Beagle imputation output VCF for around 1 million variants and 100 samples based on 1000 genomes data, around 40 MB. If I run it with the params here I get a 138 MB Zarr directory.

@jeromekelleher
Collaborator

Hmm, I wonder if it's the genotype probability field that's taking up the space. Can you run du -sh on the Zarr directory please, to see what the size breakdown of the columns is?
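
(If du isn't convenient, the same breakdown can be computed in Python; a small sketch, assuming the store is at "output.zarr":)

from pathlib import Path

# Per-array on-disk size of a Zarr store, equivalent to `du -sh output.zarr/*`
store = Path("output.zarr")
for array_dir in sorted(p for p in store.iterdir() if p.is_dir()):
    nbytes = sum(f.stat().st_size for f in array_dir.rglob("*") if f.is_file())
    print(f"{nbytes / 1e6:8.1f} MB  {array_dir.name}")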

@elswob
Author

elswob commented Oct 6, 2022

128M	call_DS
1.2M	call_genotype
144K	call_genotype_mask
140K	call_genotype_phased
 12K	sample_id
3.5M	variant_AF
920K	variant_DR2
 76K	variant_IMP
2.7M	variant_allele
 76K	variant_contig
 76K	variant_id
 76K	variant_id_mask
1.2M	variant_position

@jeromekelleher
Collaborator

There we go - I bet we're storing call_DS as a 32 or 64 bit float which isn't compressing very well.

@tomwhite
Collaborator

tomwhite commented Oct 7, 2022

I bet we're storing call_DS as a 32 or 64 bit float which isn't compressing very well.

It will be a 32-bit float if it's a VCF float.

@elswob have you set max_alt_alleles in the call to vcf_to_zarr? We are storing fixed length arrays in Zarr, so the genotypes dimension can grow rapidly with max_alt_alleles, even if it's mainly filled with empty values. One thing I wondered about is using Zarr ragged arrays for cases like this. I'm not sure how well it plays with Xarray, or indeed how efficient it is when processing though. (Note there was some discussion about Zarr ragged arrays in #634, but here we want to store floats, so the problem discussed there about storing ragged arrays of strings doesn't arise.)

Could you share the Xarray Dataset repr so we can see the dimensions and dataset attributes (e.g. max_alt_alleles_seen)?

Another thought I had is trying out codecs like Quantize or Bitround. They are lossy though.
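
A sketch of what that might look like via the encoding argument (the field name and parameters are chosen only for illustration; note that numcodecs spells the class BitRound):

from numcodecs import BitRound, Quantize

# Lossy float filters: keep ~2 decimal digits, or ~8 mantissa bits
encoding = {
    "call_DS": {"filters": [Quantize(digits=2, dtype="f4")]},
    # alternatively: {"filters": [BitRound(keepbits=8)]}
}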

@elswob
Author

elswob commented Oct 7, 2022

@tomwhite I have not set max_alt_alleles in the call. Just tried with defaults, and the parameters in the example you provided.

[Screenshot: Xarray Dataset repr with default settings, showing the dimensions and dataset attributes, including max_alt_alleles_seen = 1]

@tomwhite
Collaborator

tomwhite commented Oct 7, 2022

@elswob thanks for the screenshot. The fact that max_alt_alleles_seen is 1 means you could set max_alt_alleles to 1 (for this dataset at least). Might be worth re-running to see what effect this has on the size of call_DS in this case. The number of genotypes will go from 10 to 3 if my maths is correct.
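
A quick check of that arithmetic, assuming diploid calls and that the previous run allowed 3 alt alleles (4 alleles in total):

from math import comb

# Diploid genotype count is C(n_alleles + 1, 2):
# 4 alleles (max_alt_alleles=3) -> 10 genotypes; 2 alleles (max_alt_alleles=1) -> 3
for max_alt in (3, 1):
    n_alleles = max_alt + 1
    print(max_alt, comb(n_alleles + 1, 2))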

@elswob
Author

elswob commented Oct 7, 2022

Thanks, that helps a bit, down to 75 MB using these params:

fields=["INFO/*", "FORMAT/*"], 
max_alt_alleles=1,

(I tried the zstd compressor but that resulted in larger output)

[Screenshot: Xarray Dataset repr after re-running with max_alt_alleles=1]

 65M	call_DS
628K	call_genotype
436K	call_genotype_mask
436K	call_genotype_phased
 12K	sample_id
3.1M	variant_AF
932K	variant_DR2
436K	variant_IMP
1.8M	variant_allele
436K	variant_contig
436K	variant_id
436K	variant_id_mask
1.4M	variant_position

@tomwhite
Collaborator

tomwhite commented Oct 7, 2022

Thanks. Uncompressed, the DS field would take 406 MB, so that's a 6X compression factor with Zarr. I'm still puzzled why gzipped VCF (I assume that's what you are comparing to) is smaller though. Can you share more info about sparsity, typical values of the DS field, etc? Or maybe post your script to generate the VCF from 1000 genomes chr22?
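
(Back-of-envelope for those figures, assuming one dosage value per call and using the approximate counts quoted earlier in the thread:)

# ~1 million variants x 100 samples x one 4-byte float per call
n_variants, n_samples = 1_000_000, 100
uncompressed_mb = n_variants * n_samples * 4 / 1e6
print(uncompressed_mb, uncompressed_mb / 65)  # ~400 MB uncompressed, ~6x vs the 65 MB on disk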

@jeromekelleher
Collaborator

I'm still puzzled why gzipped VCF is smaller though

I assume these are two-digit base-10 values, which compress poorly when converted to float32? This is a common issue with converting these types of per-genotype floating point values, which take up a lot of space and don't contain a great deal of information. I think they need to be treated with some specific filters to get good performance.

@tomwhite
Collaborator

I assume these are two-digit base-10 values, which compress poorly when converted to float32? This is a common issue with converting these types of per-genotype floating point values, which take up a lot of space and don't contain a great deal of information.

Yes, that makes sense.

I think they need to be treated with some specific filters to get good performance.

Do you think the numcodecs filters like Quantize or Bitround would be suitable, or did you have something else in mind?

I'd be interested in trying out different filters if I could get some representative data to try them on.

@jeromekelleher
Collaborator

Do you think the numcodecs filters like Quantize or Bitround would be suitable, or did you have something else in mind?

Yes, these are the types of things that should help, I would imagine. There's a relatively small number of discrete things being stored (on the order of 100 or 1,000), and filters like this should map them into the right integer range so they can be stored with one byte rather than four.
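
A toy demonstration of the idea, comparing a compressed float32 array of two-decimal dosages with the same values scaled into a single byte first (the array contents are synthetic):

import numpy as np
from numcodecs import Blosc, FixedScaleOffset

# Synthetic two-decimal dosages in [0, 2]
rng = np.random.default_rng(0)
dosages = np.round(rng.uniform(0, 2, size=1_000_000), 2).astype("f4")

compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)
raw = compressor.encode(dosages)
scaled = compressor.encode(
    FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1").encode(dosages)
)
print(len(raw), len(scaled))  # compressed sizes in bytes; the uint8 form is considerably smaller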

I had a quick look around, but can't find any imputed datasets lying around. Any thoughts on where we'd find something representative @elswob?

@tomwhite
Collaborator

I now have a representative VCF that is 17MB in size (compressed).

Running vcf_to_zarr on it with the default settings:

vcf_to_zarr(input=vcf_file, fields=["INFO/*", "FORMAT/*"], output=output, max_alt_alleles=1)

produced Zarr files totalling 37MB in size, as @elswob reported.

The following reduced the size to 14MB:

from numcodecs import BZ2, Blosc, Delta, FixedScaleOffset

from sgkit.io.vcf import vcf_to_zarr

compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)
encoding = {
    "variant_position": {
        # POS is (mostly) monotonically increasing, so store differences
        "filters": [Delta(dtype="i4", astype="i4")],
        "compressor": compressor,
    },
    "variant_AF": {
        # keep 4 decimal places, stored as uint16
        "filters": [FixedScaleOffset(offset=0, scale=10000, dtype="f4", astype="u2")],
        "compressor": compressor,
    },
    "call_DS": {
        # keep 2 decimal places, stored as uint8
        "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
        "compressor": compressor,
    },
    "variant_DR2": {
        # keep 2 decimal places, stored as uint8
        "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
        "compressor": compressor,
    },
}
vcf_to_zarr(
    input=vcf_file,
    fields=["INFO/*", "FORMAT/*"],
    output=output,
    chunk_length=500_000,
    max_alt_alleles=1,
    compressor=compressor,
    encoding=encoding,
)

There are three things here that helped:

  1. Use FixedScaleOffset filters on float fields, so only a few decimal places are stored. (I didn't try Quantize or Bitround, but the effect would probably be similar.)
  2. Use Delta on POS, which is a monotonically increasing value (except at chromosome boundaries), so it benefits from having this filter to store differences.
  3. Increase the chunk size in the variants dimension from 10,000 to 500,000. This depends a lot on the number of samples, which in this case was very small (~10).

I also tried using bzip2 compression:

compressor = BZ2(9)

With this (and reducing the number of decimal places for AF to 2), I got the size down to 8.3MB, albeit with a much slower decompression speed. This is about 25% larger than genozip (which also uses bzip2).
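
A sketch of those two changes relative to the snippet above (the exact settings weren't posted):

from numcodecs import BZ2, FixedScaleOffset

# bzip2 at its highest level, substituted for Blosc/zstd when rebuilding the
# encoding above and when passing compressor= to vcf_to_zarr
compressor = BZ2(9)

# variant_AF entry with 2 decimal places (uint8) instead of 4 (uint16)
encoding["variant_AF"] = {
    "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
    "compressor": compressor,
}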

I think there are some things we can incorporate into the defaults (e.g. POS Delta filter), and others we should document (e.g. storing float fields).

@tomwhite
Collaborator

@ravwojdyla pointed out that #80 is related.

@tomwhite
Collaborator

With #943 I was able to get the Zarr size down to about 16% larger than genozip on the test file (using bzip2).

@tomwhite
Copy link
Collaborator

tomwhite commented May 7, 2025

Closing this, since it is now covered by bio2zarr and the paper.

@tomwhite tomwhite closed this as completed May 7, 2025