maximise lossless compression of vcf_to_zarr #925
Hi @elswob, I did some work on this last year and found that on 1000 genomes chr22 (GT sparsity = 3.7%) I could store the extra fields at 46% of the size of BCF (i.e. about half the size). You can see the details in slide 10 of this presentation, and the code is here. Using the …
BTW there's also a summary on https://github.com/tomwhite/ga4gh-variant-comparison
Have you got an example where we don't do so well on compression, @elswob? I'm guessing it's some particular INFO or FORMAT field that's not being dealt with well.
Hi both, thanks for the comments. Have you tried genozip (https://genozip.readthedocs.io/index.html)? I'd be interested to see how that compares. I can confirm similar numbers using sgkit for the 1000 genomes data used in the comparison. The file I'm seeing strange values with is a Beagle imputation output VCF for around 1 million variants and 100 samples based on 1000 genomes data, around 40 MB. If I run it with the params here I get a 138 MB Zarr directory.
Hmm, I wonder if it's the genotype probability field that's taking up the space. Can you do a …?
There we go - I bet we're storing …
It will be a 32-bit float if it's a VCF float. @elswob have you set …? Could you share the Xarray Dataset repr so we can see the dimensions and dataset attributes (e.g. …)? Another thought I had is trying out codecs like Quantize or Bitround. They are lossy though.
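For anyone wanting to experiment with those codecs, here is a minimal sketch of wiring a lossy numcodecs filter into the `encoding` argument of `vcf_to_zarr`. The target field, file paths, and the `digits`/`keepbits` values are placeholders, not from this thread, and `BitRound` needs a reasonably recent numcodecs release:

```python
from numcodecs import BitRound, Blosc, Quantize

from sgkit.io.vcf import vcf_to_zarr

# Quantize keeps a fixed number of decimal digits; BitRound keeps a fixed
# number of mantissa bits. Both are lossy, so pick a precision you can live with.
encoding = {
    "call_DS": {  # hypothetical target field
        "filters": [Quantize(digits=2, dtype="f4")],
        # Alternative: "filters": [BitRound(keepbits=8)],
        "compressor": Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE),
    },
}

vcf_to_zarr(
    input="imputed.vcf.gz",   # placeholder path
    output="imputed.zarr",    # placeholder path
    fields=["INFO/*", "FORMAT/*"],
    encoding=encoding,
)
```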
@tomwhite I have not set …
@elswob thanks for the screenshot. The fact that …
Thanks, that helps a bit, down to 75 MB using these params: …
(I tried the …)
Thanks. Uncompressed, the DS field would take 406 MB, so that's a 6X compression factor with Zarr. I'm still puzzled why gzipped VCF (I assume that's what you are comparing to) is smaller though. Can you share more info about sparsity, typical values of the DS field, etc? Or maybe post your script to generate the VCF from 1000 genomes chr22?
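As a rough sanity check on that 406 MB figure (the exact variant count is an assumption; the thread only says "around 1 million variants" and 100 samples):

```python
# ~1 million variants x 100 samples, DS stored as float32 (4 bytes per value).
n_variants = 1_000_000
n_samples = 100
print(n_variants * n_samples * 4 / 1e6, "MB")  # 400.0 MB, in line with the 406 MB quoted
```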
I assume these are two-digit base-10 values, which compress poorly when converted to float32? This is a common issue when converting these types of per-genotype floating point values, which take up a lot of space and don't contain a great deal of information. I think they need to be treated with some specific filters to get good performance.
Yes, that makes sense.
Do you think the numcodecs filters like Quantize or Bitround would be suitable, or did you have something else in mind? I'd be interested in trying out different filters if I could get some representative data to try them on.
Yes, these are the types of things that should help, I would imagine. There's a relatively small number of discrete values being stored (100 or 1000), and filters like this should map them into the right integer range so they can be stored with one byte rather than four. I had a quick look around, but can't find any imputed datasets lying around. Any thoughts on where we'd find something representative, @elswob?
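As a concrete illustration of that mapping (a sketch using numcodecs directly, not code from the thread): dosage-like values with two decimal places in [0, 2] can be scaled into a single unsigned byte and recovered on decode, which is what the FixedScaleOffset filters in the next comment do:

```python
import numpy as np
from numcodecs import FixedScaleOffset

# Dosage-like values with two decimal places in the range [0, 2].
ds = np.array([0.0, 0.03, 1.25, 1.99, 2.0], dtype="f4")

# Multiply by 100 and round, so each value fits in one unsigned byte instead of four.
codec = FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")
packed = codec.encode(ds)        # uint8: [0, 3, 125, 199, 200]
restored = codec.decode(packed)  # float32, recovered to 2 decimal places
```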
I now have a representative VCF that is 17 MB in size (compressed). Running `vcf_to_zarr(input=vcf_file, fields=["INFO/*", "FORMAT/*"], output=output, max_alt_alleles=1)` produced Zarr files totalling 37 MB in size, as @elswob reported. The following reduced the size to 14 MB:

```python
from numcodecs import BZ2, Blosc, Delta, FixedScaleOffset  # BZ2 is used in the bzip2 variant below

from sgkit.io.vcf import vcf_to_zarr

compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)

encoding = {
    # Positions are sorted, so delta encoding makes them highly compressible.
    "variant_position": {
        "filters": [Delta(dtype="i4", astype="i4")],
        "compressor": compressor,
    },
    # Store AF to 4 decimal places as an unsigned 16-bit integer.
    "variant_AF": {
        "filters": [FixedScaleOffset(offset=0, scale=10000, dtype="f4", astype="u2")],
        "compressor": compressor,
    },
    # Store DS and DR2 to 2 decimal places as unsigned bytes.
    "call_DS": {
        "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
        "compressor": compressor,
    },
    "variant_DR2": {
        "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
        "compressor": compressor,
    },
}

vcf_to_zarr(
    input=vcf_file,
    fields=["INFO/*", "FORMAT/*"],
    output=output,
    chunk_length=500_000,
    max_alt_alleles=1,
    compressor=compressor,
    encoding=encoding,
)
```

There are three things here that helped: delta-encoding `variant_position`, storing the float fields (AF, DS, DR2) as scaled integers via FixedScaleOffset, and using a much larger `chunk_length` (together with a higher zstd compression level).
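Not from the thread, but a quick way to see which arrays dominate a store like this (assuming zarr-python 2.x and a local Zarr directory at `output`):

```python
import zarr

store = zarr.open_group(output, mode="r")
# Print arrays largest-first, comparing on-disk size with uncompressed size.
for name, arr in sorted(store.arrays(), key=lambda kv: kv[1].nbytes_stored, reverse=True):
    print(f"{name:20s} {arr.nbytes_stored / 1e6:8.1f} MB on disk "
          f"({arr.nbytes / 1e6:.1f} MB uncompressed)")
```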
I also tried using bzip2 compression with `compressor = BZ2(9)`. With this (and reducing the number of decimal places for AF to 2), I got the size down to 8.3 MB, albeit with a much slower decompression speed. This is about 25% larger than genozip (which also uses bzip2). I think there are some things we can incorporate into the defaults (e.g. delta-encoding POS).
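A sketch of what that bzip2 variant might look like, reusing the `encoding` dict from above; the comment doesn't show the exact settings, so the details here are an assumption:

```python
from numcodecs import BZ2, FixedScaleOffset

# bzip2 at maximum compression level; much slower to decompress than Blosc/zstd.
bz2_compressor = BZ2(9)

# Keep only 2 decimal places for AF so it also fits in a single unsigned byte.
encoding["variant_AF"] = {
    "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
    "compressor": bz2_compressor,
}
for name in ("variant_position", "call_DS", "variant_DR2"):
    encoding[name]["compressor"] = bz2_compressor

vcf_to_zarr(
    input=vcf_file,
    fields=["INFO/*", "FORMAT/*"],
    output=output,
    chunk_length=500_000,
    max_alt_alleles=1,
    compressor=bz2_compressor,
    encoding=encoding,
)
```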
@ravwojdyla pointed out that #80 is related. |
With #943 I was able to get the Zarr size down to about 16% larger than genozip on the test file (using bzip2). |
Closing this since it is now covered by bio2zarr and the paper.
Moving and storing large VCF files is a problem. One option is to transform them into a more compact data format that contains enough data to fully restore the VCF at a later date (see https://github.com/pystatgen/sgkit/issues/924). The default vcf_to_zarr reduces the size of the data significantly; however, when extra fields are included, e.g. `["INFO/*", "FORMAT/*"]`, the resulting Zarr structure can be larger than the original VCF.