Improve default VCF compression #937

Merged
merged 5 commits into from
Oct 18, 2022

tomwhite
Collaborator

This implements some of the findings in #925

  • Adds Delta encoding (as a filter) for VCF POS fields.
  • Issues a warning if there are float FORMAT fields with no filters set.
  • Adds a section to the VCF documentation covering compression, with an example of how to truncate float fields to 2dp.
  • 0efd28a fixes a bug where specifying filters for a dataset variable caused the default sgkit compressor to be ignored, because the encoding dictionaries weren't being merged correctly.
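
To illustrate why delta encoding helps for a sorted field such as POS, here is a minimal standard-library sketch. It does not call numcodecs or sgkit; the helper names and the synthetic positions are assumptions made for the example, and zlib stands in for whatever compressor is actually configured.

```python
import random
import zlib
from array import array
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each value minus its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    return list(accumulate(deltas))


# Synthetic, strictly increasing positions (hypothetical data, not a real VCF).
rng = random.Random(0)
positions = list(accumulate(rng.randint(1, 1000) for _ in range(10_000)))

deltas = delta_encode(positions)

# Compress both representations as 64-bit integers with the same compressor.
raw_size = len(zlib.compress(array("q", positions).tobytes()))
delta_size = len(zlib.compress(array("q", deltas).tobytes()))
print(f"raw: {raw_size} bytes, delta-encoded: {delta_size} bytes")
```

The deltas are small numbers with many repeated high-order zero bytes, which is what gives the compressor more to work with than the steadily growing raw positions.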

I considered always rounding FORMAT float fields to 2dp, but there are problems with NaN (can't use FixedScaleOffset since it doesn't preserve NaNs) and precision (which integer type to use for FixedScaleOffset depends on the range of the values).
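
The two problems can be seen in a small fixed-point sketch. This is not the numcodecs `FixedScaleOffset` code; `smallest_int_dtype` is a hypothetical helper that only mimics the idea of storing `round(value * 100)` as an integer.

```python
import math

# Signed integer ranges, smallest first.
INT_RANGES = [
    ("int8", -(2**7), 2**7 - 1),
    ("int16", -(2**15), 2**15 - 1),
    ("int32", -(2**31), 2**31 - 1),
]


def smallest_int_dtype(values, decimals=2):
    """Return the narrowest integer type that can hold the scaled values."""
    scale = 10**decimals
    scaled = []
    for value in values:
        if math.isnan(value):
            # Integers have no NaN, so a missing value cannot be stored as-is.
            raise ValueError("NaN has no integer representation")
        scaled.append(round(value * scale))
    lowest, highest = min(scaled), max(scaled)
    for name, type_min, type_max in INT_RANGES:
        if type_min <= lowest and highest <= type_max:
            return name
    raise OverflowError("values do not fit in a 32-bit integer")


print(smallest_int_dtype([0.0, 1.27]))    # fits in 8 bits
print(smallest_int_dtype([0.0, 327.68]))  # needs more than 16 bits
```

So the right integer type cannot be chosen without knowing the value range of each field up front, and any field that uses NaN for missing values needs a separate sentinel.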

Quantize and BitRound do better here (both support NaN), but they are not as compact as FixedScaleOffset. It's probably worth investigating them further (another PR?); for now, this PR just adds a warning that suggests setting filters, which is a good start.
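
As a rough illustration of the bit-rounding idea (not the numcodecs implementation), the sketch below rounds a float32 to a fixed number of mantissa bits using only the standard library. `bitround_f32` is a hypothetical name, and passing NaN and infinity through unchanged is a choice made for this sketch.

```python
import math
import struct


def bitround_f32(value, keepbits):
    """Round a float32 to `keepbits` mantissa bits (round half to even)."""
    if not 1 <= keepbits <= 22:
        raise ValueError("keepbits must be between 1 and 22")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    if bits & 0x7F800000 == 0x7F800000:
        return value  # NaN or infinity: returned unchanged in this sketch
    drop = 23 - keepbits  # number of low mantissa bits to zero out
    bits += ((bits >> drop) & 1) + (1 << (drop - 1)) - 1
    bits &= (0xFFFFFFFF >> drop) << drop
    (rounded,) = struct.unpack("<f", struct.pack("<I", bits))
    return rounded


print(bitround_f32(0.1234, keepbits=7))
print(bitround_f32(float("nan"), keepbits=7))
```

The result is still a 4-byte float with its low bits zeroed, so the saving depends entirely on the downstream compressor. That is consistent with the observation above that this family of filters keeps NaN but ends up less compact than a narrow integer encoding.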

@tomwhite tomwhite marked this pull request as ready for review October 18, 2022 11:28
@tomwhite tomwhite force-pushed the improve-default-vcf-compression branch from 109c194 to 6260a9b Compare October 18, 2022 12:07
Collaborator

@benjeffery benjeffery left a comment

Nice! One question about testing.

A thought - you could add a generic asv benchmark for compression ratio of some example VCFs:
https://asv.readthedocs.io/en/stable/writing_benchmarks.html#tracking-generic
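
A generic asv tracking benchmark is a function whose name starts with `track_` and which returns a number, optionally with a `unit` attribute. The sketch below follows that convention; the class name and the synthetic payload are assumptions, and a real benchmark would presumably convert an example VCF and measure the stored size instead of calling zlib directly.

```python
import zlib


class CompressionRatioSuite:
    """Hypothetical asv suite tracking a compression ratio over time."""

    def setup(self):
        # Placeholder payload standing in for the contents of an example VCF.
        self.raw = b"0/1:35:4.25,0.10,9.99\t" * 5_000

    def track_compression_ratio(self):
        # asv records the returned number for each benchmarked commit.
        return len(self.raw) / len(zlib.compress(self.raw))

    track_compression_ratio.unit = "ratio"
```

asv does not need to be imported by the benchmark file itself; it discovers the suite by name when the benchmarks are run.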

or "filters" not in merged_encoding[var]
)
):
warnings.warn(
Collaborator

Might be nice to have an explicit test for the warning with pytest.warns?
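
Such a test could look roughly like the sketch below. `warn_if_unfiltered_floats`, the `call_GL` variable name, the message text, and the use of `UserWarning` are all assumptions for illustration; the actual function and warning class in the PR may differ.

```python
import warnings

import pytest


def warn_if_unfiltered_floats(float_vars, encoding):
    """Hypothetical stand-in for the check that emits the new warning."""
    for var in float_vars:
        if var not in encoding or "filters" not in encoding[var]:
            warnings.warn(
                f"No filters set for float variable {var!r}; "
                "consider setting filters to improve compression.",
                UserWarning,
            )


def test_warns_when_float_field_has_no_filters():
    # pytest.warns fails the test if no matching warning is raised.
    with pytest.warns(UserWarning, match="call_GL"):
        warn_if_unfiltered_floats(["call_GL"], encoding={})


def test_no_warning_when_filters_are_set():
    # Turn warnings into errors so an unexpected warning fails the test.
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        warn_if_unfiltered_floats(["call_GL"], {"call_GL": {"filters": []}})
```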

Collaborator Author

Good idea - added

Collaborator

@jeromekelleher jeromekelleher left a comment

LGTM

@tomwhite
Collaborator Author

> A thought - you could add a generic asv benchmark for compression ratio of some example VCFs:
> https://asv.readthedocs.io/en/stable/writing_benchmarks.html#tracking-generic

I opened #938 for this.

@jeromekelleher jeromekelleher added the auto-merge label (Auto merge label for mergify test flight) Oct 18, 2022
@mergify mergify bot merged commit 0454e91 into sgkit-dev:main Oct 18, 2022