-
Notifications
You must be signed in to change notification settings - Fork 35
Improve default VCF compression #937
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve default VCF compression #937
Conversation
Issue warning for VCF FORMAT float fields
109c194
to
6260a9b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! One question about testing.
A thought - you could add a generic asv
benchmark for compression ratio of some example VCFs:
https://asv.readthedocs.io/en/stable/writing_benchmarks.html#tracking-generic
or "filters" not in merged_encoding[var] | ||
) | ||
): | ||
warnings.warn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be nice to have an explicit test for the warning with pytest.warns
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea - added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I opened #938 for this. |
This implements some of the findings in #925
I considered always rounding FORMAT float fields to 2dp, but there are problems with NaN (can't use FixedScaleOffset since it doesn't preserve NaNs) and precision (which integer type to use for FixedScaleOffset depends on the range of the values).
Quantize and Bitround do better here (both support NaN), but they are not as compact as FixedScaleOffset. It's probably worth doing more investigation into them (another PR?), but so far in this PR I have just added a warning that suggests setting filters, which is a good start.