
Use of int8 for call_genotype results in integer overflow with complex variants #640


Closed
timothymillar opened this issue Jul 26, 2021 · 4 comments · Fixed by #686
Labels
data representation: Issues related to how data is represented (data types, data structures, indexes, access methods, etc.)

Comments

@timothymillar
Collaborator

I'm working with some microhaplotype variant calls in which the number of alleles can exceed 300 (likely due to poor-quality calls at some loci), which overflows the int8 used for call_genotype. It would be ideal if the call_genotype dtype were configurable and/or automatically set based on the max_alt_alleles parameter. Similar discussion in #584.
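For readers following along, here is a minimal sketch of the overflow described above, together with one possible way to pick a dtype from max_alt_alleles. It uses plain NumPy rather than the project's own code; the helper name `smallest_genotype_dtype` is hypothetical and is not part of any library API.

```python
import numpy as np

# Allele indices as they might appear at a site with more than 300 alleles.
allele_indices = np.array([0, 1, 127, 128, 300], dtype=np.int64)

# Casting to int8 wraps silently: anything above 127 becomes a wrong value.
as_int8 = allele_indices.astype(np.int8)
print(as_int8.tolist())  # [0, 1, 127, -128, 44]


def smallest_genotype_dtype(max_alt_alleles):
    """Hypothetical helper: smallest signed integer dtype whose maximum
    can hold allele index `max_alt_alleles` (index 0 is the reference).

    Signed types are used so negative sentinel values stay available.
    """
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        if max_alt_alleles <= np.iinfo(dtype).max:
            return np.dtype(dtype)
    raise ValueError(f"max_alt_alleles too large: {max_alt_alleles}")


print(smallest_genotype_dtype(3))    # int8
print(smallest_genotype_dtype(300))  # int16
```

With a limit of 300 alternate alleles this sketch would select int16, which doubles the storage of the genotype array relative to int8 but keeps every index representable.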

@jeromekelleher
Collaborator

+1 - this is definitely an issue. Setting based on max_alt_alleles is good, and I think that seals the deal on mapping any alleles we can't represent to missing data.
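As an illustration of the "map to missing" idea, the sketch below replaces any allele index that the target dtype cannot hold with a missing-value sentinel before casting. The sentinel value of -1 and the function name `encode_genotypes` are assumptions made for this example, not a description of how the fix in #686 was implemented.

```python
import numpy as np

MISSING = -1  # assumed sentinel for a missing allele call


def encode_genotypes(allele_indices, dtype=np.int8):
    """Hypothetical helper: cast allele indices to `dtype`, mapping any
    index above the dtype's maximum to MISSING instead of letting it wrap.

    Returns the encoded array and the number of calls that were masked.
    """
    indices = np.asarray(allele_indices, dtype=np.int64)
    out_of_range = indices > np.iinfo(dtype).max
    encoded = np.where(out_of_range, MISSING, indices).astype(dtype)
    return encoded, int(out_of_range.sum())


# Three diploid calls; the last one already contains a missing allele.
calls = [[0, 1], [127, 128], [300, -1]]
encoded, n_masked = encode_genotypes(calls)
print(encoded.tolist())  # [[0, 1], [127, -1], [-1, -1]]
print(n_masked)          # 2
```

Returning the count of masked calls makes the data loss visible to the caller, which could then warn or choose a wider dtype.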

@tomwhite added the "data representation" label on Jul 26, 2021
@tomwhite
Collaborator

> It would be ideal if the call_genotype dtype was configurable and/or automatically set based on the max_alt_alleles parameter.

+1

BTW I just opened #643, which is tangentially related. @timothymillar I wonder if you see a few very long alleles in your data?

@timothymillar
Collaborator Author

> I wonder if you see a few very long alleles in your data?

I'm currently working with alleles that are 120 bp long. These are fixed-size "chunks" across the genome (targeted sequencing), so fixed-length strings are suitable. But I think we'll use more variable allele lengths in future, as these chunks need some tuning.
We also use freebayes quite regularly, which can produce highly variable allele lengths. However, most of the data I'm working with is targeted sequencing, so the datasets are quite small in the variants dimension.

@tomwhite
Collaborator

Good to know - thanks.


3 participants