-
Notifications
You must be signed in to change notification settings - Fork 35
Genotype call array to dosage #21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Perhaps this issue should be merged with #3? |
In #38 I have created an implementation that
|
I don't want to complicate things too early but as WES/WGS becomes more common than genotyping arrays, multiallelic sites become more common. Should we strive for simplicity early (i.e. biallelic only) so that we can solve for other issues in library design now, or should we design for complexity up front (i.e. multiallelic sites) so that we don't have to revise things later? I can see arguments for both approaches, so I'm happy to defer to the team's desires. I think @alimanfoo already works primarily with WGS data, so getting his insights would be valuable. |
In general I think we need to assume everything is multiallelic, from the start. However, is it clear that we need to worry about this for dosages? This is a stupid question, but can we get dosage data for anything other than genotype arrays? |
It's pretty common in GWAS to compute dosage based on genotype likelihood from WES/WGS for regression rather than hard calls, so that's one example at least. |
OK, seems to me like we have to grasp the multi-allelic nettle from the start then. |
+1 for thinking multiallelic from the start, will make some entomologists very happy :-) (And also +1 for thinking about any ploidy from the start - will hopefully make us some botanist friends :-)) |
I'm not an expert here, but I was wondering if we should consider separately these two different cases:
Genotype dosage is in general I believe modelled as a continuous variable, i.e., should have a float data type and can take any value between 0 and 2 (for diploid biallelic). I'm guessing genotype dosage is rarely calculated for multiallelic variants. Perhaps this issue was badly named and/or should be split into two? |
Just to add, perhaps there are in fact three cases:
The last two cases may be structurally identical in terms of how we represent the data, but are conceptually slightly different and wanted to check these are all paths we want to cover? |
Also mentioning one other consideration, handling of missing data. This might not be the best place to discuss, but I'm conscious that converting genotype calls to some kind of allele dosage-like representation can induce a bias in the presence of missing data. This may not ever come up when dealing with data from chip + imputation because there are no missing calls. But will come up when dealing with WES/WGS data. The obvious thing to avoid is creating a bias towards the reference allele. E.g., if you assume biallelic sites, and you compute dosage of the alternate allele, then both a hom ref call and a missing call end up with the same dosage (0). |
Should've read this before I made previous comment :-) |
One question I will have at some point, is how do you deal with missing calls in PCA? A related question, how do you deal with multiallelic variants in PCA? That probably deserves it's own issue, but lodging the thought here for now. |
BTW the most common approach I see for dealing with multiallelic sites when starting from WES/WGS for PCA, kinship estimation, LD pruning, association tests, etc. is to split the sites as new biallelic variants and recode the calls/dosages. Is that something you've used before @alimanfoo? I was assuming we could support more alleles in a lot of analyses with something like a |
Just to add a relevant (but very niche) example to the discussion: In short: +1 for polyploid, multi-allelic, continuous dosage (sorry) |
@alimanfoo may want to do PCA w/ multi-allelic data (cf. #95) that could exercise the more complex representation. Another path forward: enumerate the specific subtasks that need to be resolved for specific methods. |
Sorry @alimanfoo I misunderstood you today when you were asking about using dosages in GWAS. I tend to think of them as dosage = fn(gentoype probabilities) and alt allele counts = fn(hard calls) but I was definitely mistaken in saying that we don't frequently use dosages if you consider alt allele counts to be the same thing. I think this right on, or it matches how I think of these:
Some thoughts on this:
What do you guys think of this as a plan for moving forward:
|
Sorry I think I have confused the nomenclature here. The dosage array that you are describing (n_variants, n_samples) has been used with biallelic SNPs for GWAS in polyploids under the name "continuous genotype values". |
Thanks @timothymillar, makes sense. I do think of a dosage array with that definition but often ignore the alleles dimension so hopefully we're all circling around the same things now. Also that reference is great, thanks for that! In general, I think of multi-allelic considerations as being between the Scylla of losing information on which variants are at the same locus (in the splitting / one-hot encoding case) and the Charybdis of trusting that any statistic that generalizes beyond the bi-allelic case can accurately reflect what impact those alternate alleles have relative to one another (few to none do afaik). I'll probably keep referring to splitting alleles as the solution to dealing with multiple of them because the latter case is much more important to me in our GWAS efforts. In other words, I'd gladly split the alleles rather than use some dosage calculation that equally weights two alternate alleles in a coding region where one is synonymous and the other results in a complete loss of function. I'm assuming not splitting alleles will be more important in popgen stats (e.g. https://github.com/pystatgen/sgkit/issues/94). My hope is that we can generalize as many functions as possible to assume multiple alleles in support of those and then add flags or examples that show how to do the same in the bi-allelic case without losing much efficiency. |
Raising this issue to discuss implementing a function to convert a genotype call array to a dosage array.
In general, the output array should have at least two dimensions, with the first two dimensions being (variants, samples). The array elements give the dosage of each allele, i.e., how many copies of an allele are carried by the individual.
Some questions for discussion
Some related functions (not necessarily what we want to copy, but for reference):
The text was updated successfully, but these errors were encountered: