-
Notifications
You must be signed in to change notification settings - Fork 77
add individual statistics (eg heterozygosity) #166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Out of interest, is this generalizable to non-diploids? And you might need to catch cases like the X chromosome. |
Do we want this for 0.2.0 @petrelharp, or can we push it down the road to 0.2.1? I'd like to ship a reasonably complete set of statistics as soon as we can - the question is where to draw the line. |
Push it down the road, unless there's someone else wanting willing to do this. |
Richard Kerr has asked about this on the email list, so I'm going to write out here how to do this in at least an inefficient way. Concretely, we want to compute mean individual heterozygosity: in other words, we want to (a) compute mean diversity (pi) among all the chromosomes of each individual, and (b) average these values. To do this for a tree sequence import numpy as np
def mean_heterozygosity(ts):
ind_nodes = []
for ind in ts.individuals():
ind_nodes.append(ind.nodes)
# the vector of per-individual heterozygosities:
ind_het = ts.diversity(ind_nodes, mode="site")
mean_het = np.mean(ind_het)
return mean_het There is interest in computing F_IS, as well. Referring to wikipedia,
where
So, we can define def F_IS(ts, sample_sets):
mean_het = mean_heterozygosity(ts)
pi = ts.diversity(mode="site")
F_IT = 1 - mean_het / pi
F_ST = ts.Fst(sample_sets, mode="site")
return 1 - (1 - F_IT) / (1 - F_ST) Note that this would be a just fine way to implement this in tskit if anyone wants to do it. It would be nice to do it in a more efficient way - if there are a lot of individuals in the tree sequence, this will take longer than it needs to, but it should be fine in many cases. Comments or corrections welcome! |
I've been studying F_ST this year so this sounds like a good study project/homework for me to do. I've been reading about F_ST material in Weir's Genetic Data Analysis II book and the foundational papers by Wright, Cockerham, Hill and Weir. I've also written up a very formal mathy definition in F_ST: http://www.castedo.com/osa/136/ Does it make sense that this be coded into Python so it is most readable or in C so that it is more performant? |
@castedo There's an example in the dev docs of how to code in python first (as you need it for the tests) then once that has been through review to check it's doing the right thing, you do it in C. |
@benjeffery Thanks for that guidance, I'll follow that. Not sure what the usual process is for assigning issues. I just assigned this to myself. I'm reading the paper "Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes" before starting to work on this. This looks like the high-level background and motivation for this area of the code. Feel free to ping me if anyone wants to know how far along I am on this. If anybody knows of specific computation scenarios where speed of computing this stat is relevant, let me know. Doesn't need to be something so well established as a benchmark. Just anything that is vaguely realistic. I just want to avoid imagining a perf scenario that doesn't really exist in research. |
Thanks @castedo, that sounds ideal. The benchmarks we did in the paper you mentioned will give a good idea of the sort of scenarios we're imagining, we'd be applying individual stats in fairly similar cases. |
This is excellent, ping for input on how to proceed! But sounds like you are going on the right direction. |
I see many directions in the scope of this issue #166 :
I can express a number of mathematical thoughts regarding B) and C). From a mathematical standpoint I see a number of intersting proproperties of these calculations which maybe folks would like to see. But I think B) and C) would be best tracked separately and made out of scope for this issue #166. Especially with the F statistics I see muliple definitions in the liturature depending on the model being estimated, whether the F statistic is an estimator or a proability model parameter (to be estimated) and how F statistics are combined across sites. So that leaves A) and D) for which I'm torn on how to proceed. My inclination is to proceed on A) since it's focused and probably not a lot of work and discussion. For A) I can just do it and get it done. On the other hand, part of me thinks D) is the right thing to do. If folks would like to discuss why I think D) might be the right thing to do, just holler. @petrelharp, input on proceeding? I've read "Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes" (great work!) and now I'm going to do the tskit tutorial. I do not yet see how non-autosomal chromosome pairs are handled in tskit (especially the pseudo-autosomal regions of the XY chromsomes). I think I'll see soon. If not, any pointers or tips appreciated. |
@castedo my understannding is that there is no notion of calculating statistics per individual, yet, so there are only haploid chromosomes everywhere. @petrelharp correct me if I am wrong. |
@gregorgorjanc one can view individuals as very small sample sets (sub-populations)(i.e. each individuals is a tiny weeny population of haploids/monoploids). So from that view, there are statistics, since there are statistics for sample sets. But yes, not at a high-level. |
Let's see! First, ploidy is naturally encoded based on how many chromosomes an individual has. And I'm not aware of any individual-level statistics that are defined differently for, or specific to, non-diploids. For instance, heterozygosity is "what's the chance two randomly chosen alleles match"; this works fine for polyploids, and for haploids the value is I agree, I think we should get an API going, for (A). Note: by "heterozygosity" here we mean "observed heterozygosity", because there's also "expected heterozygosity", which is the def observed_heterozygosity(self, individual_sets, **kwargs):
out = []
for iset in individual_sets:
het = self.diversity(sample_sets=[self.individual(i).nodes for i in iset], **kwargs)
out.append(np.mean(het, axis=<the axis that averages across individuals>))
return np.array(out) I toyed with the idea of just adding Thinking about (D) it would help to have a list of what other individual-level statistics we want. As for F_IS I think we should not be afraid to pick what we think is the right definition; and we can always add others later by adding a |
@petrelharp thx! Quick question on the first part:
Suppose a human zygote with Klinefelter syndrome (XXY) splits into almost identical twins where by some freak event, one twin gets copies of both distinct X chromosomes but not the Y and the other twin gets just one X chromosome and the Y. So they are identical twins normal in every way except one is female and the other male missing one of the X chromosomes of the sister twin. Is there a consistent approach in the field as how heterozygosity is defined on the sex chromosomes such that one can say the female twin most likely has higher vs lower vs equal heterozygosity as the male twin? It would make sense to me that the male twin would have much higher heterozygosity because the Y chromosome would essentially be accounted as being an X with a crazy massive set of mutations. It would not make sense to me to say heterozygosity is undefined for the sex chromsomes in the male twin or that the male twin has lower heterozygosity. But that's me, any sense what is normal in the field? Also @gregorgorjanc |
And for simplicity of illustration, let say heterozygosity is 0.1 across all the non-sex chromosomes and 0.2 for the X and the pseudo-autosomal regions. And for numerical simplicity, the X chromosome is weighted at exactly 5% in the female genome. So I think the female twin would have heterozygosity of 0.105 = (0.95 * 0.1 + 0.05 * 0.2). And the question is whether there is consistent definition in the field that would define heterozygosity for the male twin. |
Hm, well, it's safe to say that there is no standard definition of heterozygosity in that situation. But, I think that this would not be the way forward:
|
I don't think we can cover every possible corner case here @castedo - nature is pretty creative with coming up with weird stuff. "Does what you'd want in most cases" is what we're looking for initially. If there's a need to cover the more exotic cases then we can deal with that later. |
So wrapping up on the scope of this:
With @petrelharp's
Handling sex chromosomes is way out of scope right now. There are challenging issues in trying to deal with this and I'll not worry about.
"observed heterozygosity" has a consistent meaning in the field (I think, I hope!) and @petrelharp's clatification is a totally logical extension of it and NaN is the value for monoploids. (As a side note, this definition of heterozygosity seems sensible for diploids, because it has a max value of 1. But it seem wonky to me to use it rather than 'gene diversity' (see #961), as a statistic to use across many types of ploidy. Just a sort of esthetic personal reaction.)
Hammer out the details in (or after) #858 and make a separate GitHub issue to track implementing F_IT and F_IS.
This is too hard to hash out now in Github issues. More appropriate for brainstorming face-(to-screen-to-)face and/or after all this other stuff become more clear. |
It's been 5 months now, so I should acknowledge I'm unlikely to get to this (so un-assigning myself). I hope I didn't cause too much useless noise here. Sorry about that if so. |
No worries @castedo! Thanks for updating the issue. |
Following on the algorithm for computing additive breeding values in #1547, here's an outline of an algorithm to compute the number of homozygous mutations per individual (which we could then use to compute individual heterozygosity). Or, more generally, given a vector In the algorithm described here, partial sums for haplotypes are propagated down the tree, until they reach the samples. But now, output corresponds to individuals, not to samples; and furthermore mutations only count towards the output if they are inherited by the MRCA of the nodes of the individual. So, we can use that algorithm, nearly unchanged, as long as we also maintain, for each node in the tree, the set of individual indices for which that node is the MRCA. The output will be
Step 1b is the hard one - I'm not clear on how efficiently it's possible to do that yet. |
Sounds good @petrelharp. Maybe worth knocking up a simple Python version where we don't think too hard about efficiency? |
As this is showing up in the SLiM list, here's how to compute individual heterozygosities:
|
Is there a neat way to get the mean here over all individuals, other than doing |
Hello, I'm trying to do some heterozygosity calculations with tskit and I just wanted to confirm that I'm interpreting things correctly from this thread discussion. If I want to calculate observed heterozygosities per individual in my population, I could do something like this:
(where population is an array of individual IDs) And if I want to calculate expected heterozygosity of the population, I could do:
(where population_nodes is an array of nodes IDs for all individuals in the population) Is that correct? |
Something like
should do it - very simple, but we want to make it easy and natural for people to deal with individuals.
The text was updated successfully, but these errors were encountered: