diversity calculation with vs without replacement #961
Comments
I don't have much of an opinion on whether we need this, but in general I'd rather not have a profusion of similarly named functions which do almost the same thing, so I'd be more in favour of adding parameters to control the properties of these functions. For things like Fst, there's a million and one different ways of computing this, so I think we'll want to have a keyword argument ... My personal preference would be that the functions compute the mathematical definition of the statistic by default, and to make the various statistical estimators available using this approach.
Ah ha. I think you're aware of this, but restating: the difference here is whether you treat diversity as a property of the sample or as a property of a larger population that we're using the sample to estimate. The way we're currently defining most of our statistics, what we have is an unbiased estimator of the population quantity. That seemed like a nice property to me, as a statistician, since we are usually working with small samples and not the entire population. This is analogous to the factor of n or n-1 in the standard deviation: the statistician's standard deviation has an n-1 in the denominator, since that's what gives you an unbiased estimator of the population variance, but some people leave this out, because they're just computing the SD of the, like, list of numbers they've got. So, on theoretical grounds, I'm strongly in favor of the version we've got. Suppose you have simulated a big population, and the diversity in this population is ... If there really are people wanting to compute the other version, then sure, we could provide a ... Rather than ...
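To make the unbiasedness point concrete, here is a minimal sketch that computes the exact expectation of both versions for one biallelic site. It assumes `n` genomes are drawn independently from a very large population in which one allele has frequency `p`, so the allele count in the sample is binomial. The function name `expected_estimates` and the chosen values of `p` and `n` are illustrative assumptions, not anything from tskit.

```python
from fractions import Fraction
from math import comb


def expected_estimates(p, n):
    """Exact expected value of the with- and without-replacement statistics.

    Assumes the allele count in a sample of n genomes is Binomial(n, p).
    Returns (expected with-replacement value, expected without-replacement value).
    """
    with_repl = Fraction(0)
    without_repl = Fraction(0)
    for k in range(n + 1):
        # Probability of seeing k copies of the allele in the sample.
        prob = comb(n, k) * p**k * (1 - p) ** (n - k)
        phat = Fraction(k, n)
        # Chance two draws WITH replacement from the sample differ.
        h = 2 * phat * (1 - phat)
        with_repl += prob * h
        # Rescaling by n / (n - 1) gives the WITHOUT replacement version.
        without_repl += prob * h * Fraction(n, n - 1)
    return with_repl, without_repl


p = Fraction(3, 10)
population_value = 2 * p * (1 - p)  # heterozygosity of the large population
biased, unbiased = expected_estimates(p, n=4)
print(population_value, biased, unbiased)
```

Under these assumptions the without-replacement version averages to the population value exactly, while the with-replacement version averages to (n-1)/n times it.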
Agreed.
@petrelharp good you mention this because for whatever reason I haven't been looking at it this way ... I've been fixated on different terms ("nucleotide diversity" vs "gene diversity") and formal math definitions I see in different papers and books. I guess in some way I've been viewing this not like a statistician in the real world, where data is messy and numbers don't line up perfectly, but like an accountant or mathematician who expects numbers and expressions to line up perfectly.
good data point on what some proportion of users might think
I'll do a quick survey of what various other libraries are doing regarding sample vs population variance and include the results next. It's a little surprising! Just as some data points for further discussion.
OK, here are some data points on other libraries. I'll add my own commentary separately. This is just for our own info. Perhaps the library "closest" in some sense to tskit is numpy, which has ... https://numpy.org/doc/stable/reference/generated/numpy.var.html Then the builtin ... Now here's what I found really interesting: The ... In ... Everybody appears to do a great job doing something different! 😄
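As a small illustration of the sample-versus-population split, the sketch below uses Python's standard-library `statistics` module, which exposes both conventions under separate names. It is offered only as an example and is not necessarily one of the libraries surveyed in this comment; the data values are made up.

```python
import statistics

# Made-up data with mean 5 and sum of squared deviations 32.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

# Population variance: divides the sum of squared deviations by n.
population_var = statistics.pvariance(data)

# Sample variance: divides by n - 1 (the unbiased estimator).
sample_var = statistics.variance(data)

print(population_var, sample_var)
```

The two results differ by exactly the factor n/(n-1), which is the same factor that separates the with- and without-replacement versions of diversity. For comparison, `numpy.var` divides by n unless `ddof=1` is passed.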
My straight to the point summary of what I see now is that issue #961 should be resolved with documentation clarifications only, and NOT by adding code to the ...
@petrelharp's question here seems the key question. From what @petrelharp has shared I'm guessing the vast majority of users want the unbiased estimator. The evidence I've been thinking of is mainly what I've written up in #858 in how the ... Below are some verbose thoughts to share for whoever might find them interesting. It's part of the reason I think it's best to just stick to documentation clarifications for now. <verbose> From what I'm learning from @petrelharp, I suspect the ... In the case of ...
BTW, in the numpy-versus-R distinction for how to compute variance, my opinion is that R does clearly the correct thing - there's a good reason that's the formula given in all introductory stats textbooks. I think the numpy authors are coming from physics? I was actually going to bring this up as an example, but I'd mis-remembered that numpy had fixed this glaring error.
@petrelharp I go with your sense of what is clearly correct as what the majority of users of tskit ... that said ... 😁 ... I will defend the choice of defining "variance" as the population version and not the sample version. I wish I could do that over beers or lunch or something fun. I'll just have to settle for email. 😢
I believe this has been sorted out, if not here then in other discussions since.
I was tempted to title this issue 'diversity, equality and inclusion' because the following is actually all about diversity, equality and inclusion! But I figured that would just be mean to all the people who click through expecting something TOTALLY different 🤣. Anyhoo ...
I think there is a risk the current naming and behaviour of the function `diversity` can generate undesirable results. I suspect that, because different definitions use the word 'diversity', it's natural to choose this function to calculate "expected heterozygosity". But I believe this will give incorrect results.
As a concrete example case, consider one biallelic locus and the following population of two heterozygous diploid individuals: { Aa, Aa }. Per Wikipedia, the expected heterozygosity is 1/2 but the `diversity` function will return 2/3.
I thought @petrelharp's definition of heterozygosity was a great framework:
... are not equal. I think the crux of the issue is whether this random choosing is with vs without replacement in a definition of 'diversity'. Or in other terms, whether `(n-1)` or just `n` is used in the calculation.
Per Wikipedia and https://doi.org/10.1534/genetics.120.303253 🥇 👍 'nucleotide diversity' is withOUT replacement. But looking at "Genetic Data Analysis II" by Weir I see 'gene diversity' defined WITH replacement. Ditto for Nei (doi://10.1073/pnas.70.12.3321) where 'gene diversity' is defined WITH replacement. This definition of 'gene diversity' (with replacement) does equal 'expected heterozygosity'.
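The { Aa, Aa } example can be checked by brute force. The sketch below enumerates ordered pairs of sampled genomes with and without replacement and counts how often the two alleles differ. The helper name `pairwise_difference` is hypothetical and this is plain Python, not a call into tskit.

```python
from fractions import Fraction
from itertools import permutations, product


def pairwise_difference(alleles, with_replacement):
    """Probability that two randomly chosen alleles differ."""
    n = len(alleles)
    if with_replacement:
        # Every ordered pair of positions, including a copy paired with itself.
        pairs = list(product(range(n), repeat=2))
    else:
        # Ordered pairs of distinct positions only.
        pairs = list(permutations(range(n), 2))
    differing = sum(1 for i, j in pairs if alleles[i] != alleles[j])
    return Fraction(differing, len(pairs))


# Two heterozygous diploids {Aa, Aa} contribute four sampled genomes.
alleles = ["A", "a", "A", "a"]
with_repl = pairwise_difference(alleles, with_replacement=True)
without_repl = pairwise_difference(alleles, with_replacement=False)
print(with_repl, without_repl)  # 1/2 2/3
```

With replacement, 8 of the 16 ordered pairs differ (1/2). Without replacement, the same 8 differing pairs are divided among only 12 pairs (2/3), which is the n/(n-1) = 4/3 rescaling.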
Finally, on a third but less relevant point, I think the formula using `n` instead of `n-1` has superior mathematical properties that are lost when using `n-1` in the denominator. I can write up more details if anyone is interested.
Some enhancements I can throw out:
A) add a flag to `diversity` to flip between with vs without replacement
B) have two functions named `nucleotide_diversity` and `gene_diversity` for the without and with replacement versions respectively
I think A) is more likely to lead to confusion, so B) seems a better long-term solution. B) more clearly reuses terms already in use, whereas I suspect a flag name for A) would not use existing terminology.
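For comparison, the two proposals might look roughly like this as toy single-site functions working from a list of sampled alleles. The names `gene_diversity` and `nucleotide_diversity` come from option B, while the `with_replacement` flag name is a hypothetical stand-in for option A. None of this reflects tskit's actual signatures or implementation.

```python
from collections import Counter
from fractions import Fraction


def gene_diversity(alleles):
    """Option B, with replacement: 1 - sum of squared allele frequencies."""
    n = len(alleles)
    counts = Counter(alleles)
    return 1 - sum(Fraction(c, n) ** 2 for c in counts.values())


def nucleotide_diversity(alleles):
    """Option B, without replacement: rescale by n / (n - 1); needs n >= 2."""
    n = len(alleles)
    return gene_diversity(alleles) * Fraction(n, n - 1)


def diversity(alleles, with_replacement=False):
    """Option A: one function with a hypothetical flag selecting the version."""
    if with_replacement:
        return gene_diversity(alleles)
    return nucleotide_diversity(alleles)


sample = ["A", "a", "A", "a"]  # the {Aa, Aa} example
print(diversity(sample), diversity(sample, with_replacement=True))  # 2/3 1/2
```

Either shape exposes the same two quantities; the difference is only whether the choice is carried by the function name or by an argument.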