Skip to content

diversity calculation with vs without replacement #961

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
castedo opened this issue Oct 31, 2020 · 9 comments
Closed

diversity calculation with vs without replacement #961

castedo opened this issue Oct 31, 2020 · 9 comments

Comments

@castedo
Copy link
Member

castedo commented Oct 31, 2020

I was tempted to title this issue 'diversity, equality and inclusion' because the following is actually all about diversity, equality and inclusion! But I figured that would just mean to all the people who click through expecting something TOTALLY different 🤣. Anyhoo ...

I think there is a risk the current naming and behaviour of the function diversity can generate undesirable results.

I suspect due to different definitions using the word 'diversity', it's natural to choose this function to calculate "exepected heterozygosity". But I believe this will result in incorrect results.

As a concrete example case, consider one biallelic locus and the following population of two heterozygous diploid individuals: { Aa, Aa }. Per Wikipedia, the expected heterozygosity is 1/2 but the diversity function will return 2/3.

@petrelharp's definition of heterozygosity I thought a great framework:

"what's the chance two randomly chosen alleles ..."

... are not equal. I think the crux of the issue is whether this random choosing is with vs without replacement in a definition of 'diversity'. Or in other terms, is (n-1) or just n used in the calculation.

Per Wikipedia and https://doi.org/10.1534/genetics.120.303253 🥇 👍 'nucleotide diversity' is withOUT replacement. But looking at "Genetic Data Analysis II" by Weir I see 'gene diversity' defined WITH replacement. Ditto for Nei (doi://10.1073/pnas.70.12.3321) where 'gene diversity' is defined WITH replacement. This definition of 'gene diversity' (with replacement) does equal 'expected heterzygosity'.

Finally, on a third but less relevant point, I think the formula using n instead of n-1 has superior mathematical properties that are lost when using n-1 in the numerator. I can write up more details if anyone is interested in more details.

Some enhancements I can throw out:

A) add a flag to diversity to flip between with vs without replacement

B) have two functions named nucleotide_diversity and gene_diversity for the without and with replacement versions respectively

I think A) is more likely to lead to confusion, so B) seems a better long-term solution. B) is more clearly reusing existing terms in use, whereas I suspect coming up with a flag name in A) will not used existing terminology.

@jeromekelleher
Copy link
Member

I don't have much of an opinion on whether we need this, but in general I'd rather not have a profusion of similarly named functions which do almost the same thing, so I'd be more in favour of adding parameters to control the properties of these functions. For things like Fst, there's a million and one different ways of computing this, so I think we'll want to have a keyword argument method or estimator to define how it's computed (see #858)

My personal preference would be that the functions compute the mathematical definition of the statistic by default, and to make the various statistical estimators available using this approach.

@petrelharp
Copy link
Contributor

Ah ha. I think you're aware of this, but restating: the difference here is whether you treat diversity as a property of the sample or as a property of a larger population that we're using the sample to estimate? The way we're currently defining most of our statistics, what we have is an unbiased estimator of the population quantity. That seemed like a nice property to me, as a statistician, and since we usually are working with small samples, and not the entire population. This is analogous to the factor of n or n-1 in the standard deviation - the statistician's standard deviation has a n-1 in the denominator, since that's what gives you an unbiased estimator of the population standard deviation, but some people leave this out, because they're just computing the SD of the, like, list of numbers they've got.

So, on theoretical grounds, I'm strongly in favor of the version we've got. Suppose you have simulated a big population, and the diversity in this population is pi (doesn't matter which version you use, the population is big), and you compute diversity using a subsample of size n; without replacement, the expected value of diversity of the sub-sample is pi; but if you computed it with replacement, the expected value would be pi * (1 - 1/n).

If there really are people wanting to compute the other version, then sure, we could provide a with_replacement flag. (and, it'd be easy: we'd just need to multiply by 1-1/n.) But it'd be nice to see evidence that people actually do, first? Like, is this what some other software packages do? Note that Hahn, Molecular Population Genetics, defines pi without replacement. And, beware that sometimes people will write 2 p (1-p) but then when computing this for finite samples they would quietly do the without-replacement calculation because "well of couse, that's how you get an unbiased estimator". (For instance, see the original definitions of f2 and f3.)

Rather than with_replacement, though, I'd vote for just a method= argument, common to many statistics methods, that would default to the current one.

@jeromekelleher
Copy link
Member

Rather than with_replacement, though, I'd vote for just a method= argument, common to many statistics methods, that would default to the current one.

Agreed.

@castedo
Copy link
Member Author

castedo commented Nov 2, 2020

... I think you're aware of this ... property of the sample or as a property of a larger population

@petrelharp good you mention this because for whatever reason I haven't been looking at it this way ... I've been fixated on different terms ("nucleotide diversity" vs "gene diversity") and formal math definitions I see in different papers and books. I guess in some way I've been viewing this not like a statistician in the real world where data is messy and numbers don't line up perfectly, in contrast to an accountant or mathematician where I'm expecting numbers and expressions to line up perfectly.

do the without-replacement calculation because "well of couse, that's how you get an unbiased estimator"

good data point on what some proportion users might think

is this what some other software packages do?

I'll do a quick survey of what various other libraries are doing regarding sample vs population variance and include the results next. It's a little surprising! Just as some data points for further discussion.

@castedo
Copy link
Member Author

castedo commented Nov 2, 2020

OK, here's some data points on other libraries. I'll add my own commentary separately. This is just for our own info.

Perhaps the library "closest" in some sense to tskit is numpy which has numpy.var default to population (n) variance, not sample (n-1) varaince, and then there is a ddof param which can be set to 1 to get sample variance.

https://numpy.org/doc/stable/reference/generated/numpy.var.html

Then the builtin statistics Python library does not opt for the parameter-with-default design and instead has two separately named function variance and pvariance. This is like Excel which has separately named VAR and VARP functions.

Now here's what I found really interesting:

The pandas library which is built on top of numpy has DataFrame.var also with ddof param but defaults the other way to sample variance by defaulting to ddof=1. Since pandas was inspired by R data frames perhaps the following is the reason.

In R the var() function ONLY does sample variance and doesn't have a parameter or similarly named alternative function for population variance at all.

Everybody appears to do a great job doing something different! 😄

@castedo
Copy link
Member Author

castedo commented Nov 2, 2020

My straight to the point summary of what I see now is that issue #961 should get resolved with merely clarifications in the documentation and NOT add code to the diversity function, at least for now.

If there really are people wanting to compute the other version ... it'd be nice to see evidence that people actually do, first?

@petrelharp's question here seems the key question. From what @petrelharp has shared I'm guessing the vast majority of users want the unbiased estimator. The evidence I've been thinking of is mainly what I've written up in #858 in how the Fst function is implemented and documented. Regardless of what's desirable for #858 I'm guessing that's not a normal usage scenario.

Below are some verbose thoughts to share for whoever might find it interested. It's part of the reason I think it's best to just stick to documentation clarifications for now.

<verbose>
Whether an API function calculates an unbiased estimator or the theoretical model parameter seems to me to hinge on whether the function is a component or compound. Or in more fun terms is it unsalted or salted butter. Is it a basic ingredient for the baker making bread or is it a consumable to be spread on bread to be immediately eaten.

From what I'm learning from @petrelharp, diversity is more like salted butter. It's more a compound that includes the bias adjustment because users desire the end result of an unbiased estimate. This is analagous to how R only provides the unbiased estimator of variance.

I suspect the R developers felt that the functions sum and length are the appropriate components (unsalted butter).

In the case of tskit I'm guessing the general_stat and sample_count_stat are the unsalted butter for bakers that want components without bias correction. Or maybe those aren't satisfactory and a different API interface is desirable. Regardless it seems highly suggestive that adding a parameter to diversity is not the right interface for a more general purpose reusable component function.
</verbose>

@petrelharp
Copy link
Contributor

BTW, in the numpy-versus-R distinction for how to compute variance, my opinion is that R does clearly the correct thing - there's a good reason that's the formula given in all introductory stats textbooks. I think the numpy authors are coming from physics? I was actually going to bring this up as an example, but I'd mis-remembered that numpy had fixed this glaring error.

@castedo
Copy link
Member Author

castedo commented Nov 2, 2020

@petrelharp I go with your sense of what is clearly correct as what the majority of users of tskit diversity expect. So going with your opinion I think best follows the Principle of Least Astonishment. I don't know about the numpy audience but seems like they are violating this principle too.

... that said ... 😁 ... I will defend the choice of defining "variance" as the population version and not the sample version. I wish I could do that over beers or lunch or something fun. I'll just have to settle for email. 😢

@petrelharp
Copy link
Contributor

I believe this has been sorted out, if not here then in other discussions since.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants