Fst function #225

jeromekelleher · 2020-09-01T15:39:33Z

Fst is one of the fundamental building blocks of population structure analysis. We will want to compute Fst between pairs of populations, with the "population" labels designated by some properties of the input dataset (see #224).

There are a number of different estimators for Fst (scikit-allel implements 3), so we should provide a method to specify the estimator for the statistic as a parameter. I suggest something like the following:

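def Fst(ds, *, estimator=None, **kwargs):
    # None means "use the default", which is currently Hudson's estimator.
    if estimator is None:
        estimator = "hudson"
    # Map each estimator name to the function that implements it.
    estimator_map = {
        "hudson": hudson_Fst,
        "weir_cockerham": wc_Fst,
        "patterson": patterson_Fst,
    }
    return estimator_map[estimator](ds, **kwargs)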

These correspond to the three definitions in scikit-allel. We may not want all three initially, and just implementing the Hudson estimator may be sufficient. We can test our implementations by comparing with scikit-allel and tskit.

(ps. I prefer to use None as the default value for estimator, as there may be situations in the future where we might prefer to have a different default depending on properties of the dataset. If we leave estimator="hudson" in the signature, then there's no way to tell if the user just wants the default or has specifically asked for "hudson". In general, unless we're totally sure that the default is never going to change, I think it's better to use None as the default value in the signature.)
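A minimal sketch of that point, assuming placeholder estimators and a made-up dataset flag (`prefer_alternative`, `hudson_fst`, `alternative_fst` and the dict-based `ds` are all hypothetical and not part of sgkit):

```python
# Placeholder estimators: hypothetical stand-ins that only report their name.
def hudson_fst(ds, **kwargs):
    return "hudson"


def alternative_fst(ds, **kwargs):
    return "alternative"


ESTIMATORS = {"hudson": hudson_fst, "alternative": alternative_fst}


def fst(ds, *, estimator=None, **kwargs):
    if estimator is None:
        # No explicit choice was made, so the default can depend on the data.
        # "prefer_alternative" is an invented dataset property for illustration.
        estimator = "alternative" if ds.get("prefer_alternative") else "hudson"
    if estimator not in ESTIMATORS:
        raise ValueError(f"Unknown Fst estimator: {estimator!r}")
    return ESTIMATORS[estimator](ds, **kwargs)


ds = {"prefer_alternative": True}
print(fst(ds))                      # default chosen from the dataset
print(fst(ds, estimator="hudson"))  # explicit request is always honoured
```

With estimator="hudson" in the signature, the two calls above would be indistinguishable inside the function.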


jeromekelleher · 2020-09-01T15:40:42Z

ps. Also I find having a capital letter in the function name ugly, but I'm not sure what the alternative is for these popgen stats.

jeromekelleher · 2020-09-01T15:43:44Z

This should be done once the basic popgen stats are implemented (#100).

timothymillar · 2020-09-03T09:45:09Z

I'd like to add the estimator described by Harris and DeGiorgio (2016) to the list at some point. Their estimator is based on Hudson's but attempts to correct for related and inbred individuals using kinship.

tomwhite · 2020-09-03T14:11:46Z

> Also I find having a capital letter in the function name ugly, but I'm not sure what the alternative is for these popgen stats.

Just using lowercase would be an option (scikit-allel does).

tomwhite · 2020-09-22T09:33:55Z

I noticed that the implementation of Fst we have at the moment (based on tskit) is defined as

1 - 2 * (d(X) + d(Y)) / (d(X) + d(Y) + 2 * d(X, Y))

whereas scikit-allel's hudson_fst function defines it as

1 - (d(X) + d(Y)) / (2 * d(X, Y))

Does tskit's Fst correspond to one of the other implementations from scikit-allel, or is it a separate one entirely?

Should we change Fst in sgkit to return results that are the same as scikit-allel's hudson_fst function by default? This may be important to get agreement with PBS which uses Fst in its calculation.
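To make the difference concrete, here is a small sketch that evaluates both expressions on made-up diversity values (the function names and the numbers are illustrative only and are not taken from either library):

```python
def fst_tskit_form(d_x, d_y, d_xy):
    # 1 - 2 * (d(X) + d(Y)) / (d(X) + d(Y) + 2 * d(X, Y))
    return 1 - 2 * (d_x + d_y) / (d_x + d_y + 2 * d_xy)


def fst_hudson_form(d_x, d_y, d_xy):
    # 1 - (d(X) + d(Y)) / (2 * d(X, Y))
    return 1 - (d_x + d_y) / (2 * d_xy)


# Invented within-population (d_x, d_y) and between-population (d_xy) values.
d_x, d_y, d_xy = 0.1, 0.2, 0.3
print(fst_tskit_form(d_x, d_y, d_xy))   # approximately 0.33
print(fst_hudson_form(d_x, d_y, d_xy))  # approximately 0.5

# Both forms are (approximately) zero when d(X, Y) equals the mean of d(X) and d(Y).
print(fst_tskit_form(0.1, 0.2, 0.15), fst_hudson_form(0.1, 0.2, 0.15))
```

So the two definitions agree on when there is no differentiation, but give different magnitudes otherwise.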

jeromekelleher · 2020-09-23T07:46:50Z

Good question @tomwhite! I've opened an issue on tskit (tskit-dev/tskit#858) to clarify this.

I'm guessing that we should be using Hudson's estimator in sgkit by default (see above for the suggested way of specifying different estimators), and so we should change to scikit-allel's definition (and test against hudson_fst).

tomwhite · 2020-10-01T14:08:32Z

Thanks @jeromekelleher. I've opened #292 to implement this. What name do you think we should use for the tskit estimator? The linked discussion had a lot of names!

jeromekelleher · 2020-10-01T15:16:15Z

Sounds like Nei1986 is the right name for it, but we'll need to confirm that. Maybe we could have a comment in the code linking to the tskit issue?

tomwhite · 2020-10-15T13:31:06Z

Closing this, since we now have Fst from #100 and #292.

hammer added the "core operations" label (Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.) Sep 1, 2020

This was referenced Sep 1, 2020

Requirements for population structure analysis #226

Open

PBS selection statistic #230

Closed

This was referenced Sep 8, 2020

Update popgen stats to compute for all variants #238

Closed

Update Fst and Tajima's D functions to use grouping convention #240

Closed

tomwhite mentioned this issue Oct 1, 2020

Implement Hudson estimator for Fst #292

Closed

tomwhite closed this as completed Oct 15, 2020
