Skip to content

Fst function #225

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jeromekelleher opened this issue Sep 1, 2020 · 9 comments
Closed

Fst function #225

jeromekelleher opened this issue Sep 1, 2020 · 9 comments
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.

Comments

@jeromekelleher
Copy link
Collaborator

jeromekelleher commented Sep 1, 2020

Fst is one of the fundamental building blocks of population structure analysis. We will want to compute Fst between pairs of populations, with the "population" labels designated by some properties of the input dataset (see #224).

There are a number of different estimators for Fst (scikit-allel implements 3), so we should provide a method to specify the estimator for the statistic as a parameter. I suggest something like the following:

def Fst(ds, *, estimator=None, **kwargs):
    if estimator = None:
        estimator = "hudson"
    estimator_map = {
        "hudson": hudson_Fst,
        "weir_cockerham": wc_Fst,
        "patterson": patterson_Fst
    } 
    return estimator_map[estimator](ds, **kwargs)

These correspond to the three definitions in scikit-allele. We may not want all three initially, and just implementing the Hudson estimator may be sufficient. We can test our implementations by comparing with scikit-allele and tskit

(ps. I prefer to use None as the default value for estimator, as there may be situations in the future where we might prefer to have a different default depending on properties of the dataset. If we leave estimator="hudson" in the signature, then there's no way to tell if the user just wants the default or has specifically asked for "hudson". In general, unless we're totally sure that the default is never going to change, I think it's better to use None as the default value in the signature.)

@jeromekelleher
Copy link
Collaborator Author

ps. Also I find having a capital letter in the function name ugly, but I'm not sure what the alternative is for these popgen stats.

@jeromekelleher
Copy link
Collaborator Author

This should be done once the basic popgen stats are implemented (#100).

@hammer hammer added the core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. label Sep 1, 2020
@timothymillar
Copy link
Collaborator

I'd like to add the estimator described by Harris and DeGiorgio (2016) to the list at some point. Their estimator is based on Hudson's but attempts to correct for related and inbred individuals using kinship.

@tomwhite
Copy link
Collaborator

tomwhite commented Sep 3, 2020

Also I find having a capital letter in the function name ugly, but I'm not sure what the alternative is for these popgen stats.

Just using lowercase would be an option (scikit-allel does).

@tomwhite
Copy link
Collaborator

I noticed that the implementation of Fst we have at the moment (based on tskit) is defined as

1 - 2 * (d(X) + d(Y)) / (d(X) + d(Y) + 2 * d(X, Y))

whereas scikit-allel's hudson_fst function defines it as

1 - (d(X) + d(Y)) / (2 * d(X, Y))

Does tskit's Fst correspond to one of the other implementations from scikit-allel, or is it a separate one entirely?

Should we change Fst in sgkit to return results that are the same as scikit-allel's hudson_fst function by default? This may be important to get agreement with PBS which uses Fst in its calculation.

@jeromekelleher
Copy link
Collaborator Author

Good question @tomwhite! I've opened an issue on tskit (tskit-dev/tskit#858) to clarify this.

I'm guessing that we should be using Hudson's estimator in sgkit by default (see above for suggested notation for specifying different estimators), and so we should change to using scikit-allel's definition (and test against hudson_fst)

@tomwhite
Copy link
Collaborator

tomwhite commented Oct 1, 2020

Thanks @jeromekelleher. I've opened #292 to implement this. What name do you think we should use for the tskit estimator? The linked discussion had a lot of names!

@jeromekelleher
Copy link
Collaborator Author

Sounds like Nei1986 is the right name for it, but we'll need to confirm that. Maybe we could have a comment in the code linking to the tskit issue?

@tomwhite
Copy link
Collaborator

Closing this, since we have Fst now, from #100 and #292

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.
Projects
None yet
Development

No branches or pull requests

4 participants