Skip to content

Add 'PBS()' #2717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
atongsa opened this issue Feb 27, 2023 · 11 comments
Open

Add 'PBS()' #2717

atongsa opened this issue Feb 27, 2023 · 11 comments
Labels
enhancement New feature or request Python API Issue is about the Python API

Comments

@atongsa
Copy link

atongsa commented Feb 27, 2023

PBS (population branch statistics) 1 is derived from Fst, which can detect positive selection signals in admixted populations .

the formula is

T = - log(1 – FST)

PBS = (T_TH + T_TD − T_HD) / 2

Footnotes

  1. Yi, X. et al. Sequencing of 50 Human Exomes Reveals Adaptation to High Altitude. Science 329, 75–78 (2010).

@jeromekelleher
Copy link
Member

A pull request would be very welcome if you'd like to contribute @atongsa! If you're interested, we'd be happy to provide some concrete pointers on where to start. (The development docs should get you started, in the first instance)

@jeromekelleher jeromekelleher added enhancement New feature or request Python API Issue is about the Python API labels Feb 27, 2023
@atongsa
Copy link
Author

atongsa commented Feb 28, 2023

thank you, sir. I am learning the development docs and will try to add a PBS() function to tskit.

@atongsa
Copy link
Author

atongsa commented Mar 2, 2023

i imagine the ts.PBS() logic is like,

  1. read vcf and 3 pops

  2. use the ts.fst() to get 3 pair fst and then calculate the PBS for all 3 branch

but i dont find ways to read vcf into tree sequence (just write_vcf) in tskit document, can you give me some prompt

@jeromekelleher
Copy link
Member

As it's a "derived" statistic, I would imagine the implementation would look something like that of Fst. However, it may be that it's not as complicated as that.

Perhaps you could try implementing this as a standalone function based on some simple simulations, and paste your code in here?

We wouldn't be working with VCF here, just tskit tree sequence objects.

@hyanwong
Copy link
Member

hyanwong commented Mar 7, 2023

but i dont find ways to read vcf into tree sequence (just write_vcf) in tskit document, can you give me some prompt

Just FYI @atongsa, there is a long and complicated "inference" procedure to try and construct a (sometimes) reasonable tree sequence from the much more limited genetic variation data present in a VCF (e.g. via tsinfer). Moreover, even once you manage to infer a tree sequence, to calculate branch length stats like Fst, you need to date the tree sequence too (e.g. via tsdate).

So to implement stats calculations in tskit, you'll want to work with directly simulated tree sequences, not VCF data.

@jeromekelleher
Copy link
Member

Also, if you do have VCF data then you'd probably be better off using something like sgkit which has an implementation of PBS

@atongsa
Copy link
Author

atongsa commented Mar 23, 2023

i write a demo function, and i will improve it later for sites and windows calculation (because i don't understand some function in trees.py, i need some time to learn), and then try to make a tskit.pbs() pull request

def pbs(ts, test, ref1, ref2):
    """
    Calculates the population branch statistic (PBS) between a test population and two reference populations.
    Parameters
    ----------
    ts: tree sequence
    test : numpy.ndarray, nodes as the focus population
    ref1 : numpy.ndarray, nodes as the far away population
    ref2 : numpy.ndarray, nodes as the sister group of focus population
    Returns
    -------
        PBS values of the focus population
    """
    # Calculate pairwise Fst values
    fst_test_ref1 = ts.Fst([test, ref1])
    fst_test_ref2 = ts.Fst([test, ref2])
    fst_ref1_ref2 = ts.Fst([ref1, ref2])
    # Calculate PBS
    pbs = (-math.log(1-fst_test_ref1, 10) + -math.log(1-fst_test_ref2, 10) - -math.log(1-fst_ref1_ref2, 10)) / 2
    return pbs

the test code below get a return value 0.1198192924722683/2

import math
import tskit
import numpy as np

ts = msprime.sim_ancestry(3, recombination_rate=0.1, sequence_length=10, random_seed=42)
mts = msprime.sim_mutations(ts, rate=0.1, random_seed=42)

p1 = mts.samples()[[0,5]]
p2 = mts.samples()[[2,3]]
p3 = mts.samples()[[1,4]]

pbs(mts,p1,p2,p3)

@petrelharp
Copy link
Contributor

For reference, here's the definition of PBS:
Screenshot from 2023-03-23 08-48-14

The code looks right, although I think you're missing a factor of 2? And, it could be slightly more concise if you used the indexes argument to Fst.

(It's also interesting to note that PBS is framed almost in terms of just a branch length; it's worth thinking about what a tree-based analogous thing would be, although I'm not suggesting that for this issue.)

This was referenced Apr 28, 2023
@petrelharp petrelharp mentioned this issue Jul 6, 2023
3 tasks
@petrelharp
Copy link
Contributor

Working through the motivation and implementation in #2777 here: so, the statistic is defined in terms of some variables called T; these are -log(1-Fst), which are "divergence time in units scaled by the population size"; later on in the document it says that the log transform "places branches of different magnitudes on the same scale". In the paper they say they computed Fst using the method of Reynolds; OF COURSE, Reynolds et al do not actually use the notation "Fst" anywhere. However, they discuss estimators of the 'coancestry coefficient', which is defined to be the probability of identity between two alleles drawn from the same population, and provide a ratio of variances estimate for this; this is (one of the usual definitions of) Fst. And, they have this:
Screenshot from 2023-07-06 08-24-21

The main question I have at this point is whether we should really be using T = -log(1-Fst) in our definition of PBS or if we should be using T = divergence.

Now, our Fst is implemented as

Fst = 1 - 2 * (d(X) + d(Y)) / (d(X) + 2 * d(X, Y) + d(Y))

and so letting d(X+Y) = (d(X) + 2 * d(X, Y) + d(Y)),

-log(1 - Fst) = -log(2 * (d(X) + d(Y)) / d(X+Y))
  = log(d(X+Y)) - log(d(X) + d(Y)) - log(2)

and so plugging this in to the definition of PBS we'd get that

PBS(T,H,D) = (
     log(d(T+H)) - log(d(T) + d(H)) - log(2)
     + log(d(T+D)) - log(d(T) + d(D)) - log(2)
     - log(d(H+D)) + log(d(H) + d(D)) + log(2)
   ) / 2
 = log( 
     ( d(T+H) * d(T+D) * (d(H) + d(D)) )
     / ( 2 * (d(T) + d(H)) * (d(T) + d(D)) * d(H+D) )
) / 2

I'm not seeing any particular simplification or insight here, just writing it down.
However, notice that since (as defined here) Fst is a unitless quantity, PBS is as well, even though it's defined with some variables named T which suggest "time". In the last expression we see this because it's the log of a ratio of three products of times. This is quite different to what we'd get if we used divergence directly for T.

Okay - I think defining PBS in terms of Fst is what we ought to do - at least, it agrees with the literature. It would be interesting to look at the other statistic we'd get by just putting in divergences (I think that's been called a "y-statistic"?), but that's a separate topic.

@currocam
Copy link
Contributor

Hi,
I'm interested in simulating PBS values with msprime under different demographic histories. I searched for whether such a thing was possible without converting into VCF and found this issue. I saw that the PRs to include the method in tskit were closed due to inactivity recently (#2777)

I'm willing to do such PR myself, but I'm unsure about how the Fst of tskit relates with the Reynold's one (the one mentioned in the original paper). Then, I read #858 , got confused and I don't know if it would make sense to implement it in tskit anymore.

My understanding is that the PBS is calculated from samples of three different populations A, B and C such:

demography = msprime.Demography()
demography.add_population(name="Ancient", initial_size=4_000)
demography.add_population(name="A", initial_size=1_000)
demography.add_population(name="B", initial_size=1_000)
demography.add_population(name="AB", initial_size=2_000)
demography.add_population(name="C", initial_size=2_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="AB")
demography.add_population_split(time=6_000, derived=["AB", "C"], ancestral="Ancient")
# Simulate dataset
ts = msprime.sim_ancestry(
    samples={"A": 25, "B": 25, "C" : 25},
    demography=demography,
    sequence_length=1e8,
    recombination_rate=1e-8,
    random_seed=12
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=12)

Demography

If such PBS method would be added, I would expect it to be equivalent to the following:

# Find samples ids of populations A, B and C
# These are indexes 1, 2 and 4 here
samples_a = [node.id for node in ts.nodes() if node.is_sample() and node.population == 1]
samples_b = [node.id for node in ts.nodes() if node.is_sample() and node.population == 2]
samples_c = [node.id for node in ts.nodes() if node.is_sample() and node.population == 4]
# Compute matrix of Fst
fsts = ts.Fst(
    [samples_a, samples_b, samples_c],
    indexes=[(0, 1), (0, 2), (1, 2)],
    windows="sites", 
    mode='site',
    span_normalise=False
)
assert fsts.shape == (ts.num_sites, 3)
Ts = -np.log(1-fsts)
tskit_pbs = 0.5*(Ts[:, 0]+Ts[:, 1] - Ts[:, 2])

However, when compared to the ones you would calculate with something like sgkit (here using the default Hudson Fst estimator, I think), the values don't match very well. I guess this is expected, as Fst's estimators are different, but it was somewhat surprising to me (and perhaps to others).

sg_versus_tskit

I've attached a jupyter notebook in pdf with all the necessary code to reproduce.
pbs_tskit.pdf

@petrelharp
Copy link
Contributor

Hi, @currocam! It would be great to get PBS (in some form) into tskit. I think the choices to made are:

  1. whether to use Fst or not (and if not, whether to name it something else);
  2. if using Fst, which version to use;
  3. how to leave the door open for future versions (i.e., different choices for (1) and (2)) in the API.

I think my preliminary answers would be:

  1. The conclusion my former self came to above was to use Fst (or, make up a new statistic and call it something else).
  2. This is really a separate issue: as part of implementing PBS one might choose to also implement other definitions of Fst. Note that sgkit.pbs also provides the option for using different definitions of Fst (so, you could presumably use that to get something that agrees with your calculation using tskit).
  3. Probably an argument method=, as we suggest also in Clarify which version of Fst we compute #858?

It looks to me like you've probably got the relevant code, so this would be straightforward? If you want to do this, the first step is to write out a careful proposal, summarizing the discussion above, explaining how and why you'd compute and what the API is, either in this thread or in a new PR (with e.g., just the docstring). And either also implment the method="reynolds" argument to Fst or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Python API Issue is about the Python API
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants