Genome-wide selection scans #232

jeromekelleher · 2020-09-02T14:30:23Z

This is an umbrella issue covering a significant area of functionality, similar to #226 and #67.

Windowing along the genome #229
PBS selection statistic #230
Garud H haplotype diversity statistic #231

The goal here is to be able to run these selection scans at interactive speed on large datasets, for multiple population samples, and where each population sample has 100-1000 individuals. Note that this means we also need methods for grouping samples (#224).

hammer · 2020-09-02T16:48:11Z

Some additional details on our discourse: https://discourse.pystatgen.org/t/genome-wide-selection-scans/90

alimanfoo · 2020-09-03T17:12:34Z

Many thanks @jeromekelleher for raising this. Just to mention that I'm currently working on genome-wide selection scans in our mosquito data (Ag1000G phase 2), and thought it might be interesting to give a flavour of the data. I'm currently running scans using four methods, PBS, H12 (Garud H), iHS and XPEHH, on each of 14 populations. Because of >20 years of intense mosquito control efforts, there are some very strong signals, some of which are driven by genes we already know about, others we don't. Here's what the scans look like for one of these populations:

Here are some plots zoomed in on a known insecticide resistance gene (Cyp6p3), which is under selection in multiple mosquito populations:

tomwhite · 2020-11-05T13:28:38Z

With #368 we can now compute PBS for windows defined along the genome. I have started to port @alimanfoo's PBS notebook (using scikit-allel) to sgkit.
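For readers unfamiliar with the statistic, here is a minimal NumPy sketch of the standard PBS definition: each pairwise Fst is converted to a divergence time T = -log(1 - Fst), and the branch length for the focal cohort is (T_AB + T_AC - T_BC) / 2. The function name `pbs_from_fst` and the Fst values are made up for illustration. This is not the sgkit or scikit-allel implementation, and it skips details such as how Fst is estimated per window and any normalisation.

```python
import numpy as np


def pbs_from_fst(fst_ab, fst_ac, fst_bc):
    """PBS for focal cohort A from per-window pairwise Fst values.

    Hypothetical helper for illustration only; not an sgkit function.
    """
    # Convert each Fst to a divergence time T = -log(1 - Fst).
    t_ab = -np.log1p(-np.asarray(fst_ab, dtype=float))
    t_ac = -np.log1p(-np.asarray(fst_ac, dtype=float))
    t_bc = -np.log1p(-np.asarray(fst_bc, dtype=float))
    # Length of the branch leading to A in the three-cohort tree.
    return (t_ab + t_ac - t_bc) / 2


# Made-up Fst values for three windows.
fst_ab = np.array([0.10, 0.50, 0.20])
fst_ac = np.array([0.12, 0.55, 0.20])
fst_bc = np.array([0.05, 0.05, 0.20])

pbs = pbs_from_fst(fst_ab, fst_ac, fst_bc)
print(pbs)
```

The second window, where A is strongly differentiated from both B and C while B and C remain close, gets the largest value.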

Here is an example for one cohort triple: https://nbviewer.jupyter.org/github/tomwhite/shiny-train/blob/sgkit/notebooks/gwss/sgkit_pbs_3cohorts.ipynb. The values calculated by sgkit for this triple match those from scikit-allel for chromosome X, which is the only one I compared. (The other chromosomes would require more work to compare since the windowing is complicated by how the two arms of chromosomes 2 and 3 are stitched together, and also some special-casing due to an inversion.)

I also created a notebook to calculate all cohort triples in one computation, but I realized that it's not possible with the current approach in sgkit. It comes down to the way that variants are filtered out if they are not segregating.

I thought we could apply these operations in order: i) restrict to segregating sites, ii) window, iii) compute PBS for all cohort triples. However, i) doesn't work since segregating sites are defined for each cohort triple, not for the whole set of samples in one go. This in turn means that the windows are different for each cohort triple, which means that we can't just apply one set of windows before calculating PBS.
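To make the problem concrete, here is a small NumPy sketch with made-up allele counts. It assumes a variant counts as segregating for a triple when it is polymorphic in the union of the three cohorts (the notebooks may use a different rule), and that there are no missing calls. The array names and the helper `segregating_mask` are hypothetical.

```python
from itertools import combinations

import numpy as np

# Made-up alternate-allele counts: rows are variants, columns are cohorts.
alt_counts = np.array(
    [
        [0, 0, 3, 0],
        [2, 0, 0, 0],
        [0, 0, 0, 0],
        [1, 4, 2, 5],
        [10, 10, 10, 0],
    ]
)
# Alleles called per cohort (assumed constant across variants).
n_alleles = np.array([10, 10, 10, 10])


def segregating_mask(cohorts):
    """True where a variant is polymorphic in the union of the given cohorts."""
    cohorts = list(cohorts)
    alt = alt_counts[:, cohorts].sum(axis=1)
    total = n_alleles[cohorts].sum()
    return (alt > 0) & (alt < total)


# One mask per cohort triple.
masks = {
    triple: segregating_mask(triple) for triple in combinations(range(4), 3)
}
for triple, mask in masks.items():
    print(triple, np.flatnonzero(mask))
```

With these counts every triple keeps a different set of variants, so windows built from a fixed number of segregating variants cannot line up across triples.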

One fix would just be to compute each desired triple using the approach in my first notebook - although it's very inefficient at the moment, and a lot of the computation could be shared.

More generally, it might be possible to add a dimension to windows so that each cohort triple can have a different set of windows. I'm also wondering if rather than filtering out non-segregating sites and saving to disk, it would be possible to have a mask which the windowing functions are aware of so they can skip masked sites when constructing and using windows. (I hit the same issue as #299 when filtering out non-segregating sites - cc @eric-czech.)
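As a rough illustration of the mask idea, the sketch below builds windows that each contain a fixed number of unmasked variants, while keeping the window bounds as indices into the unfiltered variants axis. Both helper functions are hypothetical and written in plain NumPy; nothing here is an existing sgkit API.

```python
import numpy as np


def masked_window_bounds(mask, size):
    """Start/stop indices into the unfiltered variants axis.

    Each window contains exactly `size` unmasked variants; a trailing
    partial window is dropped. Hypothetical helper, not an sgkit function.
    """
    kept = np.flatnonzero(mask)
    n_windows = len(kept) // size
    starts = kept[0 : n_windows * size : size]
    stops = kept[size - 1 : n_windows * size : size] + 1
    return starts, stops


def masked_window_mean(values, mask, size):
    """Per-window mean of `values`, skipping masked variants."""
    starts, stops = masked_window_bounds(mask, size)
    return np.array(
        [values[a:b][mask[a:b]].mean() for a, b in zip(starts, stops)]
    )


# True marks a variant to keep (for example, segregating in one triple).
mask = np.array([True, False, True, True, False, False, True, True, True])
values = np.arange(9, dtype=float)

starts, stops = masked_window_bounds(mask, size=2)
means = masked_window_mean(values, mask, size=2)
print(starts, stops, means)
```

Because the bounds refer to the original variants axis, a different mask per cohort triple would only change the bounds, not the stored data, which is the appeal of the extra windows dimension.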

jeromekelleher · 2020-11-05T15:58:53Z

Looks good, thanks @tomwhite. I wonder if windowing by physical distance rather than by number of variants would help here? I think the current way we're doing it is very restrictive, and not actually what users will typically want for analysis.
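A minimal sketch of what windowing by physical distance could look like, assuming sorted 1-based positions on a single contig and fixed-width, non-overlapping windows. The function `position_windows` is hypothetical and uses only NumPy; it is not the sgkit windowing API.

```python
import numpy as np


def position_windows(positions, window_bp):
    """Index ranges for fixed-width windows over sorted positions.

    Returns the start coordinate of each window and the start/stop
    indices into the variants axis. Hypothetical helper for illustration.
    """
    positions = np.asarray(positions)
    # Window edges at 1, 1 + window_bp, ... with the last edge past the final variant.
    edges = np.arange(1, positions.max() + window_bp + 1, window_bp)
    # The first variant at or after each edge marks a window boundary.
    idx = np.searchsorted(positions, edges, side="left")
    return edges[:-1], idx[:-1], idx[1:]


# Made-up variant positions on one contig.
positions = np.array([3, 8, 15, 16, 42, 58, 59])
win_start_bp, starts, stops = position_windows(positions, window_bp=20)
counts = stops - starts
print(win_start_bp, counts)
```

The window coordinates here depend only on the contig, not on which variants survive filtering, so the same windows apply to every cohort triple. The trade-off is that windows can hold very different numbers of variants, including none.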

tomwhite · 2020-11-30T09:52:48Z

Closing this now that we have sgkit notebooks for doing genome-wide selection scans on mosquito data using Garud H and PBS: https://github.com/tomwhite/shiny-train/tree/sgkit/notebooks/gwss

kullrich · 2022-01-21T17:45:09Z

Hi,
I was wondering whether you plan to implement the iHS and XPEHH statistics in sgkit and, as part of that, unphased versions of them as well?

cggh/scikit-allel#374

Best regards

Kristian

hammer · 2022-01-21T18:27:40Z

Hey @kullrich,

Thanks for the request! If you want to file an issue here with more details, we'll take a look.

Regards,
Jeff

hammer added the core operations label Sep 3, 2020

This was referenced Sep 8, 2020

Update popgen stats to compute for all variants #238

Closed

Naming convention for popgen stats and variables #239

Closed

tomwhite mentioned this issue Nov 10, 2020

Garud H statistics #378

Merged

tomwhite closed this as completed Nov 30, 2020

tomwhite mentioned this issue Jun 8, 2021

Example popgen notebook #602

Open
