Genome-wide selection scans #232

Closed
3 tasks done
jeromekelleher opened this issue Sep 2, 2020 · 7 comments
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.

Comments

@jeromekelleher
Collaborator

jeromekelleher commented Sep 2, 2020

This is an umbrella issue covering a significant area of functionality, similar to #226 and #67.

The goal here is to be able to run these selection scans at interactive speed on large datasets, for multiple population samples, and where each population sample has 100-1000 individuals. Note that this means we also need methods for grouping samples (#224).

@hammer
Contributor

hammer commented Sep 2, 2020

Some additional details on our discourse: https://discourse.pystatgen.org/t/genome-wide-selection-scans/90

@hammer added the "core operations" label Sep 3, 2020
@alimanfoo
Collaborator

Many thanks @jeromekelleher for raising this. Just to mention that I'm currently working on genome-wide selection scans in our mosquito data (Ag1000G phase 2), and thought it might be interesting to give a flavour of the data. I'm currently running scans using four methods, PBS, H12 (Garud H), iHS and XPEHH, on each of 14 populations. Because of >20 years of intense mosquito control efforts, there are some very strong signals, some of which are driven by genes we already know about, others we don't. Here's what the scans look like for one of these populations:

[Image: gwss_bf_gam_gw_ug_gam_gq_gam]

Here are some plots zoomed in on a known insecticide resistance gene (Cyp6p3), which is under selection in multiple mosquito populations:

[Image: locus_cyp6p3_h12]

@tomwhite
Collaborator

tomwhite commented Nov 5, 2020

With #368 we can now compute PBS for windows defined along the genome. I have started to port @alimanfoo's PBS notebook (using scikit-allel) to sgkit.

Here is an example for one cohort triple: https://nbviewer.jupyter.org/github/tomwhite/shiny-train/blob/sgkit/notebooks/gwss/sgkit_pbs_3cohorts.ipynb. The values calculated by sgkit for this triple match those from scikit-allel for chromosome X, which is the only one I compared. (The other chromosomes would require more work to compare since the windowing is complicated by how the two arms of chromosomes 2 and 3 are stitched together, and also some special-casing due to an inversion.)
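For readers unfamiliar with the statistic, here is a minimal sketch of how PBS for one cohort triple follows from the three pairwise Fst values, using the usual definition T = -log(1 - Fst) and PBS = (T12 + T13 - T23) / 2. The function names and the Fst values are made up for illustration; this is not the sgkit or scikit-allel implementation, and it leaves out any normalisation those libraries may apply.

```python
import math


def branch_length(fst):
    """Transform Fst into a divergence estimate T = -log(1 - Fst).

    Requires fst < 1, otherwise the logarithm is undefined.
    """
    return -math.log(1.0 - fst)


def pbs(fst_12, fst_13, fst_23):
    """PBS for population 1, from pairwise Fst among populations 1, 2 and 3."""
    t_12 = branch_length(fst_12)
    t_13 = branch_length(fst_13)
    t_23 = branch_length(fst_23)
    return (t_12 + t_13 - t_23) / 2.0


# Hypothetical per-window (Fst_12, Fst_13, Fst_23) values for one cohort triple.
window_fst = [(0.05, 0.06, 0.04), (0.40, 0.45, 0.05)]
window_pbs = [pbs(*fst) for fst in window_fst]
```

The second window, where population 1 is strongly differentiated from both other populations while they remain close to each other, gets a much larger PBS than the first.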

I also created a notebook to calculate all cohort triples in one computation, but I realized that it's not possible with the current approach in sgkit. It comes down to the way that variants are filtered out if they are not segregating.

I thought we could apply these operations in order: i) restrict to segregating sites, ii) window, iii) compute PBS for all cohort triples. However, i) doesn't work since segregating sites are defined for each cohort triple, not for the whole set of samples in one go. This in turn means that the windows are different for each cohort triple, which means that we can't just apply one set of windows before calculating PBS.
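To make the problem concrete, here is a small plain-Python sketch with made-up allele counts for four hypothetical cohorts (A to D). It treats a site as segregating for a triple when both alleles are present across the three cohorts combined, which is an assumption about the filter rather than the exact rule used in the notebook.

```python
from itertools import combinations

# Hypothetical data: allele_counts[cohort][variant] = (ref_count, alt_count).
allele_counts = {
    "A": [(4, 0), (2, 2), (4, 0), (4, 0)],
    "B": [(4, 0), (4, 0), (4, 0), (3, 1)],
    "C": [(4, 0), (4, 0), (0, 4), (4, 0)],
    "D": [(4, 0), (4, 0), (4, 0), (4, 0)],
}


def segregating_mask(cohorts):
    """True where both alleles are observed across the given cohorts combined."""
    n_variants = len(allele_counts[cohorts[0]])
    mask = []
    for v in range(n_variants):
        ref = sum(allele_counts[c][v][0] for c in cohorts)
        alt = sum(allele_counts[c][v][1] for c in cohorts)
        mask.append(ref > 0 and alt > 0)
    return mask


# One mask per cohort triple, all defined over the same four variants.
masks = {
    triple: segregating_mask(triple)
    for triple in combinations(sorted(allele_counts), 3)
}
```

With this toy data every triple keeps a different subset of variants, so a window defined as "the next N retained variants" covers different sites for each triple.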

One fix would just be to compute each desired triple using the approach in my first notebook - although it's very inefficient at the moment, and a lot of the computation could be shared.

More generally, it might be possible to add a dimension to windows so that each cohort triple can have a different set of windows. I'm also wondering if rather than filtering out non-segregating sites and saving to disk, it would be possible to have a mask which the windowing functions are aware of so they can skip masked sites when constructing and using windows. (I hit the same issue as #299 when filtering out non-segregating sites - cc @eric-czech.)
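A rough sketch of what mask-aware windowing could look like, in plain Python: windows are built from the unmasked sites only, but they are expressed as indices into the unfiltered variant axis, so nothing needs to be filtered and written back to disk. `masked_windows` is a hypothetical helper, not an existing sgkit function, and the two masks are invented.

```python
def masked_windows(mask, size):
    """Group unmasked variant indices into windows of `size` sites.

    `mask` is True where a variant is kept. The last window may hold
    fewer than `size` sites.
    """
    kept = [i for i, keep in enumerate(mask) if keep]
    return [kept[i:i + size] for i in range(0, len(kept), size)]


# Hypothetical segregating-site masks for two cohort triples over six variants.
mask_abc = [False, True, True, True, False, True]
mask_abd = [False, True, False, True, True, True]

windows_abc = masked_windows(mask_abc, 2)  # [[1, 2], [3, 5]]
windows_abd = masked_windows(mask_abd, 2)  # [[1, 3], [4, 5]]
```

Each triple ends up with its own set of windows, which is effectively the extra window dimension mentioned above.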

@jeromekelleher
Collaborator Author

Looks good, thanks @tomwhite. I wonder if windowing by physical distance here would help, rather than by number of variants? I think the current way we're doing it is very restrictive, and not actually what users will typically want for analysis.
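As a sketch of the idea, windows defined by physical distance depend only on variant positions, so the same windows apply to every cohort triple regardless of which sites each triple treats as segregating. The helper below is hypothetical plain Python with invented positions, not sgkit's API; it returns half-open `(start, stop)` index ranges for consecutive windows of `size_bp` base pairs.

```python
from bisect import bisect_left


def windows_by_position(positions, size_bp):
    """Return (start, stop) index ranges for fixed-width physical windows.

    `positions` must be sorted ascending. Window k covers coordinates
    [k * size_bp, (k + 1) * size_bp); a window with no variants has
    start == stop.
    """
    if not positions:
        return []
    n_windows = positions[-1] // size_bp + 1
    windows = []
    for k in range(n_windows):
        start = bisect_left(positions, k * size_bp)
        stop = bisect_left(positions, (k + 1) * size_bp)
        windows.append((start, stop))
    return windows


# Hypothetical variant positions on one contig, in base pairs.
positions = [100, 250, 900, 1200, 1950, 2100]
windows = windows_by_position(positions, 1000)  # [(0, 3), (3, 5), (5, 6)]
```

Non-segregating sites inside a window can then be skipped per triple when the statistic is computed, without changing the window boundaries.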

@tomwhite
Collaborator

Closing this now that we have sgkit notebooks for doing genome-wide selection scans on mosquito data using Garud H and PBS: https://github.com/tomwhite/shiny-train/tree/sgkit/notebooks/gwss

@kullrich

Hi,
I was wondering if you plan to implement the iHS and XPEHH stats in sgkit, and as part of that also an unphased version of them?

cggh/scikit-allel#374

Best regards

Kristian

@hammer
Contributor

hammer commented Jan 21, 2022

Hey @kullrich,

Thanks for the request! If you want to file an issue here with more details, we'll take a look.

Regards,
Jeff
