Genome-wide selection scans #232
Comments
Some additional details on our discourse: https://discourse.pystatgen.org/t/genome-wide-selection-scans/90
Many thanks @jeromekelleher for raising this. Just to mention that I'm currently working on genome-wide selection scans in our mosquito data (Ag1000G phase 2), and thought it might be interesting to give a flavour of the data. I'm currently running scans using four methods, PBS, H12 (Garud H), iHS and XPEHH, on each of 14 populations. Because of >20 years of intense mosquito control efforts, there are some very strong signals, some of which are driven by genes we already know about, others by genes we don't.

Here's what the scans look like for one of these populations: [figure]

Here are some plots zoomed in on a known insecticide resistance gene (Cyp6p3), which is under selection in multiple mosquito populations: [figure]
With #368 we can now compute PBS for windows defined along the genome. I have started to port @alimanfoo's PBS notebook (using scikit-allel) to sgkit. Here is an example for one cohort triple: https://nbviewer.jupyter.org/github/tomwhite/shiny-train/blob/sgkit/notebooks/gwss/sgkit_pbs_3cohorts.ipynb.

The values calculated by sgkit for this triple match those from scikit-allel for chromosome X, which is the only one I compared. (The other chromosomes would require more work to compare since the windowing is complicated by how the two arms of chromosomes 2 and 3 are stitched together, and also some special-casing due to an inversion.)

I also created a notebook to calculate all cohort triples in one computation, but I realized that it's not possible with the current approach in sgkit. It comes down to the way that variants are filtered out if they are not segregating. I thought we could apply these operations in order: i) restrict to segregating sites, ii) window, iii) compute PBS for all cohort triples. However, i) doesn't work since segregating sites are defined for each cohort triple, not for the whole set of samples in one go. This in turn means that the windows are different for each cohort triple, which means that we can't just apply one set of windows before calculating PBS.

One fix would just be to compute each desired triple using the approach in my first notebook - although it's very inefficient at the moment, and a lot of the computation could be shared. More generally, it might be possible to add a dimension to windows so that each cohort triple can have a different set of windows.

I'm also wondering if, rather than filtering out non-segregating sites and saving to disk, it would be possible to have a mask which the windowing functions are aware of so they can skip masked sites when constructing and using windows. (I hit the same issue as #299 when filtering out non-segregating sites - cc @eric-czech.)
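To make the mask idea concrete, here is a minimal NumPy sketch of windows that skip masked sites, built per cohort triple without filtering the data. Everything in it is an assumption for illustration: the function names `segregating_mask`, `masked_windows` and `masked_window_sum` are hypothetical and not part of sgkit or scikit-allel, and "segregating" is taken to mean that more than one allele is observed once the three cohorts of a triple are pooled.

```python
import numpy as np


def segregating_mask(ac):
    """True where more than one allele is observed across the pooled cohorts.

    ac: allele counts for ONE cohort triple, shape (n_variants, 3, n_alleles).
    (Assumed definition of "segregating", for illustration only.)
    """
    pooled = ac.sum(axis=1)  # (n_variants, n_alleles)
    return (pooled > 0).sum(axis=1) > 1


def masked_windows(mask, size):
    """Windows that each contain exactly `size` unmasked variants.

    Returns (starts, stops) as half-open index ranges into the UNFILTERED
    variants dimension. A trailing incomplete window is dropped.
    """
    idx = np.flatnonzero(mask)  # indices of the variants that are kept
    n_full = len(idx) // size
    starts = idx[: n_full * size : size]
    stops = idx[size - 1 : n_full * size : size] + 1
    return starts, stops


def masked_window_sum(values, mask, starts, stops):
    """Sum per-variant values in each window, ignoring masked-out variants."""
    csum = np.concatenate([[0.0], np.cumsum(np.where(mask, values, 0.0))])
    return csum[stops] - csum[starts]


# Toy data: 7 variants, one cohort triple, 2 alleles.
ac = np.array([
    [[2, 2], [4, 0], [4, 0]],  # segregating
    [[4, 0], [4, 0], [4, 0]],  # fixed for the first allele
    [[4, 0], [0, 4], [4, 0]],  # segregating
    [[3, 1], [4, 0], [4, 0]],  # segregating
    [[0, 4], [0, 4], [0, 4]],  # fixed for the second allele
    [[4, 0], [4, 0], [1, 3]],  # segregating
    [[2, 2], [2, 2], [2, 2]],  # segregating
])

mask = segregating_mask(ac)
starts, stops = masked_windows(mask, size=2)
per_variant = np.arange(7, dtype=float)  # stand-in for a per-variant statistic
print(starts, stops)
print(masked_window_sum(per_variant, mask, starts, stops))
```

Because the mask and the window bounds are computed per triple but index into the same unfiltered variants dimension, the per-variant quantities could in principle be computed once and shared across triples, with only the mask and bounds differing.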
Looks good, thanks @tomwhite. I wonder if windowing by physical distance here would help rather than by number of variants? I think the current way we're doing it is very restrictive, and not actually what users will typically want for analysis.
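As a rough illustration of windowing by physical distance, the sketch below assigns variants to fixed-width, non-overlapping windows of base-pair coordinates. Such windows depend only on variant positions, not on which sites a given cohort triple treats as segregating. It is plain NumPy under stated assumptions, not sgkit's API: `windows_by_position` is a hypothetical name, and positions are assumed to be sorted and to belong to a single contig.

```python
import numpy as np


def windows_by_position(pos, size, start=0):
    """Fixed-width physical windows over sorted positions on one contig.

    Window k covers coordinates [start + k*size, start + (k+1)*size).
    Returns (starts, stops) as half-open index ranges into `pos`;
    an empty window has starts[k] == stops[k].
    """
    pos = np.asarray(pos)
    n_windows = (pos[-1] - start) // size + 1
    edges = start + size * np.arange(n_windows + 1)
    # Index of the first variant at or beyond each window edge.
    bounds = np.searchsorted(pos, edges, side="left")
    return bounds[:-1], bounds[1:]


pos = np.array([5, 12, 18, 40, 41, 95])
starts, stops = windows_by_position(pos, size=25)
print(stops - starts)  # number of variants falling in each 25 bp window
```

The trade-off is that the number of variants per window now varies, so windows in low-diversity regions may hold few or no variants and the statistic there is noisier or undefined.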
Closing this now that we have sgkit notebooks for doing genome-wide selection scans on mosquito data using Garud H and PBS: https://github.com/tomwhite/shiny-train/tree/sgkit/notebooks/gwss
Hi, Best regards Kristian
Hey @kullrich, Thanks for the request! If you want to file an issue here with more details, we'll take a look. Regards,
This is an umbrella issue covering a significant area of functionality, similar to #226 and #67.
The goal here is to be able to run these selection scans at interactive speed on large datasets, for multiple population samples, and where each population sample has 100-1000 individuals. Note that this means we also need methods for grouping samples (#224).