Add variant annotation functions #112

eric-czech opened this issue Aug 13, 2020 · 4 comments
Labels
core operations: Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.

Comments

@eric-czech
Collaborator

In quantitative genetics it is common not to treat alleles at a locus as equal. The "functional consequence" of each allele is important and the process for determining these consequences is well standardized in VEP (in coding regions at least).

Providing access to annotations like this, ideally using the LOFTEE plugin, would be very useful since it is a common task and not necessarily an easy one. Hail's vep and nirvana functions could be a good guide.

@hammer added the "core operations" label Aug 13, 2020
@hammer
Contributor

hammer commented Sep 3, 2020

Given the approach in #227 to not vendor a PCA implementation, do you think we might want to just document how to use an external library to annotate variants, or do you think we will need some code inside sgkit to make variant annotation work nicely with our data structures?

@eric-czech
Collaborator Author

🤔 The hail solution is to:

  • Have you manage the VEP (or Nirvana) install
  • Split up the variant metadata for a dataset and run the vep CLI on vcfs
  • Recombine the results
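
The three steps above can be sketched roughly as follows. This is a minimal illustration, not Hail's implementation: the `Variant` tuple, the helper names, and the chunk file names are all hypothetical, and the command assumes a `vep` executable on the PATH with a local cache already installed. Only the `-i`, `-o`, `--vcf`, `--cache` and `--offline` options of VEP are used.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical stand-in for exported variant metadata:
# (contig, 1-based position, ref allele, alt allele)
Variant = Tuple[str, int, str, str]


def chunk_variants(variants: Sequence[Variant], size: int) -> Iterator[List[Variant]]:
    """Yield consecutive chunks of at most `size` variants."""
    for start in range(0, len(variants), size):
        yield list(variants[start:start + size])


def write_sites_vcf(variants: Sequence[Variant], path: Path) -> None:
    """Write a minimal sites-only VCF (no genotype columns)."""
    with open(path, "w") as f:
        f.write("##fileformat=VCFv4.2\n")
        f.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        for contig, pos, ref, alt in variants:
            f.write(f"{contig}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t.\n")


def vep_command(vcf_in: Path, vcf_out: Path) -> List[str]:
    """Build a VEP invocation (assumes `vep` is installed with an offline cache)."""
    return ["vep", "-i", str(vcf_in), "-o", str(vcf_out),
            "--vcf", "--cache", "--offline"]


def annotate(
    variants: Sequence[Variant],
    workdir: Path,
    chunk_size: int = 10_000,
    run: Callable = subprocess.run,
) -> List[Path]:
    """Split variants into VCF chunks, run VEP on each, return outputs in order."""
    outputs = []
    for i, chunk in enumerate(chunk_variants(variants, chunk_size)):
        vcf_in = workdir / f"chunk_{i}.vcf"
        vcf_out = workdir / f"chunk_{i}.vep.vcf"
        write_sites_vcf(chunk, vcf_in)
        run(vep_command(vcf_in, vcf_out), check=True)
        outputs.append(vcf_out)
    return outputs
```

Recombining then amounts to reading the returned files in order. In a distributed setting each chunk would be handled by a separate worker, which is where the per-node VEP installation becomes the real burden.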

Assuming cloud storage, I want to say it would actually be easier (from a user's perspective) for us to create a docker image with vep installed and then have something like a snakemake pipeline on GKE read exported variant data from Xarray/sgkit and produce results you can read back in easily.

Distributing variant data and running VEP on it isn't the hard part IMO; it's managing the installation on a cluster that will be a pain for users. I'm not sure how to make that go away without Docker, so there is perhaps some advantage to us having an sgkit Docker image that descends from the Dask image used by Helm with this extra stuff installed. That would certainly make it easier to avoid needing an external pipeline tool.

I would classify it a little differently than PCA though since the external library is so much harder to apply in this case.

@jeromekelleher
Collaborator

This feels to me like it's outside our remit - integrating with the Pydata ecosystem. If we start front-ending VEP for users, where do we stop? Certainly we should support processing VEP annotations but I think running VEP should be outside our scope.
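
As a rough idea of what "processing VEP annotations" could involve, here is a minimal sketch that parses the `CSQ` INFO field VEP writes in VCF output mode. It is not an sgkit API: the function names are hypothetical, and the header and INFO strings in the usage example are made-up illustrative values. It relies only on the documented layout, where the `##INFO` header lists pipe-separated field names after `Format:` and each record holds comma-separated annotations with pipe-separated values.

```python
from typing import Dict, List


def csq_fields(header_line: str) -> List[str]:
    """Extract CSQ field names from the ##INFO header line written by VEP."""
    marker = "Format: "
    start = header_line.index(marker) + len(marker)
    end = header_line.index('"', start)  # closing quote of the Description
    return header_line[start:end].split("|")


def parse_csq(info: str, fields: List[str]) -> List[Dict[str, str]]:
    """Return one dict per annotation found in a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            return [
                dict(zip(fields, annotation.split("|")))
                for annotation in entry[len("CSQ="):].split(",")
            ]
    return []  # record carries no CSQ annotation


# Usage with made-up values:
header = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
    'annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">'
)
fields = csq_fields(header)
annotations = parse_csq(
    "AC=1;CSQ=G|missense_variant|MODERATE|GENE1,G|upstream_gene_variant|MODIFIER|GENE2",
    fields,
)
```

Each variant can carry several annotations (typically one per overlapping transcript), so any array representation would need either a ragged dimension or a rule for picking one consequence per allele.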

@eric-czech
Collaborator Author

> This feels to me like it's outside our remit - integrating with the Pydata ecosystem. If we start front-ending VEP for users, where do we stop?

Good point. I can see there being some satellite pystatgen repos that are specific to putting some kind of compatible front end on hard-to-scale CLI tools.
