
Add read_vcfzarr #40


Merged
merged 6 commits on Aug 10, 2020

Conversation

tomwhite (Collaborator)

This implements @alimanfoo's suggestion at https://github.com/pystatgen/sgkit/issues/2#issuecomment-646138160 for reading VCF data via Zarr.

  • scikit-allel does not store phasing information, so the call/genotype_phased variable is not populated.
  • Variant IDs can be . in the VCF, meaning missing, so I added a variant/id_mask variable.
  • Variant alleles were documented to be of type S1, whereas they can be of any length, as the example shows, so I've updated the docs. Also, the maximum allele length is computed as part of loading, so we can assign the dtype accordingly (e.g. "S3" if the max allele length is 3).
  • I created test data as a zipped zarr file using scikit-allel that is checked into the test/data directory.
  • I've copied encode_array from sgkit-plink to this repo. It's also needed by sgkit-bgen, so should live here so it can be shared, but probably in a different module.
  • I haven't considered metadata for this PR.

@alimanfoo (Collaborator) commented Jul 14, 2020

Hi @tomwhite, one thing to account for is that there are two ways that the scikit-allel-style zarrs can be laid out.

Grouped by contig

The main way that I lay them out (and recommend to others) is to group the data by contig. This means there is one group for each contig in the zarr, and then within each of those contig groups there are "variants" and "calldata" groups. In this case you'd expect to navigate to the arrays using paths like "/{contig}/variants/POS" and "/{contig}/calldata/GT".

E.g., try this:

import zarr
import fsspec
store = fsspec.get_mapper('gs://1000genomes-zarr/ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes')
h1k_phase3 = zarr.open_consolidated(store=store)
print(h1k_phase3.tree())

Here's a truncated view of the output...

/
 ├── 1
 │   ├── calldata
 │   │   └── GT (6468094, 2504, 2) int8
 │   └── variants
 │       ├── AA (6468094,) object
 │       ├── AC (6468094, 12) int32
 │       ├── AF (6468094, 12) float32
 │       └── ... (more arrays)
 ├── 2
 │   ├── calldata
 │   │   └── GT (7081600, 2504, 2) int8
 │   └── variants
 │       ├── AA (7081600,) object
 │       ├── AC (7081600, 12) int32
 │       ├── AF (7081600, 12) float32
 │       └── ... (more arrays)
 ├── ... (more contig groups)
 └── samples
     ├── ID (2504,) object
     ├── gender (2504,) object
     ├── pop (2504,) object
     └── super_pop (2504,) object

Note that with this layout you never need to read the "/{contig}/variants/CHROM" arrays because the contig is implied by the grouping. Indeed the CHROM arrays may even be absent.

Ungrouped

Alternatively you can use scikit-allel to create a zarr that stores data from the whole genome together, i.e., there are no contig groups, and arrays contain data from the whole genome. In this case you would expect to navigate directly to the arrays, e.g., "/variants/POS" or "/calldata/GT".

With this layout you would of course need to read the "/variants/CHROM" array to determine which contig each variant belongs to.


Because the grouped-by-contig layout is the one I recommend, when I prototyped a vcfzarr_to_xarray function recently I added logic to handle the grouping by contig and to concatenate the data from each group.

I'd suggest doing something similar here, i.e.:

  • Offering a parameter for the user to specify which layout is in use (e.g., grouped=True), with the default corresponding to grouped-by-contig. (An alternative might be to detect grouping by contig automatically by looking for the presence of "variants" and "calldata" groups at the root of the hierarchy; see the sketch after this list.)
  • Offering a parameter to allow the user to select which contigs to include (e.g., contigs=['1', '2']) with the default being all contigs. (We often analyse data from a single contig at a time, so including all contigs all the time is a bit unnecessary.)
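
For illustration, a minimal sketch of what the auto-detection and per-contig concatenation could look like (the helper names detect_grouped and concat_field are hypothetical, and this assumes the scikit-allel layouts described above):

import dask.array as da
import zarr

def detect_grouped(vcfzarr: zarr.Group) -> bool:
    # Ungrouped layouts have "variants" and "calldata" at the root;
    # grouped-by-contig layouts have one group per contig instead.
    return not ("variants" in vcfzarr and "calldata" in vcfzarr)

def concat_field(vcfzarr: zarr.Group, contigs, field: str) -> da.Array:
    # field is e.g. "variants/POS" or "calldata/GT"
    return da.concatenate(
        [da.from_zarr(vcfzarr[f"{contig}/{field}"]) for contig in contigs], axis=0
    )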

@alimanfoo (Collaborator)

Just to add that there is another variation in layout that would be good to accommodate, which applies to how the sample information is stored.

Most of the time only the sample identifiers are stored, because this is all that is present in the source VCF. In this case I would expect to find an array with path "/samples" which is an array of sample identifiers. An example is the Ag1000G phase 2 data:

>>> store = fsspec.get_mapper('gs://ag1000g-release/phase2.AR1/variation/main/zarr/all/ag1000g.phase2.ar1')
>>> ag1000g_phase2 = zarr.open_consolidated(store=store)
>>> ag1000g_phase2['samples']
<zarr.core.Array '/samples' (1142,) object>

However, I'm also starting to add in more sample data, in which case the sample identifiers will be in a subgroup and accessed via a path like "/samples/ID". E.g., the human 1000 genomes phase 3:

>>> store = fsspec.get_mapper('gs://1000genomes-zarr/ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes')
>>> h1k_phase3 = zarr.open_consolidated(store=store)
>>> h1k_phase3['samples'].tree()
samples
 ├── ID (2504,) object
 ├── gender (2504,) object
 ├── pop (2504,) object
 └── super_pop (2504,) object

So it would be good to add some logic to check whether "/samples" is an array or a group, and deal with it accordingly.
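
A minimal sketch of that check, assuming a zarr group opened as in the examples above (get_sample_ids is a hypothetical helper):

import zarr

def get_sample_ids(vcfzarr):
    samples = vcfzarr["samples"]
    if isinstance(samples, zarr.Array):
        # "/samples" is an array of sample identifiers.
        return samples[:]
    # "/samples" is a group; identifiers live in its "ID" array.
    return samples["ID"][:]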

Comment on lines +45 to +36
def read_vcfzarr(path: PathType) -> xr.Dataset:
    """Read a VCF Zarr file.

    Loads VCF variant, sample, and genotype data as Dask arrays within a Dataset
    from a Zarr file created using scikit-allel's `vcf_to_zarr` function.

    Since `vcf_to_zarr` does not preserve phasing information, there is no
    `call/genotype_phased` variable in the resulting dataset.

    Parameters
    ----------
    path : PathType
        Path to the Zarr file.

    Returns
    -------
    xr.Dataset
        The dataset of genotype calls, created using `create_genotype_call_dataset`.
    """

    vcfzarr = zarr.open_group(str(path), mode="r")
@alimanfoo (Collaborator)

Suggested change
def read_vcfzarr(path: PathType) -> xr.Dataset:
    """Read a VCF Zarr file.
    Loads VCF variant, sample, and genotype data as Dask arrays within a Dataset
    from a Zarr file created using scikit-allel's `vcf_to_zarr` function.
    Since `vcf_to_zarr` does not preserve phasing information, there is no
    `call/genotype_phased` variable in the resulting dataset.
    Parameters
    ----------
    path : PathType
        Path to the Zarr file.
    Returns
    -------
    xr.Dataset
        The dataset of genotype calls, created using `create_genotype_call_dataset`.
    """
    vcfzarr = zarr.open_group(str(path), mode="r")
def read_vcfzarr(path: Union[str, Mapping], consolidated: bool = False) -> xr.Dataset:
    """Read data in Zarr format that was created using scikit-allel's `vcf_to_zarr` function.
    Loads variant, sample, and genotype data as Dask arrays within a Dataset.
    Since `vcf_to_zarr` does not preserve phasing information, there is no
    `call/genotype_phased` variable in the resulting dataset.
    Parameters
    ----------
    path : Union[str, Mapping]
        Either a string providing a URL or local file system path, or a store object.
    consolidated : bool
        Set True if the store uses consolidated metadata.
    Returns
    -------
    xr.Dataset
        The dataset of genotype calls, created using `create_genotype_call_dataset`.
    """
    if isinstance(path, str):
        # assume path is an fsspec-style URL
        import fsspec

        store = fsspec.get_mapper(path)
    else:
        # assume path is a store
        store = path
    if consolidated:
        vcfzarr = zarr.open_consolidated(store=store, mode="r")
    else:
        vcfzarr = zarr.open(store=store, mode="r")

Two suggestions here.

First suggestion is to allow the first argument to be either a string or a mapping (dict-like) object. If a string then assume it's an fsspec-style URL and load via fsspec - this gives the ability to read from cloud object stores. If it's a mapping then treat it as a store - this gives full flexibility to pass in any type of zarr store.

Second suggestion is to expose a consolidated argument, which allows the user to specify if consolidated metadata has been used (this is an optimisation for cloud stores).
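
Under that suggested signature, usage might look like the following (a sketch only, reusing the dataset URLs from the examples above; the final API may differ):

import fsspec

# Pass an fsspec-style URL directly...
ds = read_vcfzarr(
    "gs://1000genomes-zarr/ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes",
    consolidated=True,
)

# ...or construct any zarr store (mapping) explicitly and pass that instead.
store = fsspec.get_mapper(
    "gs://ag1000g-release/phase2.AR1/variation/main/zarr/all/ag1000g.phase2.ar1"
)
ds = read_vcfzarr(store, consolidated=True)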

@hammer (Contributor) commented Jul 15, 2020

Hey @tomwhite how are you thinking about putting this into sgkit.io vs. sgkit-vcf?

Relatedly, do you imagine that sgkit will be able to serialize/deserialize data to/from Zarr in a way that will be agnostic to the file format from which the data originated, or will we always need to put a plink, bgen, or vcf prefix on these methods?

@hammer (Contributor) commented Jul 15, 2020

I've copied encode_array from sgkit-plink to this repo. It's also needed by sgkit-bgen, so should live here so it can be shared, but probably in a different module.

Yikes have we already hit a need for sgkit-common? We should keep an eye on whether the repo split is causing more work than the value it creates. It will be interesting to see if it's truly hard to install some of the third party IO libraries for users.

@jeromekelleher (Collaborator)

Yikes have we already hit a need for sgkit-common? We should keep an eye on whether the repo split is causing more work than the value it creates. It will be interesting to see if it's truly hard to install some of the third party IO libraries for users.

Surely this is just sgkit itself? I think this is what @tomwhite was suggesting above, that we create a sgkit.util or something module.

I'm starting to think the vcfzarr function should live in sgkit-vcf though. What's the advantage of putting it in sgkit vs sgkit-vcf? I think it would really help if all the code that knows about VCF structure lived in a single repo, and we convert to our own on-disk format. Conversion is a one-off cost - we should be focusing on making our own disk format work really well for a whole range of tasks, not on how to make working with VCFs slightly less awful.

@tomwhite (Collaborator, Author)

@alimanfoo thanks for the suggestions about variations in the Zarr layout that we should support. I assume they can all be generated using scikit-allel's vcf_to_zarr. Agree about accepting a Mapping too - I just hadn't got round to adding it yet.

do you imagine that sgkit will be able to serialize/deserialize data to/from Zarr in a way that will be agnostic to the file format from which the data originated, or will we always need to put a plink, bgen, or vcf prefix on these methods?

I think it will be agnostic, since you can just use Xarray's Dataset.to_zarr method. It would be a good idea to add some metadata to indicate the data provenance, e.g. the file it was loaded from, a timestamp, etc.
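
For instance, something like this (a sketch only; the attribute names are illustrative rather than an agreed convention, and ds is assumed to be a dataset returned by one of the readers):

from datetime import datetime, timezone

ds.attrs["source"] = "path/to/original.vcf.gz"  # illustrative placeholder
ds.attrs["created"] = datetime.now(timezone.utc).isoformat()
ds.to_zarr("calls.zarr", mode="w")  # format-agnostic Zarr output via Xarray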

Surely this is just sgkit itself? I think this is what @tomwhite was suggesting above, that we create a sgkit.util or something module.

Right - that's exactly what I was suggesting.

I'm starting to think the vcfzarr function should live in sgkit-vcf though.

I'm fine with that. I might iterate on this in this PR first though.

@alimanfoo (Collaborator)

I'm starting to think the vcfzarr function should live in sgkit-vcf though. What's the advantage of putting it in sgkit vs sgkit-vcf?

At some point in the future, I imagine we might like to develop new functions for reading VCF files and writing out to zarr that is natively structured in the sgkit convention. That new code, when written, would naturally live in a separate sgkit-vcf repo, e.g., because it might have specific dependencies for reading VCF such as htslib.

This PR, however, is doing something a bit different, which is opening scikit-allel-flavoured zarr data. That data is structured in a way that is very close to the sgkit xarray convention, so it's really a thin shim to do the transformation. That data happened to originate from VCF, but the relationship is somewhat indirect.

So I'd suggest that this PR and the open_vcfzarr() function go into sgkit, at least for now. It's a relatively small piece of code, and we wouldn't seem to gain much from factoring it out into a separate package.

@alimanfoo (Collaborator)

Relatedly, do you imagine that sgkit will be able to serialize/deserialize data to/from Zarr in a way that will be agnostic to the file format from which the data originated, or will we always need to put a plink, bgen, or vcf prefix on these methods?

I imagine that ultimately sgkit will be able to open zarr data in the same way, regardless of which format the data originated in. That will depend on having support for reading different formats and outputting all to a common sgkit zarr format, which I imagine will follow the xarray zarr conventions and so can then be opened with xarray.open_zarr().

For the moment, the scikit-allel-flavoured zarr is a bit of a one-off case, because the data is already in zarr but isn't quite in sgkit/xarray conventions and so needs a little adapter.

@alimanfoo (Collaborator)

I think it will be agnostic, since you can just use Xarray's Dataset.to_zarr method. It would be a good idea to add some metadata to indicate the data provenance, e.g. file it was loaded from, timestamp, etc.

Agree with this, although a nuance here worth discussing at some point is that the path to zarr may need to be different for different file formats for practical reasons to do with executing the transformation for large datasets.

I'm guessing there will be two different cases.

One case would be formats like plink bed/bim/fam where the data can be parsed efficiently and it is convenient and performant to go via xarray.to_zarr() when performing the conversion to zarr.

The other case might be formats like VCF where parsing is slow and where we might want more specialised functions that read directly from VCF and write out to zarr, not going via xarray.to_zarr().

But in both cases the output should be the same, i.e., zarr data that conforms to the xarray+sgkit zarr conventions.

@alimanfoo (Collaborator)

Small comment, do we want to name this "open_vcfzarr" rather than "read_vcfzarr", just to follow the naming convention in xarray for opening files (e.g., xarray.open_zarr)?

@tomwhite (Collaborator, Author)

I've updated this PR to use the encode_array utility, and to handle the case where variants_alt is 1D.

I've opened a new issue (#56) for supporting more flexible layout options, since those can build on this PR, whose scope is to add a basic path for importing VCF data via scikit-allel's vcf_to_zarr function.

I couldn't see a way to generate these alternate layouts using vcf_to_zarr (but I may be missing something). I also wasn't able to access a Zarr zip file as a Zarr Mapping object (fsspec.get_mapper(f"zip::{path}") didn't work). But I think these improvements can be handled later, and will need to be in order to access cloud storage.
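
(As an aside, zarr's ZipStore might be a workaround for reading the zipped test data directly; this is untested here, and the path is a placeholder:)

import zarr

store = zarr.ZipStore("path/to/test.vcf.zarr.zip", mode="r")
vcfzarr = zarr.open_group(store=store, mode="r")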

@tomwhite (Collaborator, Author)

Small comment, do we want to name this "open_vcfzarr" rather than "read_vcfzarr", just to follow the naming convention in xarray for opening files (e.g., xarray.open_zarr)?

For writing files the convention is to_X, e.g. to_zarr. So we would have functions for open_plink/to_plink, open_bgen/to_bgen, open_vcfzarr/to_vcfzarr. What do others think?

@tomwhite (Collaborator, Author)

Apart from the naming question (open_vcfzarr or read_vcfzarr) I think this is complete (there are follow-ups in #56). I'd like to get @alimanfoo's view before merging.

@alimanfoo (Collaborator) commented Aug 6, 2020

Hi @tomwhite, apologies for slow feedback on this.

Happy to see this merged and deal with points raised in follow up PRs, your call.

From my point of view, this function becomes valuable when we can load data from the Anopheles gambiae 1000 Genomes Project (Ag1000G), which is the WGS dataset that I work on and mostly drives the requirements I have for sgkit.

Ag1000G data are stored in Google Cloud Storage, use consolidated metadata, and are grouped by contig. I.e., I'd like to be able to do:

callset = sgkit.open_vcfzarr('gs://ag1000g-release/phase2.AR1/variation/main/zarr/all/ag1000g.phase2.ar1', consolidated=True)

In general, I recommend grouping by contig to scikit-allel users, and so I think this is the more common case.

So FWIW I would be inclined to add support for fsspec-style URLs and for consolidated metadata in this PR as they are small changes.

Also FWIW I would be inclined to add support for grouping by contig and make that the default.

One partially related comment: currently you do some reading of data values to ascertain the dtype for the variant_alleles array and to compute the variant_contig array. This is going to cause some lag for the user, particularly for large datasets, so it would be great to avoid any reading of data or computation within this function if at all possible.

I think reading data could be avoided when determining the dtype for the variant_alleles array. I don't think you need to know the maximum length of the string values, you could just take the maximum length of the dtype in the input REF and ALT arrays. E.g., if REF is S1 and ALT is S1 then you can use S1 for variant_alleles. Similarly if REF was S3 and ALT was S5 you could use S5 for variant_alleles.
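
A sketch of that dtype inference (assuming REF and ALT have fixed-width S dtypes; the object-dtype case comes up below):

import numpy as np

def alleles_dtype(ref, alt):
    # For fixed-width byte-string dtypes, itemsize is the maximum string
    # length, so no data values need to be read.
    max_len = max(ref.dtype.itemsize, alt.dtype.itemsize)
    return np.dtype(f"S{max_len}")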

Also, reading variants/CHROM and computation of the variant_contig array is not necessary if the input data are grouped by contig. This is because the group name tells you which chromosome variants belong to, without having to read the variants/CHROM array.

Hope that's useful, happy to play this however you think best.

@tomwhite (Collaborator, Author) commented Aug 6, 2020

Thanks for the comments @alimanfoo!

From my point of view, this function becomes valuable when we can load data from the Anopheles gambiae 1000 Genomes Project (Ag1000G), which is the WGS dataset that I work on and mostly drives the requirements I have for sgkit.

I agree that is where we should be aiming to get to. My goal with this PR was a bit more modest: it was to validate the sgkit representation and ensure that there was a path from VCF to sgkit. I think those things have been achieved, so I'd like to keep this PR to that more restricted scope and address the other things in follow on PRs (https://github.com/pystatgen/sgkit/issues/56). In general, I favour smaller PRs where possible as it makes things easier to review and merge.

(BTW do you have any hints on how to generate these alternative layouts? I couldn't see how to do it with vcf_to_zarr.)

I think reading data could be avoided when determining the dtype for the variant_alleles array. I don't think you need to know the maximum length of the string values, you could just take the maximum length of the dtype in the input REF and ALT arrays. E.g., if REF is S1 and ALT is S1 then you can use S1 for variant_alleles. Similarly if REF was S3 and ALT was S5 you could use S5 for variant_alleles.

That would certainly be more efficient. I had a look at doing this but the dtype of REF and ALT for the Zarr files I generated with vcf_to_zarr was object, so we have to look through all the string values in that case.

@alimanfoo (Collaborator)

I agree that is where we should be aiming to get to. My goal with this PR was a bit more modest: it was to validate the sgkit representation and ensure that there was a path from VCF to sgkit. I think those things have been achieved, so I'd like to keep this PR to that more restricted scope and address the other things in follow on PRs (#56). In general, I favour smaller PRs where possible as it makes things easier to review and merge.

Sure, totally understand.

(BTW do you have any hints on how to generate these alternative layouts? I couldn't see how to do it with vcf_to_zarr.)

Do you mean grouping by chromosome? (If so, example here.)

Or the alternative ways to store sample variables?

I think reading data could be avoided when determining the dtype for the variant_alleles array. I don't think you need to know the maximum length of the string values, you could just take the maximum length of the dtype in the input REF and ALT arrays. E.g., if REF is S1 and ALT is S1 then you can use S1 for variant_alleles. Similarly if REF was S3 and ALT was S5 you could use S5 for variant_alleles.

That would certainly be more efficient. I had a look at doing this but the dtype of REF and ALT for the Zarr files I generated with vcf_to_zarr was object, so we have to look through all the string values in that case.

Ah yes, I forgot I sometimes use object dtype for strings with zarr. I would suggest to accept object dtype and leave it as-is. I.e., the variant_alleles array can be either S or object dtype.

@eric-czech (Collaborator)

Ah yes, I forgot I sometimes use object dtype for strings with zarr. I would suggest to accept object dtype and leave it as-is. I.e., the variant_alleles array can be either S or object dtype.

+1 to that, I pulled it out in https://github.com/pystatgen/sgkit/issues/98 since I think we should consider the same for plink/bgen.

@tomwhite (Collaborator, Author)

I would suggest to accept object dtype and leave it as-is. I.e., the variant_alleles array can be either S or object dtype.

I've fixed this now.

This is ready to merge.

@jeromekelleher (Collaborator) left a comment

LGTM

@tomwhite tomwhite merged commit a7ad4ae into sgkit-dev:master Aug 10, 2020
@hammer hammer mentioned this pull request Aug 12, 2020
@eric-czech eric-czech mentioned this pull request Aug 12, 2020
daletovar pushed a commit to daletovar/sgkit that referenced this pull request Aug 21, 2020
daletovar pushed a commit to daletovar/sgkit that referenced this pull request Aug 25, 2020
daletovar pushed a commit to daletovar/sgkit that referenced this pull request Aug 27, 2020
daletovar added a commit to daletovar/sgkit that referenced this pull request Aug 30, 2020
daletovar added a commit to daletovar/sgkit that referenced this pull request Aug 31, 2020
daletovar added a commit to daletovar/sgkit that referenced this pull request Sep 2, 2020
daletovar added a commit to daletovar/sgkit that referenced this pull request Sep 2, 2020
jeromekelleher pushed a commit to daletovar/sgkit that referenced this pull request Sep 14, 2020
jeromekelleher pushed a commit that referenced this pull request Sep 14, 2020