Add read_vcfzarr #40
Conversation
Hi @tomwhite, one thing to account for is that there are two ways that the scikit-allel-style zarrs can be laid out.

Grouped by contig

The main way that I lay them out (and recommend to others) is to group the data by contig. This means there is one group for each contig in the zarr, and within each of those contig groups there are "variants" and "calldata" groups. In this case you'd expect to navigate to the arrays using paths like "/{contig}/variants/POS" and "/{contig}/calldata/GT". E.g., try this:

import zarr
import fsspec
store = fsspec.get_mapper('gs://1000genomes-zarr/ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes')
h1k_phase3 = zarr.open_consolidated(store=store)
print(h1k_phase3.tree())

Here's a truncated view of the output...

Note that with this layout you never need to read the "/{contig}/variants/CHROM" arrays, because the contig is implied by the grouping. Indeed, the CHROM arrays may even be absent.

Ungrouped

Alternatively, you can use scikit-allel to create a zarr that stores data from the whole genome together, i.e., there are no contig groups, and arrays contain data from the whole genome. In this case you would expect to navigate directly to the arrays, e.g., "/variants/POS" or "/calldata/GT". With this layout you would of course need to read the "/variants/CHROM" array to determine the contig for each variant.

Because I use grouped-by-contig as the recommended layout, when I prototyped a vcfzarr_to_xarray function recently I added logic to handle this grouping by contig and to concatenate the data from each group. I'd suggest doing something similar here, i.e.:
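(The snippet that followed was not captured. Below is a minimal sketch of what that per-contig concatenation logic might look like; the function name is hypothetical and the use of dask is an assumption.)

import zarr
import dask.array as da

def concat_contig_groups(vcfzarr):
    # Ungrouped layout: arrays live directly under /variants and /calldata.
    if "variants" in vcfzarr and "calldata" in vcfzarr:
        return (da.from_zarr(vcfzarr["variants/POS"]),
                da.from_zarr(vcfzarr["calldata/GT"]))
    # Grouped-by-contig layout: concatenate the arrays from each contig group.
    contigs = list(vcfzarr.group_keys())
    pos = da.concatenate([da.from_zarr(vcfzarr[f"{c}/variants/POS"]) for c in contigs])
    gt = da.concatenate([da.from_zarr(vcfzarr[f"{c}/calldata/GT"]) for c in contigs])
    return pos, gt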
Just to add that there is another variation in layout that would be good to accommodate, which applies to how the sample information is stored. Most of the time only the sample identifiers are stored, because that is all that is present in the source VCF. In this case I would expect to find an array at the path "/samples" which holds the sample identifiers. An example is the Ag1000G phase 2 data:

>>> store = fsspec.get_mapper('gs://ag1000g-release/phase2.AR1/variation/main/zarr/all/ag1000g.phase2.ar1')
>>> ag1000g_phase2 = zarr.open_consolidated(store=store)
>>> ag1000g_phase2['samples']
<zarr.core.Array '/samples' (1142,) object>

However, I'm also starting to add in more sample data, in which case the sample identifiers will be in a subgroup and accessed via a path like "/samples/ID". E.g., the human 1000 genomes phase 3:

So it would be good to add some logic to check whether "/samples" is an array or a group, and deal with it accordingly.
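(A minimal sketch of that check, assuming zarr v2 and that vcfzarr is an open root group as in the examples above:)

import zarr

samples = vcfzarr["samples"]
if isinstance(samples, zarr.core.Array):
    sample_ids = samples[:]        # identifiers stored directly as an array
else:
    sample_ids = samples["ID"][:]  # identifiers stored in a subgroup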
def read_vcfzarr(path: PathType) -> xr.Dataset:
    """Read a VCF Zarr file.

    Loads VCF variant, sample, and genotype data as Dask arrays within a Dataset
    from a Zarr file created using scikit-allel's `vcf_to_zarr` function.

    Since `vcf_to_zarr` does not preserve phasing information, there is no
    `call/genotype_phased` variable in the resulting dataset.

    Parameters
    ----------
    path : PathType
        Path to the Zarr file.

    Returns
    -------
    xr.Dataset
        The dataset of genotype calls, created using `create_genotype_call_dataset`.
    """
    vcfzarr = zarr.open_group(str(path), mode="r")
Suggested change:

# requires: from typing import Mapping, Union
def read_vcfzarr(store: Union[str, Mapping], consolidated: bool = False) -> xr.Dataset:
    """Read data in Zarr format that was created using scikit-allel's `vcf_to_zarr` function.

    Loads variant, sample, and genotype data as Dask arrays within a Dataset.

    Since `vcf_to_zarr` does not preserve phasing information, there is no
    `call/genotype_phased` variable in the resulting dataset.

    Parameters
    ----------
    store : Union[str, Mapping]
        Either a string providing a URL or local file system path, or a store object.
    consolidated : bool
        Set True if the store uses consolidated metadata.

    Returns
    -------
    xr.Dataset
        The dataset of genotype calls, created using `create_genotype_call_dataset`.
    """
    if isinstance(store, str):
        # assume the string is an fsspec-style URL or path
        import fsspec

        store = fsspec.get_mapper(store)
    if consolidated:
        vcfzarr = zarr.open_consolidated(store=store, mode="r")
    else:
        vcfzarr = zarr.open(store=store, mode="r")
Two suggestions here.

The first suggestion is to allow the first argument to be either a string or a mapping (dict-like) object. If it is a string, then assume it's an fsspec-style URL and load it via fsspec - this gives the ability to read from cloud object stores. If it's a mapping, then treat it as a store - this gives full flexibility to pass in any type of zarr store.

The second suggestion is to expose a consolidated argument, which allows the user to specify if consolidated metadata has been used (this is an optimisation for cloud stores).
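(A hypothetical usage sketch for the suggested signature; the argument values are assumptions based on the stores mentioned above.)

# string argument: treated as an fsspec-style URL
ds = read_vcfzarr(
    "gs://1000genomes-zarr/ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes",
    consolidated=True,
)

# mapping argument: any zarr store can be passed in directly
import fsspec
store = fsspec.get_mapper("gs://ag1000g-release/phase2.AR1/variation/main/zarr/all/ag1000g.phase2.ar1")
ds = read_vcfzarr(store, consolidated=True)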
Hey @tomwhite, how are you thinking about putting this into […]? Relatedly, do you imagine that […]?
Yikes, have we already hit a need for […]?
Surely this is just […]. I'm starting to think the vcfzarr function should live in sgkit-vcf though. What's the advantage of putting it in sgkit vs sgkit-vcf? I think it would really help if all the code that knows about VCF structure lived in a single repo, and we converted to our own on-disk format. Conversion is a one-off cost - we should be focusing on making our own disk format work really well for a whole range of tasks, not on how to make working with VCFs slightly less awful.
@alimanfoo thanks for the suggestions about variations in the Zarr layout that we should support. I assume they can all be generated using scikit-allel's vcf_to_zarr?
I think it will be agnostic, since you can just use Xarray's Dataset.to_zarr method. It would be a good idea to add some metadata to indicate the data provenance, e.g. the file it was loaded from, a timestamp, etc.
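(A minimal sketch of that write path; the paths and attribute names here are hypothetical.)

import datetime

ds = read_vcfzarr("example.vcf.zarr")                      # hypothetical input path
ds.attrs["source"] = "example.vcf.zarr"                    # provenance: source file
ds.attrs["created"] = datetime.datetime.now().isoformat()  # provenance: timestamp
ds.to_zarr("example.sgkit.zarr")                           # standard xarray zarr write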
Right - that's exactly what I was suggesting.
I'm fine with that. I might iterate on this in this PR first though.
At some point in the future, I imagine we might like to develop new functions for reading VCF files and writing out to zarr that is natively structured in the sgkit convention. That new code would seem to naturally live in a separate sgkit-vcf repo, e.g., because it might have specific dependencies for reading VCF such as htslib. This PR, however, is doing something a bit different, which is opening scikit-allel-flavoured zarr data. That data is structured in a way that is very close to the sgkit xarray convention, so it's really a thin shim to do the transformation. The data happened to originate from VCF, but that is a somewhat indirect relationship. So I'd suggest that this PR and the open_vcfzarr() function go into sgkit, at least for now. It's a relatively small piece of code, and we wouldn't seem to gain much from factoring it out into a separate package.
I imagine that ultimately sgkit will be able to open zarr data in the same way, regardless of which format the data originated in. That will depend on having support for reading different formats and outputting all of them to a common sgkit zarr format, which I imagine will follow the xarray zarr conventions and so can then be opened with xarray.open_zarr(). For the moment, the scikit-allel-flavoured zarr is a bit of a one-off case, because the data is already in zarr but isn't quite in sgkit/xarray conventions and so needs a little adapter.
Agreed, although a nuance worth discussing at some point is that the path to zarr may need to be different for different file formats, for practical reasons to do with executing the transformation for large datasets. I'm guessing there will be two different cases. One case would be formats like plink bed/bim/fam, where the data can be parsed efficiently and it is convenient and performant to go via xarray.to_zarr() when performing the conversion to zarr. The other case might be formats like VCF, where parsing is slow and we might want more specialised functions that read directly from VCF and write out to zarr, not going via xarray.to_zarr(). But in both cases the output should be the same, i.e., zarr data that conforms to the xarray+sgkit zarr conventions.
Small comment: do we want to name this "open_vcfzarr" rather than "read_vcfzarr", just to follow the naming convention in xarray for opening files (e.g., xarray.open_zarr)?
I've updated this PR to use the […]. I've opened a new issue (#56) for supporting more flexible layout options, since they can build on this PR, which adds a basic path for importing VCF data via scikit-allel's vcf_to_zarr. I couldn't see a way to generate these alternate layouts using vcf_to_zarr.
For writing files the convention is […].
Apart from the naming question ([…]).
Hi @tomwhite, apologies for the slow feedback on this. Happy to see this merged and to deal with the points raised in follow-up PRs - your call. From my point of view, this function becomes valuable when we can load data from the Anopheles gambiae 1000 Genomes Project (Ag1000G), which is the WGS dataset that I work on and that mostly drives the requirements I have for sgkit. Ag1000G data are stored in Google Cloud Storage, use consolidated metadata, and are grouped by contig. I.e., I'd like to be able to do:
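(The code example here was not captured; a plausible sketch given the requirements just stated, reusing the Ag1000G URL and the consolidated flag suggested above:)

ds = read_vcfzarr(
    "gs://ag1000g-release/phase2.AR1/variation/main/zarr/all/ag1000g.phase2.ar1",
    consolidated=True,
)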
In general, I recommend grouping by contig to scikit-allel users, and so I think this is the more common case. So FWIW I would be inclined to add support for fsspec-style URLs and for consolidated metadata in this PR, as they are small changes. Also FWIW I would be inclined to add support for grouping by contig and to make that the default.

One partially-related comment: currently you do some reading of data values to ascertain the dtype for the variant_alleles array and to compute the variant_contig array. This is going to cause some lag for the user, particularly for large datasets, and it would be great to avoid any reading of data or computation within this function if at all possible.

I think reading data could be avoided when determining the dtype for the variant_alleles array. I don't think you need to know the maximum length of the string values; you could just take the maximum length of the dtype in the input REF and ALT arrays. E.g., if REF is S1 and ALT is S1 then you can use S1 for variant_alleles. Similarly, if REF was S3 and ALT was S5 you could use S5 for variant_alleles. (See the sketch after this comment.)

Also, reading variants/CHROM and computing the variant_contig array is not necessary if the input data are grouped by contig. This is because the group name tells you which chromosome variants belong to, without having to read the variants/CHROM array.

Hope that's useful, happy to play this however you think best.
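(A minimal sketch of the dtype suggestion; the helper name is hypothetical.)

import numpy as np

def alleles_dtype(ref_dtype: np.dtype, alt_dtype: np.dtype) -> np.dtype:
    # Use the wider of the two fixed-width bytes dtypes, without reading
    # any data values: S1 + S1 -> S1; S3 + S5 -> S5.
    return np.dtype(f"S{max(ref_dtype.itemsize, alt_dtype.itemsize)}")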
Thanks for the comments @alimanfoo!
I agree that is where we should be aiming to get to. My goal with this PR was a bit more modest: it was to validate the sgkit representation and ensure that there was a path from VCF to sgkit. I think those things have been achieved, so I'd like to keep this PR to that more restricted scope and address the other things in follow-on PRs (https://github.com/pystatgen/sgkit/issues/56). In general, I favour smaller PRs where possible, as they make things easier to review and merge. (BTW do you have any hints on how to generate these alternative layouts? I couldn't see how to do it with vcf_to_zarr.)
That would certainly be more efficient. I had a look at doing this, but the dtype of REF and ALT for the Zarr files I generated with vcf_to_zarr was object, so the string length can't be determined from the dtype.
Sure, totally understand.
Do you mean grouping by chromosome? (If so, example here.) Or the alternative ways to store sample variables?
Ah yes, I forgot I sometimes use object dtype for strings with zarr. I would suggest accepting object dtype and leaving it as-is, i.e., the variant_alleles array can be either S or object dtype.
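(Extending the earlier hypothetical helper to cover this case:)

import numpy as np

def alleles_dtype(ref_dtype: np.dtype, alt_dtype: np.dtype) -> np.dtype:
    if ref_dtype == object or alt_dtype == object:
        # Variable-length strings: leave object dtype as-is.
        return np.dtype(object)
    return np.dtype(f"S{max(ref_dtype.itemsize, alt_dtype.itemsize)}")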
+1 to that, I pulled it out in https://github.com/pystatgen/sgkit/issues/98 since I think we should consider the same for plink/bgen.
I've fixed this now. This is ready to merge.
LGTM
This implements @alimanfoo's suggestion at https://github.com/pystatgen/sgkit/issues/2#issuecomment-646138160 for reading VCF data via Zarr.

A few notes:
- Since vcf_to_zarr does not preserve phasing information, the call/genotype_phased variable is not populated.
- IDs can be "." in the VCF, meaning missing, so I added a variant/id_mask variable.
- The docs said the alleles were S1, whereas they can be of any length as the example shows, so I've updated the docs. Also, the max length is computed as a part of loading, so we can assign the type (e.g. "S3" if the max allele length is 3).
- I moved encode_array from sgkit-plink to this repo. It's also needed by sgkit-bgen, so it should live here so it can be shared, but probably in a different module.