VCF Zarr improvements #56


Closed
2 of 4 tasks
tomwhite opened this issue Jul 21, 2020 · 2 comments
Labels
IO Issues related to reading and writing common third-party file formats

Comments

@tomwhite
Collaborator

tomwhite commented Jul 21, 2020

It would be good to make VCF Zarr reading more flexible:

@hammer hammer added the IO Issues related to reading and writing common third-party file formats label Jul 21, 2020
@jerowe

jerowe commented Aug 12, 2020

Hi there. I started on this today.

@alimanfoo gave us access to vcfzarr data. I'm using that as the basis for reading in data.

It looks like when the data for all contigs is in one store, you have:

2L
 ├── calldata
 │   └── GT (11524923, 1142, 2) int8
 ├── samples (1142,) object
 └── variants
     ├── ABHet (11524923,) float32
...

3L
 ├── calldata
 │   └── GT (11524923, 1142, 2) int8
 ├── samples (1142,) object
 └── variants
     ├── ABHet (11524923,) float32

Here 2L and 3L are the contig names, and there is an additional samples key that is an array (though I think it could optionally be a group).

It can also be the case that you have one zarr store per contig, in which case it looks like this:

 ├── calldata
 │   └── GT (11524923, 1142, 2) int8
 ├── samples (1142,) object
 └── variants
     ├── ABHet (11524923,) float32
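Both layouts can be addressed with the same code by deriving the full store key from the contig name. A minimal sketch of that idea (`resolve_key` is a hypothetical helper, not part of the code below):

```python
def resolve_key(field_key, contig=None, grouped_by_contig=True):
    """Return the full store path for a field.

    In the grouped layout each field is nested under its contig name;
    in the one-store-per-contig layout the field key is used as-is.
    """
    if grouped_by_contig and contig is not None:
        return "{}/{}".format(contig, field_key)
    return field_key

print(resolve_key("calldata/GT", contig="2L"))              # 2L/calldata/GT
print(resolve_key("calldata/GT", grouped_by_contig=False))  # calldata/GT
```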

So I broke this functionality down into two parts. The first is defining an xarray dataset that we can use as a very minimal interface.

## Define a GenotypeCallSetDocument
  
from typing import Any, Dict, Hashable, List, Optional
import xarray as xr

def create_genotype_call_dataset_fetcher(
    *,
    allele_pos_key: str = "variants/POS",
    allele_ref_key: str = "variants/REF",
    allele_alt_key: str = "variants/ALT",
    variant_id_key: str = "variants/ID",
    # a potential alternative is 'calldata/genotype'
    genotype_call_key: str = "calldata/GT",
    sample_key: str = "samples",
    grouped_by_contig: bool = True,
    contigs: Optional[List[str]] = None
) -> xr.Dataset:
    """
    Define the keys used to read in the vcfzarr store.

    If contigs is None we'll try to guess the contigs from the top-level keys.
    """
    attrs: Dict[Hashable, Any] = {
        "variant_id_key": variant_id_key,
        "allele_pos_key": allele_pos_key,
        "allele_ref_key": allele_ref_key,
        "allele_alt_key": allele_alt_key,
        "genotype_call_key": genotype_call_key,
        "grouped_by_contig": grouped_by_contig,
        "sample_key": sample_key,
        "contigs": contigs,
    }
    return xr.Dataset(data_vars={}, attrs=attrs)

The justification for having an additional dataset is that we can carry a grouped True/False flag, and that the actual zarr keys differ somewhat from phase1 to phase2; as far as I know there are no restrictions on the key names, so they could be anything. This gives the end user some flexibility.
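For instance, a store that used 'calldata/genotype' rather than the default 'calldata/GT' could be handled just by overriding that one key. The snippet below builds the same attrs-only dataset directly (equivalent to calling the fetcher function with `genotype_call_key='calldata/genotype'`), relying on xarray exposing Dataset attrs via attribute access:

```python
import xarray as xr

# A fetcher configured for a hypothetical store that names its
# genotype calls 'calldata/genotype' instead of 'calldata/GT'.
fetcher = xr.Dataset(
    attrs={
        "genotype_call_key": "calldata/genotype",
        "allele_pos_key": "variants/POS",
        "allele_ref_key": "variants/REF",
        "allele_alt_key": "variants/ALT",
        "sample_key": "samples",
        "grouped_by_contig": True,
    }
)
print(fetcher.genotype_call_key)  # calldata/genotype
```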

Then I have a WIP get_grouped function.

import dask.array as da
import numpy as np

# encode_array, _ensure_2d and create_genotype_call_dataset come from sgkit


def get_grouped(fetcher, store, samples, contigs, ploidy=2):
    """Assume the zarr store is grouped by contig and gather the data"""

    # get the reference positions
    id_data = []
    ref_data = []
    alt_data = []
    pos_data = []
    contig_pos_data = []

    # get the calldata
    genotype_data = []
    encoded_contigs = encode_array(contigs)

    for i, contig in enumerate(contigs):

        # Get the alleles
        ref_key = '{}/{}'.format(contig, fetcher.allele_ref_key)
        if ref_key in store:
            ref_data.append(da.from_zarr(store[ref_key]))
        else:
            raise KeyError('Reference data key {} not found'.format(ref_key))

        alt_key = '{}/{}'.format(contig, fetcher.allele_alt_key)
        if alt_key in store:
            alt_data.append(da.from_zarr(store[alt_key]))
        else:
            raise KeyError('Alt data key {} not found'.format(alt_key))

        # Get the allele positions
        pos_key = '{}/{}'.format(contig, fetcher.allele_pos_key)
        if pos_key in store:
            pos_data.append(da.from_zarr(store[pos_key]))
        else:
            raise KeyError('Pos data key {} not found'.format(pos_key))

        # get the variant_contig as an integer from the encoded_contigs
        last_pos_data = pos_data[-1]
        t = np.full(len(last_pos_data), encoded_contigs[0][i], dtype='i')
        # doesn't have to be a dask array
        contig_pos_data.append(t)

        # Dataset I'm working with doesn't have an ID key,
        # so fall back to positional indices as IDs
        id_key = '{}/{}'.format(contig, fetcher.variant_id_key)
        if id_key in store:
            id_data.append(da.from_zarr(store[id_key]).astype(str))
        else:
            t = np.arange(len(last_pos_data)).astype(str)
            id_data.append(da.from_array(t))

        # Get the genotype calls
        gt_key = '{}/{}'.format(contig, fetcher.genotype_call_key)
        if gt_key in store:
            len_expected = len(last_pos_data)
            gt = da.from_zarr(store[gt_key])
            # TODO replace asserts with sanity checks that raise appropriate errors
            assert len(gt.shape) == 3
            assert gt.shape[0] == len_expected
            assert gt.shape[1] == len(samples)
            assert gt.shape[2] == ploidy

            genotype_data.append(gt)

    # TODO create sanity checks and raise appropriate errors
    assert len(ref_data) == len(contigs)
    assert len(alt_data) == len(contigs)
    assert len(pos_data) == len(contigs)
    assert len(genotype_data) == len(contigs)

    contig_pos_data = np.concatenate(contig_pos_data)
    variant_contig = contig_pos_data.astype("int16")
    variant_contig_names = list(contigs)

    # For variant alleles, combine REF and ALT into a single array
    variants_ref = da.concatenate(ref_data)
    variants_alt = da.concatenate(alt_data)

    variant_alleles = da.concatenate(
        [_ensure_2d(variants_ref), _ensure_2d(variants_alt)], axis=1
    )

    variants_id = da.concatenate(id_data)
    variants_pos = da.concatenate(pos_data)
    call_genotype = da.concatenate(genotype_data)

    ds = create_genotype_call_dataset(
        variant_contig_names=variant_contig_names,
        variant_contig=variant_contig,
        variant_position=variants_pos,
        variant_alleles=variant_alleles,
        sample_id=samples.astype(str),
        call_genotype=call_genotype,
        variant_id=variants_id,
    )

    return ds


                            
def get_samples(fetcher, store):
    # TODO There needs to be checks here to make sure we're using the right datatype
    return da.from_zarr(store[fetcher.sample_key]).astype(str)


def guess_contigs(store):
    """
Most zarr stores have a top-level tree where the keys are the contigs, plus a 'samples' key
    """
    keys = list(store.keys())
    if 'samples' in keys:
        keys.remove('samples')
    return keys
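As a quick sanity check of the guessing logic, a plain dict can stand in for a zarr group, since both support .keys() (the function body below is a copy of guess_contigs above, repeated so the snippet is self-contained):

```python
def guess_contigs(store):
    # same logic as guess_contigs above: top-level keys minus 'samples'
    keys = list(store.keys())
    if 'samples' in keys:
        keys.remove('samples')
    return keys

# dict stand-in for a grouped zarr store
store = {"2L": None, "2R": None, "3L": None, "samples": None}
print(guess_contigs(store))  # ['2L', '2R', '3L']
```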

A user would then call this as:

document = create_genotype_call_dataset_fetcher()

# get all the contigs!
variant_contig = ['2L', '2R', '3L', '3R', 'X',]

# get one small contig and make sure nothing is broken!
variant_contig = ['X']

ds = get_grouped(document, vcfzarr, get_samples(document, vcfzarr), variant_contig)
ds

@jeromekelleher
Collaborator

Thanks @jerowe - just updating to say I think we should wait until @alimanfoo is back before making any decisions on what we do with this.
