VCF Zarr improvements #56


Closed
2 of 4 tasks
tomwhite opened this issue Jul 21, 2020 · 2 comments
Labels
IO Issues related to reading and writing common third-party file formats

Comments

@tomwhite
Collaborator

tomwhite commented Jul 21, 2020

It would be good to make VCF Zarr reading more flexible:

@hammer hammer added the IO Issues related to reading and writing common third-party file formats label Jul 21, 2020
@jerowe

jerowe commented Aug 12, 2020

Hi there. I started on this today.

@alimanfoo gave us access to vcfzarr data. I'm using that as the basis for reading in data.

It looks like when the data for all contigs is in one store, you have:

2L
 ├── calldata
 │   └── GT (11524923, 1142, 2) int8
 ├── samples (1142,) object
 └── variants
     ├── ABHet (11524923,) float32
...

3L
 ├── calldata
 │   └── GT (11524923, 1142, 2) int8
 ├── samples (1142,) object
 └── variants
     ├── ABHet (11524923,) float32

Here 2L and 3L are the contig names, and there is an additional samples key that is an array (though I think it could optionally be a group).

It can also be the case that you have one zarr store per contig, in which case it looks like this:

 ├── calldata
 │   └── GT (11524923, 1142, 2) int8
 ├── samples (1142,) object
 └── variants
     ├── ABHet (11524923,) float32
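Both layouts can be addressed with the same code by deriving the full store key from the contig name. A minimal sketch of that idea (`resolve_key` is a hypothetical helper, not part of the code below):

```python
def resolve_key(field_key, contig=None, grouped_by_contig=True):
    """Return the full store path for a field.

    In the grouped layout each field is nested under its contig name;
    in the one-store-per-contig layout the field key is used as-is.
    """
    if grouped_by_contig and contig is not None:
        return "{}/{}".format(contig, field_key)
    return field_key

print(resolve_key("calldata/GT", contig="2L"))              # 2L/calldata/GT
print(resolve_key("calldata/GT", grouped_by_contig=False))  # calldata/GT
```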

So I broke this functionality down into two parts. The first is defining an xarray dataset that we can use as a very minimal interface.

## Define a GenotypeCallSetDocument
  
from typing import Any, Dict, Hashable, List, Optional
import xarray as xr

def create_genotype_call_dataset_fetcher(
    *,
    allele_pos_key: str = "variants/POS",
    allele_ref_key: str = "variants/REF",
    allele_alt_key: str = "variants/ALT",
    variant_id_key: str = "variants/ID",
    # a potential alternative is 'calldata/genotype'
    genotype_call_key: str = "calldata/GT",
    sample_key: str = "samples",
    grouped_by_contig: bool = True,
    contigs: Optional[List[str]] = None
) -> xr.Dataset:
    """
    Define the keys used to read in the vcfzarr store.

    If contigs is None we'll try to guess the contigs from the top-level keys.
    """
    attrs: Dict[Hashable, Any] = {
        "variant_id_key": variant_id_key,
        "allele_pos_key": allele_pos_key,
        "allele_ref_key": allele_ref_key,
        "allele_alt_key": allele_alt_key,
        "genotype_call_key": genotype_call_key,
        "grouped_by_contig": grouped_by_contig,
        "sample_key": sample_key,
        "contigs": contigs,
    }
    return xr.Dataset(data_vars={}, attrs=attrs)

The justification for having an additional dataset is that we can carry a grouped True/False flag, and that the actual zarr keys differ somewhat from phase1 to phase2; as far as I know there are no restrictions on the key names, so they could be anything. This gives the end user some flexibility.
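For instance, a store that used 'calldata/genotype' rather than the default 'calldata/GT' could be handled just by overriding that one key. The snippet below builds the same attrs-only dataset directly (equivalent to calling the fetcher function with `genotype_call_key='calldata/genotype'`), relying on xarray exposing Dataset attrs via attribute access:

```python
import xarray as xr

# A fetcher configured for a hypothetical store that names its
# genotype calls 'calldata/genotype' instead of 'calldata/GT'.
fetcher = xr.Dataset(
    attrs={
        "genotype_call_key": "calldata/genotype",
        "allele_pos_key": "variants/POS",
        "allele_ref_key": "variants/REF",
        "allele_alt_key": "variants/ALT",
        "sample_key": "samples",
        "grouped_by_contig": True,
    }
)
print(fetcher.genotype_call_key)  # calldata/genotype
```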

Then I have a WIP get_grouped function.

import dask.array as da
import numpy as np

# encode_array, _ensure_2d and create_genotype_call_dataset come from sgkit


def get_grouped(fetcher, store, samples, contigs, ploidy=2):
    """Assume the zarr store is grouped by contig and gather the data"""

    # get the reference positions
    id_data = []
    ref_data = []
    alt_data = []
    pos_data = []
    contig_pos_data = []

    # get the calldata
    genotype_data = []
    encoded_contigs = encode_array(contigs)

    for i, contig in enumerate(contigs):

        # Get the alleles
        ref_key = '{}/{}'.format(contig, fetcher.allele_ref_key)
        if ref_key in store:
            ref_data.append(da.from_zarr(store[ref_key]))
        else:
            raise KeyError('Reference data key {} not found'.format(ref_key))

        alt_key = '{}/{}'.format(contig, fetcher.allele_alt_key)
        if alt_key in store:
            alt_data.append(da.from_zarr(store[alt_key]))
        else:
            raise KeyError('Alt data key {} not found'.format(alt_key))

        # Get the allele positions
        pos_key = '{}/{}'.format(contig, fetcher.allele_pos_key)
        if pos_key in store:
            pos_data.append(da.from_zarr(store[pos_key]))
        else:
            raise KeyError('Pos data key {} not found'.format(pos_key))

        # get the variant_contig as an integer from the encoded_contigs
        last_pos_data = pos_data[-1]
        t = np.full(len(last_pos_data), encoded_contigs[0][i], dtype='i')
        # doesn't have to be a dask array
        contig_pos_data.append(t)

        # Dataset I'm working with doesn't have an ID key,
        # so fall back to positional indices as IDs
        id_key = '{}/{}'.format(contig, fetcher.variant_id_key)
        if id_key in store:
            id_data.append(da.from_zarr(store[id_key]).astype(str))
        else:
            t = np.arange(len(last_pos_data)).astype(str)
            id_data.append(da.from_array(t))

        # Get the genotype calls
        gt_key = '{}/{}'.format(contig, fetcher.genotype_call_key)
        if gt_key in store:
            len_expected = len(last_pos_data)
            gt = da.from_zarr(store[gt_key])
            # TODO replace asserts with sanity checks that raise appropriate errors
            assert len(gt.shape) == 3
            assert gt.shape[0] == len_expected
            assert gt.shape[1] == len(samples)
            assert gt.shape[2] == ploidy

            genotype_data.append(gt)

    # TODO create sanity checks and raise appropriate errors
    assert len(ref_data) == len(contigs)
    assert len(alt_data) == len(contigs)
    assert len(pos_data) == len(contigs)
    assert len(genotype_data) == len(contigs)

    contig_pos_data = np.concatenate(contig_pos_data)
    variant_contig = contig_pos_data.astype("int16")
    variant_contig_names = list(contigs)

    # For variant alleles, combine REF and ALT into a single array
    variants_ref = da.concatenate(ref_data)
    variants_alt = da.concatenate(alt_data)

    variant_alleles = da.concatenate(
        [_ensure_2d(variants_ref), _ensure_2d(variants_alt)], axis=1
    )

    variants_id = da.concatenate(id_data)
    variants_pos = da.concatenate(pos_data)
    call_genotype = da.concatenate(genotype_data)

    ds = create_genotype_call_dataset(
        variant_contig_names=variant_contig_names,
        variant_contig=variant_contig,
        variant_position=variants_pos,
        variant_alleles=variant_alleles,
        sample_id=samples.astype(str),
        call_genotype=call_genotype,
        variant_id=variants_id,
    )

    return ds


                            
def get_samples(fetcher, store):
    # TODO There needs to be checks here to make sure we're using the right datatype
    return da.from_zarr(store[fetcher.sample_key]).astype(str)


def guess_contigs(store):
    """
Most zarr stores have a top-level tree where the keys are the contigs, plus a 'samples' key
    """
    keys = list(store.keys())
    if 'samples' in keys:
        keys.remove('samples')
    return keys
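As a quick sanity check of the guessing logic, a plain dict can stand in for a zarr group, since both support .keys() (the function body below is a copy of guess_contigs above, repeated so the snippet is self-contained):

```python
def guess_contigs(store):
    # same logic as guess_contigs above: top-level keys minus 'samples'
    keys = list(store.keys())
    if 'samples' in keys:
        keys.remove('samples')
    return keys

# dict stand-in for a grouped zarr store
store = {"2L": None, "2R": None, "3L": None, "samples": None}
print(guess_contigs(store))  # ['2L', '2R', '3L']
```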

A user would then call this as:

document = create_genotype_call_dataset_fetcher()

# get all the contigs!
variant_contig = ['2L', '2R', '3L', '3R', 'X',]

# get one small contig and make sure nothing is broken!
variant_contig = ['X']

ds = get_grouped(document, vcfzarr, get_samples(document, vcfzarr), variant_contig)
ds

@jeromekelleher
Collaborator

Thanks @jerowe - just updating to say I think we should wait until @alimanfoo is back before making any decisions on what we do with this.
