Method for grouping samples which is understood by library functions #224

Closed
jeromekelleher opened this issue Sep 1, 2020 · 7 comments
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.

Comments

@jeromekelleher
Collaborator

Population structure analysis depends on us being able to assign labels to samples. In this thread we discussed the possibility of using individual datasets, one for each group, but grouping by a categorical variable is much more flexible and idiomatic.

We need to develop some conventions which allow the user to do standard grouping of samples, and functions which use these groupings (Fst or divergence, for example) should understand these conventions and update the output dataset accordingly.

It may also be worth thinking about how we might group the variants dimension while we're thinking about it. For example, we might want to get the pairwise Fst values for all pairs of populations, for all chromosomes in a dataset. It would be super-nice if we had an idiomatic way of running these calculations.

@eric-czech - you had some concrete ideas on this in the earlier thread, what are your thoughts?

@hammer hammer added the core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. label Sep 1, 2020
@eric-czech
Collaborator

eric-czech commented Sep 1, 2020

@eric-czech - you had some concrete ideas on this in the earlier thread, what are your thoughts?

Using sample_group as the default variable name along with a groups dimension would align nicely with our other naming conventions. I don't see too much harm in appropriating the "groups" name for the samples dimension, since grouping will be less common for variants. It would be nice if there were something more specific like "populations", since up until now dimension names have been fairly unambiguous. Using "groups" as the dimension name would strengthen the case for making both variable and dimension names configurable.

Renaming dimensions to meet the requirements of our functions isn't hard either so this choice isn't that important IMO, but I would mildly favor something like cohorts (variable = sample_cohort) rather than groups. Do you see any issues with that @jeromekelleher?

As far as actual computations are concerned, I think statistics that can be computed for cohorts independently from one another are a great fit for the .groupby.map idiom.
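As a toy illustration of that idiom (the dataset and the sample_value variable here are invented for the example, not sgkit conventions):

```python
import numpy as np
import xarray as xr

# Toy dataset: 6 samples split into 2 cohorts, with one per-sample variable.
ds = xr.Dataset(
    {"sample_value": ("samples", np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))}
)
ds["sample_cohort"] = xr.DataArray(np.array([0, 0, 0, 1, 1, 1]), dims="samples")

# Statistics that are independent across cohorts fit the groupby/map idiom:
# each cohort's subset of the dataset is passed to the function separately,
# and the results are concatenated along a new sample_cohort dimension.
per_cohort_mean = ds.groupby("sample_cohort").map(lambda g: g.mean())
print(per_cohort_mean.sample_value.values)  # [2. 5.]
```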

For something like computing pairwise stats for sample groups by chromosome, this is the most idiomatic approach I can think of if the sample sizes aren't equal and there is no guarantee on sorting:

import itertools
import xarray as xr
import numpy as np
from sgkit.testing import simulate_genotype_call_dataset

# Simulate 2 contigs
ds = simulate_genotype_call_dataset(n_variant=100, n_sample=30, n_contig=2, seed=0)
# Create 3 arbitrary cohorts
ds['sample_cohort'] = xr.DataArray(np.repeat([0, 1, 2], ds.dims['samples'] // 3), dims='samples')

def fn2(g):
    g1, g2 = g
    print(f'Comparing cohorts {g1[0]} <-> {g2[0]} for contig {g1[1].variant_contig.item(0)}')
    # Produce a single scalar
    return (g1[1].call_genotype - g2[1].call_genotype).mean()

def fn1(g):
    groups = itertools.combinations(g.groupby('sample_cohort'), 2)
    return xr.concat(map(fn2, groups), dim='cohort_pairs')

ds.groupby('variant_contig').map(fn1)

Comparing cohorts 0 <-> 1 for contig 0
Comparing cohorts 0 <-> 2 for contig 0
Comparing cohorts 1 <-> 2 for contig 0
Comparing cohorts 0 <-> 1 for contig 1
Comparing cohorts 0 <-> 2 for contig 1
Comparing cohorts 1 <-> 2 for contig 1

<xarray.DataArray 'call_genotype' (variant_contig: 2, cohort_pairs: 3)>
array([[ 0.047,  0.004, -0.043],
       [ 0.008, -0.005, -0.013]])

I doubt that will scale well out-of-core but I'm not sure what would.

@timothymillar
Collaborator

+1 for grouping within a single dataset, it will simplify some potentially complex operations.
For example, if a dataset containing multiple sample groups includes a kinship matrix of shape (samples, samples), it is trivial to retrieve both the within-group and the among-group values.
If there is a dataset per group, then it is unclear how to include pairwise kinship among samples of different groups.
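A small sketch of that point, with an invented kinship variable and cohort assignment (none of these names are established conventions):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
n = 6
k = rng.random((n, n))
kinship = (k + k.T) / 2  # toy symmetric kinship matrix

# The kinship matrix lives in the same dataset as the cohort assignment,
# indexed over two copies of the samples dimension.
ds = xr.Dataset({"sample_kinship": (("samples_0", "samples_1"), kinship)})
cohort = np.array([0, 0, 0, 1, 1, 1])

idx0 = np.flatnonzero(cohort == 0)
idx1 = np.flatnonzero(cohort == 1)

# Within-cohort block (cohort 0 vs itself) and among-cohort block (0 vs 1)
# are both simple orthogonal selections on the one matrix.
within_0 = ds.sample_kinship.isel(samples_0=idx0, samples_1=idx0)
among_01 = ds.sample_kinship.isel(samples_0=idx0, samples_1=idx1)
```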

@jeromekelleher
Collaborator Author

Thanks @eric-czech - cohort is an excellent choice. It's clearly about samples, but also quite flexible in that it just means "samples with some shared characteristic".

@jeromekelleher
Collaborator Author

jeromekelleher commented Sep 17, 2020

Some notes on the discussion from today's call

Output format

When we have n cohorts and are computing some subset of the n choose 2 pairs (for pairwise functions like Fst), there are two natural ways in which we might structure the output:

  1. Add n choose 2 variables to the output dataset, which have names like stats_Fst_0_1, stats_Fst_0_2, etc; or
  2. Return a single variable stats_Fst which is a square matrix, such that stats_Fst[0, 1] is the value for cohort ID 0 and ID 1.

Option 2 seems more elegant and would have a lot of advantages.
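A rough sketch of option 2 (the cohorts_0/cohorts_1 dimension names and the Fst values here are made up for illustration):

```python
import numpy as np
import xarray as xr

n_cohorts = 3
fst = np.full((n_cohorts, n_cohorts), np.nan)
# Fill in made-up pairwise values, mirrored so stats_Fst[i, j] == stats_Fst[j, i];
# the diagonal (a cohort against itself) stays NaN.
for (i, j), v in zip([(0, 1), (0, 2), (1, 2)], [0.01, 0.05, 0.03]):
    fst[i, j] = fst[j, i] = v

ds = xr.Dataset({"stats_Fst": (("cohorts_0", "cohorts_1"), fst)})
print(float(ds.stats_Fst[0, 1]))  # 0.01
```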

Cohort naming and metadata

It's error-prone to force users into using integer IDs as the cohort identifiers - sometimes we'd much prefer to have names like "CEU", as in the 1000 Genomes populations. I propose that we have a cohort "table" stored in the dataset, which is essentially two arrays of length n (for n cohorts): name and metadata. The name array is an array of strings, and should contain unique strings which identify each cohort; I'd advocate for requiring names to be valid Python identifiers. The metadata array is for storing whatever the user wants to record about each cohort, so I guess it should be a generic object array? We should also consider having a description for each cohort, just so the datasets can be self-documenting.
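As a sketch of what that table could look like (all variable names and contents here are hypothetical, not an established convention):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset()
# name: unique strings identifying each cohort (valid Python identifiers)
ds["cohort_name"] = xr.DataArray(np.array(["CEU", "YRI", "CHB"]), dims="cohorts")
# description: free text, so the dataset is self-documenting
ds["cohort_description"] = xr.DataArray(
    np.array(["Utah residents", "Yoruba in Ibadan", "Han Chinese in Beijing"]),
    dims="cohorts",
)
# metadata: generic object array for arbitrary per-cohort values
ds["cohort_metadata"] = xr.DataArray(
    np.array([{"source": "1000G"}] * 3, dtype=object), dims="cohorts"
)

# Resolving a name back to an integer cohort ID:
cohort_id = int(np.flatnonzero(ds.cohort_name.values == "YRI")[0])
print(cohort_id)  # 1
```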

Specifying cohort subsets

By default, functions which operate over cohorts should perform all cohort comparisons. For example, in Fst, by default we should compute all pairwise Fst values between cohorts. We won't always want to do this, though, so some way of subsetting this is needed.

We could have something like this rough sketch:

def Fst(ds, cohorts=None):
    if cohorts is None:
        # All pairs of integer IDs: [(0, 1), (0, 2), ...]
        cohorts = list(itertools.combinations(range(ds.num_cohorts), 2))
    else:
        # The input must be a list of tuples, each specifying a pair of either
        # integer cohort IDs or string cohort names
        parsed_cohorts = []
        for pair in cohorts:
            parsed_pair = []
            for cohort_ref in pair:
                if isinstance(cohort_ref, int):
                    # It's already an int, so great.
                    parsed_pair.append(cohort_ref)
                else:
                    # We assume this is a string cohort name which can be
                    # mapped to an integer ID
                    parsed_pair.append(ds.get_cohort_id(cohort_ref))
            parsed_cohorts.append(tuple(parsed_pair))
        cohorts = parsed_cohorts
    # Then compute Fst between the pairs in cohorts and return a k x k matrix.
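The ds.get_cohort_id helper in the sketch above is hypothetical; a standalone sketch of the same name-to-ID resolution, assuming a cohort_name array as proposed earlier, might look like:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"cohort_name": ("cohorts", np.array(["CEU", "YRI", "CHB"]))})

def get_cohort_id(ds, ref):
    """Resolve an integer cohort ID or string cohort name to an integer ID."""
    if isinstance(ref, int):
        return ref
    matches = np.flatnonzero(ds.cohort_name.values == ref)
    if len(matches) != 1:
        raise KeyError(f"unknown cohort name: {ref!r}")
    return int(matches[0])

# Mixed names and IDs in the user-supplied pairs all normalize to ID pairs.
pairs = [("CEU", "YRI"), (0, 2)]
parsed = [tuple(get_cohort_id(ds, r) for r in pair) for pair in pairs]
print(parsed)  # [(0, 1), (0, 2)]
```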

@hammer
Contributor

hammer commented Oct 15, 2020

Should we close this out with #260 merged, or is there more to discuss?

@tomwhite
Collaborator

We still need to implement cohort subsets (and add some docs), so I'd like to leave this open until that's done.

@tomwhite
Collaborator

Closing this now that #394 and #404 have been merged.
