Method for grouping samples which is understood by library functions #224
Using … Renaming dimensions to meet the requirements of our functions isn't hard either, so this choice isn't that important IMO, but I would mildly favor something like …

As far as actual computations are concerned, I think statistics that can be computed for cohorts independently from one another are a great fit for the … For something like computing pairwise stats for sample groups by chromosome, this is the most idiomatic approach I can think of if the sample sizes aren't equal and there is no guarantee on sorting:

```python
import itertools

import numpy as np
import xarray as xr

from sgkit.testing import simulate_genotype_call_dataset

# Simulate 2 contigs
ds = simulate_genotype_call_dataset(n_variant=100, n_sample=30, n_contig=2, seed=0)

# Create 3 arbitrary cohorts
ds['sample_cohort'] = xr.DataArray(np.repeat([0, 1, 2], ds.dims['samples'] // 3), dims='samples')

def fn2(g):
    g1, g2 = g
    print(f'Comparing cohorts {g1[0]} <-> {g2[0]} for contig {g1[1].variant_contig.item(0)}')
    # Produce a single scalar
    return (g1[1].call_genotype - g2[1].call_genotype).mean()

def fn1(g):
    groups = itertools.combinations(g.groupby('sample_cohort'), 2)
    return xr.concat(map(fn2, groups), dim='cohort_pairs')

ds.groupby('variant_contig').map(fn1)
```
```
Comparing cohorts 0 <-> 1 for contig 0
Comparing cohorts 0 <-> 2 for contig 0
Comparing cohorts 1 <-> 2 for contig 0
Comparing cohorts 0 <-> 1 for contig 1
Comparing cohorts 0 <-> 2 for contig 1
Comparing cohorts 1 <-> 2 for contig 1
<xarray.DataArray 'call_genotype' (variant_contig: 2, cohort_pairs: 3)>
array([[ 0.047,  0.004, -0.043],
       [ 0.008, -0.005, -0.013]])
```

I doubt that will scale well out-of-core, but I'm not sure what would.
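One direction that might scale better (a sketch only, not an sgkit API — the `cohort_sum` helper name is hypothetical): a statistic that reduces over samples within each cohort can be phrased as a matrix product with a one-hot cohort-indicator matrix. That avoids `groupby` entirely, handles unequal and unsorted cohorts, and maps naturally onto dask chunking along the variants dimension.

```python
import numpy as np

def cohort_sum(values, cohorts, n_cohorts):
    """Per-cohort sums via a one-hot indicator matrix (hypothetical helper).

    values: (n_variant, n_sample) array
    cohorts: (n_sample,) integer cohort labels
    """
    # indicator[i, c] == 1 iff sample i belongs to cohort c
    indicator = np.eye(n_cohorts)[cohorts]   # (n_sample, n_cohorts)
    return values @ indicator                # (n_variant, n_cohorts)

rng = np.random.default_rng(0)
values = rng.integers(0, 3, size=(4, 6)).astype(float)
cohorts = np.array([0, 0, 1, 1, 2, 2])
out = cohort_sum(values, cohorts, 3)

# Cross-check against a plain per-cohort loop
expected = np.stack([values[:, cohorts == c].sum(axis=1) for c in range(3)], axis=1)
assert np.allclose(out, expected)
```

Cohort means (and many downstream statistics) follow by also matmul-ing a ones array to get per-cohort counts, so the whole computation stays in chunked linear algebra.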
+1 for grouping within a single dataset; it will simplify some potentially complex operations.
Thanks @eric-czech -
Some notes on the discussion from today's call:

**Output format**

When we have n cohorts and are computing some subset of the n choose 2 pairs (for pairwise functions like Fst), there are two natural ways in which we might structure the output:
Option 2 seems more elegant and would have a lot of advantages.

**Cohort naming and metadata**

It's error-prone to force users into using integer IDs as the cohort identifiers; sometimes we'd much prefer to have names like "CEU", as in the 1000G populations. I propose that we have a cohort "table" that is stored in the dataset, which is essentially two arrays of length n (for n cohorts): …

**Specifying cohort subsets**

By default, functions which operate over cohorts should perform all cohort comparisons. For example, in Fst, by default we should compute all pairwise Fst values between cohorts. We won't always want to do this, though, so some way of subsetting is needed. We could have something like this rough sketch:

```python
def Fst(ds, cohorts=None):
    if cohorts is None:
        # Pairs of integer IDs, [(0, 1), (0, 2), ...]
        cohorts = list(itertools.combinations(range(ds.num_cohorts), 2))
    else:
        # The input must be a list of tuples, each specifying a pair of either
        # integer cohort IDs or string cohort names
        parsed_cohorts = []
        for pair in cohorts:
            parsed_pair = []
            for cohort_ref in pair:
                if isinstance(cohort_ref, int):
                    # It's already an int, so great.
                    parsed_pair.append(cohort_ref)
                else:
                    # We assume this is a string cohort name which can be mapped to an integer ID
                    parsed_pair.append(ds.get_cohort_id(cohort_ref))
            parsed_cohorts.append(parsed_pair)
        cohorts = parsed_cohorts
    # Then compute Fst between the pairs in cohorts and return a k x k matrix.
```
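A standalone, runnable version of the parsing step in that sketch (the `name_to_id` dict stands in for the proposed `ds.get_cohort_id` accessor, which doesn't exist yet):

```python
import itertools

def parse_cohort_pairs(pairs, name_to_id, n_cohorts):
    """Normalize cohort-pair specs to pairs of integer cohort IDs.

    pairs: None (meaning all n choose 2 pairs) or a list of 2-tuples of
    integer IDs and/or string cohort names. name_to_id maps names to IDs.
    """
    if pairs is None:
        # All unordered pairs of distinct cohorts
        return list(itertools.combinations(range(n_cohorts), 2))
    parsed = []
    for a, b in pairs:
        parsed.append(tuple(
            c if isinstance(c, int) else name_to_id[c] for c in (a, b)
        ))
    return parsed

name_to_id = {"CEU": 0, "YRI": 1, "CHB": 2}
assert parse_cohort_pairs(None, name_to_id, 3) == [(0, 1), (0, 2), (1, 2)]
assert parse_cohort_pairs([("CEU", "YRI"), (0, 2)], name_to_id, 3) == [(0, 1), (0, 2)]
```

Mixed int/name pairs fall out of the per-element check, which matches the nested loop in the sketch above.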
Should we close this out with #260 merged, or is there more to discuss?
We still need to implement cohort subsets (and add some docs), so I'd like to leave this open until that's done.
Population structure analysis depends on us being able to assign labels to samples. In this thread we discussed the possibility of using individual datasets, one for each group, but grouping by a categorical variable is much more flexible and idiomatic.
We need to develop some conventions which allow the user to do some standard grouping of samples, and functions which use these groupings (`Fst` or `divergence`, for example) should understand these conventions and update the output dataset accordingly.

It may also be worth thinking about how we might group the variants dimension while we're at it. For example, we might want to get the pairwise Fst values for all pairs of populations, for all chromosomes in a dataset. It would be super-nice if we had an idiomatic way of running these calculations.
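For concreteness, the double grouping described here (samples by cohort, variants by chromosome) can be sketched in plain NumPy; the variable names and the toy statistic are hypothetical, not a proposed API:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
calls = rng.integers(0, 2, size=(100, 30))   # (variants, samples)
contigs = np.repeat([0, 1], 50)              # variant -> contig label
cohorts = np.repeat([0, 1, 2], 10)           # sample -> cohort label

result = {}
for contig in np.unique(contigs):
    block = calls[contigs == contig]         # variants on this contig
    for c1, c2 in itertools.combinations(np.unique(cohorts), 2):
        # Toy pairwise statistic: difference of cohort means
        # (a real Fst estimator would go here)
        stat = block[:, cohorts == c1].mean() - block[:, cohorts == c2].mean()
        result[(contig, c1, c2)] = stat

# 2 contigs x C(3, 2) cohort pairs
assert len(result) == 6
```

Whatever conventions we pick should let a library function produce this (contig, cohort-pair) structure without the user writing the loops themselves.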
@eric-czech - you had some concrete ideas on this in the earlier thread, what are your thoughts?