Method for grouping samples which is understood by library functions #224

Closed
jeromekelleher opened this issue Sep 1, 2020 · 7 comments
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.

Comments

@jeromekelleher
Collaborator

Population structure analysis depends on us being able to assign labels to samples. In this thread we discussed the possibility of using individual datasets, one for each group, but grouping by a categorical variable is much more flexible and idiomatic.

We need to develop some conventions which allow the user to do standard grouping of samples, and functions which use these groupings (Fst or divergence, for example) should understand these conventions and update the output dataset accordingly.

It may also be worth thinking about how we might group the variants dimension while we're thinking about it. For example, we might want to get the pairwise Fst values for all pairs of populations, for all chromosomes in a dataset. It would be super-nice if we had an idiomatic way of running these calculations.

@eric-czech - you had some concrete ideas on this in the earlier thread, what are your thoughts?

@hammer hammer added the core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. label Sep 1, 2020
@eric-czech
Collaborator

eric-czech commented Sep 1, 2020

@eric-czech - you had some concrete ideas on this in the earlier thread, what are your thoughts?

Using sample_group as the default variable name along with a groups dimension would align nicely with our other naming conventions. I don't see too much harm in appropriating the "groups" name for the samples dimension, since grouping will be less common for variants. It would be nice if there were something more specific like "populations", since up until now dimension names have been fairly unambiguous. Using "groups" as the dimension name would strengthen the case for making both variable and dimension names configurable.

Renaming dimensions to meet the requirements of our functions isn't hard either so this choice isn't that important IMO, but I would mildly favor something like cohorts (variable = sample_cohort) rather than groups. Do you see any issues with that @jeromekelleher?

As far as actual computations are concerned, I think statistics that can be computed for cohorts independently from one another are a great fit for the .groupby.map idiom.
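As a toy illustration of that idiom (the dataset and the sample_value variable here are invented for the example, not sgkit conventions):

```python
import numpy as np
import xarray as xr

# Toy dataset: 6 samples split into 2 cohorts, with one per-sample variable.
ds = xr.Dataset(
    {"sample_value": ("samples", np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))}
)
ds["sample_cohort"] = xr.DataArray(np.array([0, 0, 0, 1, 1, 1]), dims="samples")

# Statistics that are independent across cohorts fit the groupby/map idiom:
# each cohort's subset of the dataset is passed to the function separately,
# and the results are concatenated along a new sample_cohort dimension.
per_cohort_mean = ds.groupby("sample_cohort").map(lambda g: g.mean())
print(per_cohort_mean.sample_value.values)  # [2. 5.]
```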

For something like computing pairwise stats for sample groups by chromosome, this is the most idiomatic approach I can think of if the sample sizes aren't equal and there is no guarantee on sorting:

import itertools
import xarray as xr
import numpy as np
from sgkit.testing import simulate_genotype_call_dataset

# Simulate 2 contigs
ds = simulate_genotype_call_dataset(n_variant=100, n_sample=30, n_contig=2, seed=0)
# Create 3 arbitrary cohorts
ds['sample_cohort'] = xr.DataArray(np.repeat([0, 1, 2], ds.dims['samples'] // 3), dims='samples')

def fn2(g):
    g1, g2 = g
    print(f'Comparing cohorts {g1[0]} <-> {g2[0]} for contig {g1[1].variant_contig.item(0)}')
    # Produce a single scalar
    return (g1[1].call_genotype - g2[1].call_genotype).mean()

def fn1(g):
    groups = itertools.combinations(g.groupby('sample_cohort'), 2)
    return xr.concat(map(fn2, groups), dim='cohort_pairs')

ds.groupby('variant_contig').map(fn1)

Comparing cohorts 0 <-> 1 for contig 0
Comparing cohorts 0 <-> 2 for contig 0
Comparing cohorts 1 <-> 2 for contig 0
Comparing cohorts 0 <-> 1 for contig 1
Comparing cohorts 0 <-> 2 for contig 1
Comparing cohorts 1 <-> 2 for contig 1

<xarray.DataArray 'call_genotype' (variant_contig: 2, cohort_pairs: 3)>
array([[ 0.047,  0.004, -0.043],
       [ 0.008, -0.005, -0.013]])

I doubt that will scale well out-of-core but I'm not sure what would.

@timothymillar
Collaborator

+1 for grouping within a single dataset, it will simplify some potentially complex operations.
For example, if a dataset containing multiple sample groups includes a kinship matrix of shape (samples, samples), it is trivial to retrieve both the within-group and the among-group values.
If there is a dataset per group, then it is unclear how to include pairwise kinship among samples of different groups.
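A small sketch of that point, with an invented kinship variable and cohort assignment (none of these names are established conventions):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
n = 6
k = rng.random((n, n))
kinship = (k + k.T) / 2  # toy symmetric kinship matrix

# The kinship matrix lives in the same dataset as the cohort assignment,
# indexed over two copies of the samples dimension.
ds = xr.Dataset({"sample_kinship": (("samples_0", "samples_1"), kinship)})
cohort = np.array([0, 0, 0, 1, 1, 1])

idx0 = np.flatnonzero(cohort == 0)
idx1 = np.flatnonzero(cohort == 1)

# Within-cohort block (cohort 0 vs itself) and among-cohort block (0 vs 1)
# are both simple orthogonal selections on the one matrix.
within_0 = ds.sample_kinship.isel(samples_0=idx0, samples_1=idx0)
among_01 = ds.sample_kinship.isel(samples_0=idx0, samples_1=idx1)
```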

@jeromekelleher
Collaborator Author

Thanks @eric-czech - cohort is an excellent choice. It's clearly about samples, but also quite flexible in that it just means "samples with some shared characteristic".

@jeromekelleher
Collaborator Author

jeromekelleher commented Sep 17, 2020

Some notes on the discussion from today's call

Output format

When we have n cohorts and are computing some subset of the n choose 2 pairs (for pairwise functions like Fst), there are two natural ways in which we might structure the output:

  1. Add n choose 2 variables to the output dataset, which have names like stats_Fst_0_1, stats_Fst_0_2, etc; or
  2. Return a single variable stats_Fst which is a square matrix, such that stats_Fst[0, 1] is the value for cohort ID 0 and ID 1.

Option 2 seems more elegant and would have a lot of advantages.
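A rough sketch of option 2 (the cohorts_0/cohorts_1 dimension names and the Fst values here are made up for illustration):

```python
import numpy as np
import xarray as xr

n_cohorts = 3
fst = np.full((n_cohorts, n_cohorts), np.nan)
# Fill in made-up pairwise values, mirrored so stats_Fst[i, j] == stats_Fst[j, i];
# the diagonal (a cohort against itself) stays NaN.
for (i, j), v in zip([(0, 1), (0, 2), (1, 2)], [0.01, 0.05, 0.03]):
    fst[i, j] = fst[j, i] = v

ds = xr.Dataset({"stats_Fst": (("cohorts_0", "cohorts_1"), fst)})
print(float(ds.stats_Fst[0, 1]))  # 0.01
```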

Cohort naming and metadata

It's error-prone to force users into using integer IDs as the cohort identifiers - sometimes we'd much prefer to have names like "CEU", as in the 1000 Genomes populations. I propose that we have a cohort "table" stored in the dataset, which is essentially two arrays of length n (for n cohorts): name and metadata. The name array is an array of strings, and should contain unique strings which identify each cohort; I'd advocate for requiring names to be valid Python identifiers. The metadata array is for storing whatever the user wants to record about each cohort, so I guess it should be a generic object array? We should also consider having a description for each cohort, just so the datasets can be self-documenting.
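As a sketch of what that table could look like (all variable names and contents here are hypothetical, not an established convention):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset()
# name: unique strings identifying each cohort (valid Python identifiers)
ds["cohort_name"] = xr.DataArray(np.array(["CEU", "YRI", "CHB"]), dims="cohorts")
# description: free text, so the dataset is self-documenting
ds["cohort_description"] = xr.DataArray(
    np.array(["Utah residents", "Yoruba in Ibadan", "Han Chinese in Beijing"]),
    dims="cohorts",
)
# metadata: generic object array for arbitrary per-cohort values
ds["cohort_metadata"] = xr.DataArray(
    np.array([{"source": "1000G"}] * 3, dtype=object), dims="cohorts"
)

# Resolving a name back to an integer cohort ID:
cohort_id = int(np.flatnonzero(ds.cohort_name.values == "YRI")[0])
print(cohort_id)  # 1
```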

Specifying cohort subsets

By default, functions which operate over cohorts should perform all cohort comparisons. For example, in Fst, by default we should compute all pairwise Fst values between cohorts. We won't always want to do this, though, so some way of subsetting this is needed.

We could have something like this rough sketch:

def Fst(ds, cohorts=None):
    if cohorts is None:
        # All pairs of integer IDs: [(0, 1), (0, 2), ...]
        cohorts = list(itertools.combinations(range(ds.num_cohorts), 2))
    else:
        # The input must be a list of tuples, each specifying a pair of either
        # integer cohort IDs or string cohort names
        parsed_cohorts = []
        for pair in cohorts:
            parsed_pair = []
            for cohort_ref in pair:
                if isinstance(cohort_ref, int):
                    # It's already an int, so great.
                    parsed_pair.append(cohort_ref)
                else:
                    # We assume this is a string cohort name which can be
                    # mapped to an integer ID
                    parsed_pair.append(ds.get_cohort_id(cohort_ref))
            parsed_cohorts.append(tuple(parsed_pair))
        cohorts = parsed_cohorts
    # Then compute Fst between the pairs in cohorts and return a k x k matrix.
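The ds.get_cohort_id helper in the sketch above is hypothetical; a standalone sketch of the same name-to-ID resolution, assuming a cohort_name array as proposed earlier, might look like:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"cohort_name": ("cohorts", np.array(["CEU", "YRI", "CHB"]))})

def get_cohort_id(ds, ref):
    """Resolve an integer cohort ID or string cohort name to an integer ID."""
    if isinstance(ref, int):
        return ref
    matches = np.flatnonzero(ds.cohort_name.values == ref)
    if len(matches) != 1:
        raise KeyError(f"unknown cohort name: {ref!r}")
    return int(matches[0])

# Mixed names and IDs in the user-supplied pairs all normalize to ID pairs.
pairs = [("CEU", "YRI"), (0, 2)]
parsed = [tuple(get_cohort_id(ds, r) for r in pair) for pair in pairs]
print(parsed)  # [(0, 1), (0, 2)]
```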

@hammer
Contributor

hammer commented Oct 15, 2020

Should we close this out with #260 merged, or is there more to discuss?

@tomwhite
Collaborator

We still need to implement cohort subsets (and add some docs), so I'd like to leave this open until that's done.

@tomwhite
Collaborator

Closing this now that #394 and #404 have been merged.
