Windowing along the genome #229

jeromekelleher · 2020-09-02T14:10:05Z

Computing statistics in windows along the genome is a basic requirement. For example, users will often want to compute Fst values in (say) 100kb chunks along the genome. I think there's two basic approaches we could take here:

Add a "windows" argument to all functions

Add an argument windows to all of the functions in which this would make sense, which would follow a similar logic to how tskit does it. I think this is better than the approach taken in scikit-allel, where we have functions like windowed_patterson_fst.

To make things concrete, here's a rough idea of what Fst would look like:

def parse_windows(ds, windows)
      L = largest_variant_position(ds) + 1 # assuming we can get this
      if windows is None:
          # By default we have one window
           windows = [0, L]    
      elif isinstance(windows, str):
            # Obviously this is stupid and we'd parse out the suffix
            if windows == "100kb":
                step_size = 100_000
           else:
                 raise ValueError("Unrecognised window description")
           windows = np.arange(0, L, step=step_size)
     elif isinstance(windows, int):   # This is brittle and we'd probably do something better in reality
          # Interpret the argument as asking for n windows
          windows = np.linspace(0, L, num=windows)
     # Otherwise, assume that windows is a 1D array of n values describing n intervals along the genome.
     return windows
     

def Fst(ds, *, windows=None):
      windows = parse_windows
      # compute Fst in variants in these windows and return an array/scalar depending on the input.

So, in the most general case, windows is a 1D array of n coordinates describing n - 1 intervals along the genome. The intervals are half-open (left-inclusive, right-exclusive). We also provide some other handy ways of specifying common types of windows by interpreting different argument types differently. These bells and whistles probably aren't necessary initially, though.

Store stats and window afterwards

The previous formulation assumes that what we're returning is a single value or numpy array, but this is somewhat at odds with current thinking. In #103 we are assuming that we update the input data set with a variable for each thing that we compute on the dataset. Thus, an alternative way we might do things is that we compute the value of Fst for each variant and store the Fst array in the result dataset. Then, windowing is something that we do afterwards, by aggegating values. E.g.,

ds = sgkit.Fst(ds)  # ds now contains a "stats_Fst" variable, or similar
windowed_ds = sgkit.window_stats(ds, "100kb")

I guess this latter approach is a more flexible and idiomatic?

The text was updated successfully, but these errors were encountered:

jeromekelleher · 2020-09-02T14:36:49Z

We should also consider rolling windows, as this seems to be needed for things in PBS (#230)

eric-czech · 2020-09-02T16:45:40Z

FYI LD pruning (https://github.com/pystatgen/sgkit/issues/31#issuecomment-671324407) has similar requirements and the solution I had implemented for computing rolling base pair (or fixed) with some step size was to:

Have some compiled function produce a dataset containing the start and stop variant indexes as well as values necessary for specifying uneven overlapping chunked computations
Write custom dask delayed tasks that partition blocks based on the chunk windows implied above and feed the individual variant slice data to the computation on that chunk

Unfortunately, there's nothing in the Dask API for windows that aren't fixed in size so custom tasks are likely unavoidable.

eric-czech · 2020-09-02T16:59:51Z

One approach to this that I think could generalize well in the context of Dask is to have a function similar to hail.locus_interval. We have the additional wrinkle of needing to worry about the chunking so I had made something like this:

# `window` = base pairs or numbers of variants
variant_intervals, chunk_intervals = api.axis_intervals(ds, window=100_000, unit='physical')
# `group` = chromosome
# `index` = center of window
# `start` = left side of window
# `stop` = right side of window
variant_intervals.to_dataset('var').to_dataframe().sample(10, random_state=1)

The chunk interval data is similar, but also needed to be aware of the number of windows within a chunk in order to pre-allocate array results when using a GPU kernel to process a chunk.

Anyways, that could be a good way to orient the discussion for windowing since I think a specific function for it would be ideal rather than trying to bake it into individual methods.

+1 to something like windowed_ds = sgkit.window_stats(ds, "100kb") that would wrap up the chunking details for sure.

tomwhite · 2020-09-03T14:27:48Z

So, in the most general case, windows is a 1D array of n coordinates describing n - 1 intervals along the genome.

Is there a non-contiguous case where some regions are ignored, or where windows overlap? In which case, you'd have 1D arrays of window_starts and window_ends (or window_sizes).

jeromekelleher · 2020-09-03T14:36:17Z

It's a good point @tomwhite. What do we do with centromeres, say? Would it be better to mask these out afterwards?

I think for the plotting-things-along-the-genome case, contiguous intervals are what we want the vast majority of the time, but maybe we do want something more general.

tomwhite · 2020-09-29T15:13:17Z

an alternative way we might do things is that we compute the value of Fst for each variant and store the Fst array in the result dataset. Then, windowing is something that we do afterwards, by aggegating values.

I'm not sure this will work for Fst, since windowing per-variant Fst values is not the same as computing Fst on windowed diversity and divergence values. And I think the latter is what we actually want (please correct me if I'm wrong).

So while this would mean Fst (at least) would have to take a windows argument, it might still be useful to have a window_stats utility function for other uses.

jeromekelleher · 2020-09-29T15:36:16Z

Hmm, excellent point @tomwhite; you're right, that's what both tskit and scikit-allel are doing for Fst. So, really, there's no option but to have a windows argument there, as you say.

tomwhite · 2020-10-01T14:13:23Z

Thinking about this more, I wonder if windows should be defined as an xarray Dataset variable (or actually a set of variables, for start, stop, etc). There would be a set of functions for creating windows: by fixed number of variants, genomic intervals, accessible bases, bed files - or even user defined ones. Then method functions (like diversity, divergence, Fst, ld_prune) would use a named window variable to do windowing. This would fit nicely into the way we are already doing things, since windows would just be another input variable to these method functions.

jeromekelleher · 2020-10-01T14:58:12Z

Nice idea @tomwhite, I like it! A big advantage of this is the that the coordinates go along with the values that are computed for plotting.

tomwhite · 2020-11-30T09:50:10Z

Closing this now that the basic mechanism is working and documented (#303, #404). There is follow-up work in #341.

jeromekelleher added the core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. label Sep 2, 2020

This was referenced Sep 2, 2020

PBS selection statistic #230

Closed

Genome-wide selection scans #232

Closed

eric-czech mentioned this issue Sep 25, 2020

Port function for calculating genomic windows #281

Closed

tomwhite mentioned this issue Oct 5, 2020

Fst windowing #303

Merged

jeromekelleher mentioned this issue Oct 21, 2020

Flexible specification of windows #341

Closed

tomwhite mentioned this issue Nov 24, 2020

Add docs for windowing and cohorts #404

Merged

tomwhite closed this as completed Nov 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Windowing along the genome #229

Windowing along the genome #229

jeromekelleher commented Sep 2, 2020 •

edited

Loading

jeromekelleher commented Sep 2, 2020 •

edited

Loading

eric-czech commented Sep 2, 2020

eric-czech commented Sep 2, 2020 •

edited

Loading

tomwhite commented Sep 3, 2020

jeromekelleher commented Sep 3, 2020 •

edited

Loading

tomwhite commented Sep 29, 2020

jeromekelleher commented Sep 29, 2020

tomwhite commented Oct 1, 2020

jeromekelleher commented Oct 1, 2020

tomwhite commented Nov 30, 2020

Windowing along the genome #229

Windowing along the genome #229

Comments

jeromekelleher commented Sep 2, 2020 • edited Loading

Add a "windows" argument to all functions

Store stats and window afterwards

jeromekelleher commented Sep 2, 2020 • edited Loading

eric-czech commented Sep 2, 2020

eric-czech commented Sep 2, 2020 • edited Loading

tomwhite commented Sep 3, 2020

jeromekelleher commented Sep 3, 2020 • edited Loading

tomwhite commented Sep 29, 2020

jeromekelleher commented Sep 29, 2020

tomwhite commented Oct 1, 2020

jeromekelleher commented Oct 1, 2020

tomwhite commented Nov 30, 2020

jeromekelleher commented Sep 2, 2020 •

edited

Loading

jeromekelleher commented Sep 2, 2020 •

edited

Loading

eric-czech commented Sep 2, 2020 •

edited

Loading

jeromekelleher commented Sep 3, 2020 •

edited

Loading