Skip to content

Windowing along the genome #229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jeromekelleher opened this issue Sep 2, 2020 · 10 comments
Closed

Windowing along the genome #229

jeromekelleher opened this issue Sep 2, 2020 · 10 comments
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.

Comments

@jeromekelleher
Copy link
Collaborator

jeromekelleher commented Sep 2, 2020

Computing statistics in windows along the genome is a basic requirement. For example, users will often want to compute Fst values in (say) 100kb chunks along the genome. I think there's two basic approaches we could take here:

Add a "windows" argument to all functions

Add an argument windows to all of the functions in which this would make sense, which would follow a similar logic to how tskit does it. I think this is better than the approach taken in scikit-allel, where we have functions like windowed_patterson_fst.

To make things concrete, here's a rough idea of what Fst would look like:

def parse_windows(ds, windows)
      L = largest_variant_position(ds) + 1 # assuming we can get this
      if windows is None:
          # By default we have one window
           windows = [0, L]    
      elif isinstance(windows, str):
            # Obviously this is stupid and we'd parse out the suffix
            if windows == "100kb":
                step_size = 100_000
           else:
                 raise ValueError("Unrecognised window description")
           windows = np.arange(0, L, step=step_size)
     elif isinstance(windows, int):   # This is brittle and we'd probably do something better in reality
          # Interpret the argument as asking for n windows
          windows = np.linspace(0, L, num=windows)
     # Otherwise, assume that windows is a 1D array of n values describing n intervals along the genome.
     return windows
     

def Fst(ds, *, windows=None):
      windows = parse_windows
      # compute Fst in variants in these windows and return an array/scalar depending on the input.

So, in the most general case, windows is a 1D array of n coordinates describing n - 1 intervals along the genome. The intervals are half-open (left-inclusive, right-exclusive). We also provide some other handy ways of specifying common types of windows by interpreting different argument types differently. These bells and whistles probably aren't necessary initially, though.

Store stats and window afterwards

The previous formulation assumes that what we're returning is a single value or numpy array, but this is somewhat at odds with current thinking. In #103 we are assuming that we update the input data set with a variable for each thing that we compute on the dataset. Thus, an alternative way we might do things is that we compute the value of Fst for each variant and store the Fst array in the result dataset. Then, windowing is something that we do afterwards, by aggegating values. E.g.,

ds = sgkit.Fst(ds)  # ds now contains a "stats_Fst" variable, or similar
windowed_ds = sgkit.window_stats(ds, "100kb")

I guess this latter approach is a more flexible and idiomatic?

@jeromekelleher jeromekelleher added the core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. label Sep 2, 2020
This was referenced Sep 2, 2020
@jeromekelleher
Copy link
Collaborator Author

jeromekelleher commented Sep 2, 2020

We should also consider rolling windows, as this seems to be needed for things in PBS (#230)

@eric-czech
Copy link
Collaborator

FYI LD pruning (https://github.com/pystatgen/sgkit/issues/31#issuecomment-671324407) has similar requirements and the solution I had implemented for computing rolling base pair (or fixed) with some step size was to:

  1. Have some compiled function produce a dataset containing the start and stop variant indexes as well as values necessary for specifying uneven overlapping chunked computations
  2. Write custom dask delayed tasks that partition blocks based on the chunk windows implied above and feed the individual variant slice data to the computation on that chunk

Unfortunately, there's nothing in the Dask API for windows that aren't fixed in size so custom tasks are likely unavoidable.

@eric-czech
Copy link
Collaborator

eric-czech commented Sep 2, 2020

One approach to this that I think could generalize well in the context of Dask is to have a function similar to hail.locus_interval. We have the additional wrinkle of needing to worry about the chunking so I had made something like this:

# `window` = base pairs or numbers of variants
variant_intervals, chunk_intervals = api.axis_intervals(ds, window=100_000, unit='physical')
# `group` = chromosome
# `index` = center of window
# `start` = left side of window
# `stop` = right side of window
variant_intervals.to_dataset('var').to_dataframe().sample(10, random_state=1)

Screen Shot 2020-09-02 at 12 54 06 PM

The chunk interval data is similar, but also needed to be aware of the number of windows within a chunk in order to pre-allocate array results when using a GPU kernel to process a chunk.

Anyways, that could be a good way to orient the discussion for windowing since I think a specific function for it would be ideal rather than trying to bake it into individual methods.

+1 to something like windowed_ds = sgkit.window_stats(ds, "100kb") that would wrap up the chunking details for sure.

@tomwhite
Copy link
Collaborator

tomwhite commented Sep 3, 2020

So, in the most general case, windows is a 1D array of n coordinates describing n - 1 intervals along the genome.

Is there a non-contiguous case where some regions are ignored, or where windows overlap? In which case, you'd have 1D arrays of window_starts and window_ends (or window_sizes).

@jeromekelleher
Copy link
Collaborator Author

jeromekelleher commented Sep 3, 2020

It's a good point @tomwhite. What do we do with centromeres, say? Would it be better to mask these out afterwards?

I think for the plotting-things-along-the-genome case, contiguous intervals are what we want the vast majority of the time, but maybe we do want something more general.

@tomwhite
Copy link
Collaborator

an alternative way we might do things is that we compute the value of Fst for each variant and store the Fst array in the result dataset. Then, windowing is something that we do afterwards, by aggegating values.

I'm not sure this will work for Fst, since windowing per-variant Fst values is not the same as computing Fst on windowed diversity and divergence values. And I think the latter is what we actually want (please correct me if I'm wrong).

So while this would mean Fst (at least) would have to take a windows argument, it might still be useful to have a window_stats utility function for other uses.

@jeromekelleher
Copy link
Collaborator Author

Hmm, excellent point @tomwhite; you're right, that's what both tskit and scikit-allel are doing for Fst. So, really, there's no option but to have a windows argument there, as you say.

@tomwhite
Copy link
Collaborator

tomwhite commented Oct 1, 2020

Thinking about this more, I wonder if windows should be defined as an xarray Dataset variable (or actually a set of variables, for start, stop, etc). There would be a set of functions for creating windows: by fixed number of variants, genomic intervals, accessible bases, bed files - or even user defined ones. Then method functions (like diversity, divergence, Fst, ld_prune) would use a named window variable to do windowing. This would fit nicely into the way we are already doing things, since windows would just be another input variable to these method functions.

@jeromekelleher
Copy link
Collaborator Author

Nice idea @tomwhite, I like it! A big advantage of this is the that the coordinates go along with the values that are computed for plotting.

@tomwhite
Copy link
Collaborator

Closing this now that the basic mechanism is working and documented (#303, #404). There is follow-up work in #341.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.
Projects
None yet
Development

No branches or pull requests

3 participants