-
Notifications
You must be signed in to change notification settings - Fork 35
Windowing along the genome #229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
We should also consider rolling windows, as this seems to be needed for things in PBS (#230) |
FYI LD pruning (https://github.com/pystatgen/sgkit/issues/31#issuecomment-671324407) has similar requirements and the solution I had implemented for computing rolling base pair (or fixed) with some step size was to:
Unfortunately, there's nothing in the Dask API for windows that aren't fixed in size so custom tasks are likely unavoidable. |
One approach to this that I think could generalize well in the context of Dask is to have a function similar to hail.locus_interval. We have the additional wrinkle of needing to worry about the chunking so I had made something like this: # `window` = base pairs or numbers of variants
variant_intervals, chunk_intervals = api.axis_intervals(ds, window=100_000, unit='physical')
# `group` = chromosome
# `index` = center of window
# `start` = left side of window
# `stop` = right side of window
variant_intervals.to_dataset('var').to_dataframe().sample(10, random_state=1) The chunk interval data is similar, but also needed to be aware of the number of windows within a chunk in order to pre-allocate array results when using a GPU kernel to process a chunk. Anyways, that could be a good way to orient the discussion for windowing since I think a specific function for it would be ideal rather than trying to bake it into individual methods. +1 to something like |
Is there a non-contiguous case where some regions are ignored, or where windows overlap? In which case, you'd have 1D arrays of |
It's a good point @tomwhite. What do we do with centromeres, say? Would it be better to mask these out afterwards? I think for the plotting-things-along-the-genome case, contiguous intervals are what we want the vast majority of the time, but maybe we do want something more general. |
I'm not sure this will work for Fst, since windowing per-variant Fst values is not the same as computing Fst on windowed diversity and divergence values. And I think the latter is what we actually want (please correct me if I'm wrong). So while this would mean |
Hmm, excellent point @tomwhite; you're right, that's what both tskit and scikit-allel are doing for Fst. So, really, there's no option but to have a |
Thinking about this more, I wonder if windows should be defined as an xarray Dataset variable (or actually a set of variables, for start, stop, etc). There would be a set of functions for creating windows: by fixed number of variants, genomic intervals, accessible bases, bed files - or even user defined ones. Then method functions (like diversity, divergence, Fst, ld_prune) would use a named window variable to do windowing. This would fit nicely into the way we are already doing things, since windows would just be another input variable to these method functions. |
Nice idea @tomwhite, I like it! A big advantage of this is the that the coordinates go along with the values that are computed for plotting. |
Computing statistics in windows along the genome is a basic requirement. For example, users will often want to compute Fst values in (say) 100kb chunks along the genome. I think there's two basic approaches we could take here:
Add a "windows" argument to all functions
Add an argument
windows
to all of the functions in which this would make sense, which would follow a similar logic to how tskit does it. I think this is better than the approach taken in scikit-allel, where we have functions likewindowed_patterson_fst
.To make things concrete, here's a rough idea of what Fst would look like:
So, in the most general case,
windows
is a 1D array of n coordinates describing n - 1 intervals along the genome. The intervals are half-open (left-inclusive, right-exclusive). We also provide some other handy ways of specifying common types of windows by interpreting different argument types differently. These bells and whistles probably aren't necessary initially, though.Store stats and window afterwards
The previous formulation assumes that what we're returning is a single value or numpy array, but this is somewhat at odds with current thinking. In #103 we are assuming that we update the input data set with a variable for each thing that we compute on the dataset. Thus, an alternative way we might do things is that we compute the value of
Fst
for each variant and store theFst
array in the result dataset. Then, windowing is something that we do afterwards, by aggegating values. E.g.,I guess this latter approach is a more flexible and idiomatic?
The text was updated successfully, but these errors were encountered: