-
Notifications
You must be signed in to change notification settings - Fork 35
Fst windowing #303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fst windowing #303
Conversation
BTW this is based on #302, which should be merged first. |
Hi @tomwhite, one small thought is that I'll often try computing some statistic with a variety of different window sizes. So we might need some way to specify a name or prefix for each set of windows being added to the same dataset. |
@alimanfoo that's good to know. Another way of achieving this would be to set |
Codecov Report
@@ Coverage Diff @@
## master #303 +/- ##
==========================================
- Coverage 96.53% 96.46% -0.07%
==========================================
Files 28 29 +1
Lines 1962 2039 +77
==========================================
+ Hits 1894 1967 +73
- Misses 68 72 +4
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of storing the window information in the Dataset
!
sgkit/window.py
Outdated
return conditional_merge_datasets(ds, new_ds, merge) | ||
|
||
|
||
def _get_windows(start: int, stop: int, size: int, step: int) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this Any
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to ArrayType
.
) | ||
|
||
|
||
def window_statistic( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any better way to type this to avoid some many Any
s? (this is also overall comment, there is a lot of types avoided via Any
, I understand this is "internal" code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was from when I was prototyping, and I forgot to fix them before submitting. I have now eliminated them all except kwargs
and block_info
, which are harder to provide hints for.
ds.window_start.values, | ||
ds.window_stop.values, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ds.window_{start, stop}
is never lazy right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not at the moment at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be? The loose guideline I have in my head is that anything that doesn't run interactively (<1s or so) across a whole genome dataset (~100M variants) should probably be lazy. Is it worth adding an issue to make this lazy or test that the latter is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #340
sgkit/tests/test_popgen.py
Outdated
ts_div = ts.diversity(span_normalise=False) | ||
np.testing.assert_allclose(div, ts_div) | ||
|
||
|
||
@pytest.mark.parametrize("size", [10]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size
is rather generic, can be so many different things in this context, maybe a better name would be sample_size
? (also below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - especially as we now have sample size and window size in the same test.
positions = [variant.site.position for variant in ts.variants()] | ||
windows = [0] + positions[::25][1:] + [ts.sequence_length] | ||
ts_div = ts.diversity(windows=windows, span_normalise=False) | ||
np.testing.assert_allclose(div, ts_div) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a bit overarching question, not necessary this PR, but since we heavily depend on dask
, I wonder if we should use more of their dask.array.utils.assert_eq
(which under the hood calls allclose
), which afaiu would handle things like calling compute
etc which would potentially make this more "array" idiomatic/agnostic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I opened #332
sgkit/window.py
Outdated
statistic: Callable[..., Any], | ||
size: int, | ||
step: int, | ||
dtype: int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably use use typing.DType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
sgkit/window.py
Outdated
# None means don't rechunk along that axis | ||
new_chunks = tuple(list([new_chunk0] + ([None] * (len(chunk0) - 1)))) # type: ignore | ||
values = values.rechunk(new_chunks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
afaiu you could probably just say: values = values.rechunk({0: new_chunk0})
or sth akin to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - done
sgkit/window.py
Outdated
new_chunks = (tuple(windows_per_chunk),) | ||
else: | ||
# depth is 0 except in first axis | ||
depth = tuple([depth] + ([0] * (values.ndim - 1))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if map_overlap
would work in this case with depth = {0: depth}
(without the need to specify the depth on the other dimensions) 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should work: https://github.com/dask/dask/blob/master/dask/array/overlap.py#L732 is the relevant bit for that (all others default to 0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking - fixed
values_sa = arr | ||
stat_sa = allel.moving_statistic(values_sa, sum_cols, size=size, step=step) | ||
|
||
np.testing.assert_equal(stat, stat_sa) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to add more tests here, there are some interesting functions in the window
package, like _get_chunked_windows
? Plus test for corner cases like empty arrays etc. And maybe also a test for what happens when you window multiple times with merge=True
. And maybe a test case for https://github.com/pystatgen/sgkit/pull/303#issuecomment-705682385
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it'd be good to get some tests in here that are just for the window
function itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added tests for various window functions in pystatgen/sgkit@d57aefd
# Window definition (user code) | ||
|
||
|
||
def window( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about the API, I wonder if we should have a simple function to "remove" the windowing from a dataset? Otherwise we might expect people to pop window_start
/window_stop
vars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would be useful. Opened #333
Thanks for the review @ravwojdyla! I will fix your suggestions given that the overall approach seems to be OK. |
sgkit/window.py
Outdated
|
||
def _get_windows(start: int, stop: int, size: int, step: int) -> Any: | ||
# Find the indexes for the start positions of all windows | ||
# TODO: take contigs into account |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth elevating as an issue? I'm sure it will come up again in the context of LD estimation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #335
stat_sa = allel.moving_statistic(values_sa, np.sum, size=size, step=step) | ||
|
||
np.testing.assert_equal(stat, stat_sa) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a test for dtype preservation when provided float32 inputs would be good for these kinds of functions. I wouldn't say it's critical but PCA and the pairwise metrics will have that behavior so it would be good here too for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an assert to check for dtype preservation here.
… remove for single-window tests).
I've addressed all the feedback, except for the comment about adding more tests for the window module. I'll try to write some more, but otherwise this is probably ready to be merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, but I worry that we're painting ourselves into a bit of a corner with the names of things here, given that this is so tied to variant indexes. How will we deal with windowing in physical distance? I think we should have a good idea how this is going to work and how we're going to name things before finalising this API.
values_sa = arr | ||
stat_sa = allel.moving_statistic(values_sa, sum_cols, size=size, step=step) | ||
|
||
np.testing.assert_equal(stat, stat_sa) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it'd be good to get some tests in here that are just for the window
function itself.
---------- | ||
ds | ||
Genotype call dataset. | ||
size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm not so sure about this API, it feels a bit inflexible. People are going to want to window and plot in terms of physical distance and not just in terms of the number of variants (see the proposal in #229). I see that using the number of variants aligns much more nicely with chunking. Is this a fundamental restriction - will we always need windows defined in this way to have equal numbers of variants? If we did want to do things like "compute Fst in 100Kb windows", is this incompatible with the way we're doing things in Dask?
It's not obvious what the utility of the step
parameter is either - I'd imagine the vast majority of people will want contiguous windows (unless I'm misunderstanding what this does?). It should at least have a default such that users do get contiguous windows and don't need to specify this value all the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. I think we do want something along the lines sketched out in #229, but this was a first step, and I had imagined that it would evolve. Note that the underlying idea of having arrays of window start and stop positions in the implementation is very flexible (since the windows can be arbitrary sizes and at arbitrary positions), it's just that the way users specify them with this API is very constrained.
Regarding the step
parameter, I was following scikit-allel, and implementing just enough to be able to run PBS and H statistics (since sometimes step
!= size
for those). Defaulting step
to size
is a good idea.
Perhaps we could have a single windows
argument that is overloaded to accept either a size, a string (for physical distance), or a windows specification object that could have things like step and size in it. Something like:
def window(ds: Dataset, windows: Union[int, str, WindowsSpec], merge: bool = True) -> Dataset:
...
But I was hoping we could do this in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this idea @tomwhite - how about we create a follow up issue for this so, and get this much merged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #341
@ravwojdyla - have the changes you requested been made? |
Looks great @tomwhite - please add the auto-merge flag when you're happy to merge this (I guess with an issue flagging the current limitations as discussed above) |
This is a proposal for #229, and is also related to #281.
window
function that adds windowing information to a dataset (awindows
dimension). This just adds fixed-size windows at the moment, and doesn't take contigs into account - both could be improved in the future.window_statistic
function (similar scikit-allel'smoving_statistic
) that computes a statistic for each window (using Dask'smap_overlap
function) for use by method functions (i.e. this is not a user-facing function).diversity
,divergence
, andFst
to use windows if present.Sample usage:
Note the
window_start
andwindow_stop
variables of dimensionwindows
, and thestat_Fst
variable whose first dimension iswindows
.