Fst windowing #303

tomwhite · 2020-10-05T14:36:47Z

This is a proposal for #229, and is also related to #281.

Adds a new window function that adds windowing information to a dataset (a windows dimension). This just adds fixed-size windows at the moment, and doesn't take contigs into account - both could be improved in the future.
Adds a window_statistic function (similar scikit-allel's moving_statistic) that computes a statistic for each window (using Dask's map_overlap function) for use by method functions (i.e. this is not a user-facing function).
Updates diversity, divergence, and Fst to use windows if present.

Sample usage:

>>> ds = window(ds, size=25, step=25)
>>> fst_ds = Fst(ds, estimator="Nei")
>>> fst_ds
<xarray.Dataset>
Dimensions:              (alleles: 2, cohorts: 2, cohorts_0: 2, cohorts_1: 2, ploidy: 1, samples: 10, variants: 98, windows: 4)
Coordinates:
  * cohorts_0            (cohorts_0) object 'co_0' 'co_1'
  * cohorts_1            (cohorts_1) object 'co_0' 'co_1'
Dimensions without coordinates: alleles, cohorts, ploidy, samples, variants, windows
Data variables:
    stat_Fst             (windows, cohorts_0, cohorts_1) float64 dask.array<chunksize=(4, 2, 2), meta=np.ndarray>
    cohort_allele_count  (variants, cohorts, alleles) int64 dask.array<chunksize=(98, 2, 2), meta=np.ndarray>
    call_allele_count    (variants, samples, alleles) uint8 dask.array<chunksize=(98, 10, 2), meta=np.ndarray>
    window_start         (windows) int64 0 25 50 75
    window_stop          (windows) int64 25 50 75 98
    variant_contig       (variants) int64 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    variant_position     (variants) int64 0 2 6 8 12 19 ... 193 193 194 196 197
    variant_allele       (variants, alleles) |S1 b'0' b'1' b'0' ... b'0' b'1'
    sample_id            (samples) <U5 'tsk_0' 'tsk_1' ... 'tsk_8' 'tsk_9'
    call_genotype        (variants, samples, ploidy) int8 0 0 0 1 1 ... 1 0 0 1
    call_genotype_mask   (variants, samples, ploidy) bool False False ... False
    sample_cohort        (samples) int32 0 0 0 0 0 1 1 1 1 1
>>> fst = fst_ds["stat_Fst"].values
>>> fst
[[[       nan 0.04802022]
  [0.04802022        nan]]

 [[       nan 0.05591201]
  [0.05591201        nan]]

 [[       nan 0.03881701]
  [0.03881701        nan]]

 [[       nan 0.02708124]
  [0.02708124        nan]]]

Note the window_start and window_stop variables of dimension windows, and the stat_Fst variable whose first dimension is windows.

tomwhite · 2020-10-05T14:38:44Z

BTW this is based on #302, which should be merged first.

alimanfoo · 2020-10-08T16:27:00Z

Hi @tomwhite, one small thought is that I'll often try computing some statistic with a variety of different window sizes. So we might need some way to specify a name or prefix for each set of windows being added to the same dataset.

tomwhite · 2020-10-13T09:24:22Z

@alimanfoo that's good to know. Another way of achieving this would be to set merge=False to get a dataset for each set of windows.

codecov-io · 2020-10-13T09:25:57Z

Codecov Report

Merging #303 into master will decrease coverage by 0.06%.
The diff coverage is 97.53%.

@@            Coverage Diff             @@
##           master     #303      +/-   ##
==========================================
- Coverage   96.53%   96.46%   -0.07%     
==========================================
  Files          28       29       +1     
  Lines        1962     2039      +77     
==========================================
+ Hits         1894     1967      +73     
- Misses         68       72       +4

Impacted Files	Coverage Δ
sgkit/window.py	`97.01% <97.01%> (ø)`
sgkit/stats/popgen.py	`66.40% <100.00%> (+0.80%)`	⬆️
sgkit/variables.py	`96.36% <100.00%> (+0.06%)`	⬆️
sgkit/stats/hwe.py	`97.50% <0.00%> (-2.50%)`	⬇️
sgkit/testing.py	`95.00% <0.00%> (ø)`
sgkit/io/vcfzarr_reader.py	`100.00% <0.00%> (ø)`
sgkit/stats/aggregation.py	`89.15% <0.00%> (ø)`
sgkit/io/plink/plink_reader.py	`100.00% <0.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6160f06...0c5a3f4. Read the comment docs.

ravwojdyla

I like the idea of storing the window information in the Dataset!

ravwojdyla · 2020-10-15T09:42:55Z

sgkit/window.py

+    return conditional_merge_datasets(ds, new_ds, merge)
+
+
+def _get_windows(start: int, stop: int, size: int, step: int) -> Any:


Why is this Any?

Changed to ArrayType.

ravwojdyla · 2020-10-15T09:44:54Z

sgkit/window.py

+    )
+
+
+def window_statistic(


Is there any better way to type this to avoid some many Anys? (this is also overall comment, there is a lot of types avoided via Any, I understand this is "internal" code).

This was from when I was prototyping, and I forgot to fix them before submitting. I have now eliminated them all except kwargs and block_info, which are harder to provide hints for.

ravwojdyla · 2020-10-15T10:07:26Z

sgkit/stats/popgen.py

+            ds.window_start.values,
+            ds.window_stop.values,


ds.window_{start, stop} is never lazy right?

No, not at the moment at least.

Should it be? The loose guideline I have in my head is that anything that doesn't run interactively (<1s or so) across a whole genome dataset (~100M variants) should probably be lazy. Is it worth adding an issue to make this lazy or test that the latter is true?

Opened #340

ravwojdyla · 2020-10-15T10:11:31Z

sgkit/tests/test_popgen.py

    ts_div = ts.diversity(span_normalise=False)
    np.testing.assert_allclose(div, ts_div)


+@pytest.mark.parametrize("size", [10])


size is rather generic, can be so many different things in this context, maybe a better name would be sample_size? (also below)

Good point - especially as we now have sample size and window size in the same test.

ravwojdyla · 2020-10-15T10:23:34Z

sgkit/tests/test_popgen.py

+    positions = [variant.site.position for variant in ts.variants()]
+    windows = [0] + positions[::25][1:] + [ts.sequence_length]
+    ts_div = ts.diversity(windows=windows, span_normalise=False)
+    np.testing.assert_allclose(div, ts_div)


Probably a bit overarching question, not necessary this PR, but since we heavily depend on dask, I wonder if we should use more of their dask.array.utils.assert_eq (which under the hood calls allclose), which afaiu would handle things like calling compute etc which would potentially make this more "array" idiomatic/agnostic.

Good idea. I opened #332

ravwojdyla · 2020-10-15T10:36:13Z

sgkit/window.py

+    statistic: Callable[..., Any],
+    size: int,
+    step: int,
+    dtype: int,


This probably use use typing.DType?

ravwojdyla · 2020-10-15T10:48:39Z

sgkit/window.py

+        # None means don't rechunk along that axis
+        new_chunks = tuple(list([new_chunk0] + ([None] * (len(chunk0) - 1))))  # type: ignore
+        values = values.rechunk(new_chunks)


afaiu you could probably just say: values = values.rechunk({0: new_chunk0}) or sth akin to that.

Thanks - done

ravwojdyla · 2020-10-15T11:00:45Z

sgkit/window.py

+        new_chunks = (tuple(windows_per_chunk),)
+    else:
+        # depth is 0 except in first axis
+        depth = tuple([depth] + ([0] * (values.ndim - 1)))


I wonder if map_overlap would work in this case with depth = {0: depth} (without the need to specify the depth on the other dimensions) 🤔

That should work: https://github.com/dask/dask/blob/master/dask/array/overlap.py#L732 is the relevant bit for that (all others default to 0).

Thanks for checking - fixed

ravwojdyla · 2020-10-15T11:04:48Z

sgkit/tests/test_window.py

+    values_sa = arr
+    stat_sa = allel.moving_statistic(values_sa, sum_cols, size=size, step=step)
+
+    np.testing.assert_equal(stat, stat_sa)


Would it make sense to add more tests here, there are some interesting functions in the window package, like _get_chunked_windows? Plus test for corner cases like empty arrays etc. And maybe also a test for what happens when you window multiple times with merge=True. And maybe a test case for https://github.com/pystatgen/sgkit/pull/303#issuecomment-705682385

I agree, it'd be good to get some tests in here that are just for the window function itself.

Added tests for various window functions in pystatgen/sgkit@d57aefd

ravwojdyla · 2020-10-15T11:23:15Z

sgkit/window.py

+# Window definition (user code)
+
+
+def window(


Thinking about the API, I wonder if we should have a simple function to "remove" the windowing from a dataset? Otherwise we might expect people to pop window_start/window_stop vars?

Yes, that would be useful. Opened #333

tomwhite · 2020-10-15T13:16:21Z

Thanks for the review @ravwojdyla! I will fix your suggestions given that the overall approach seems to be OK.

eric-czech · 2020-10-15T17:19:55Z

sgkit/window.py

+
+def _get_windows(start: int, stop: int, size: int, step: int) -> Any:
+    # Find the indexes for the start positions of all windows
+    # TODO: take contigs into account


Worth elevating as an issue? I'm sure it will come up again in the context of LD estimation.

Opened #335

eric-czech · 2020-10-15T17:31:59Z

sgkit/tests/test_window.py

+    stat_sa = allel.moving_statistic(values_sa, np.sum, size=size, step=step)
+
+    np.testing.assert_equal(stat, stat_sa)
+


I think a test for dtype preservation when provided float32 inputs would be good for these kinds of functions. I wouldn't say it's critical but PCA and the pairwise metrics will have that behavior so it would be good here too for consistency.

I added an assert to check for dtype preservation here.

… remove for single-window tests).

…ttps://github.com/pystatgen/sgkit/issues/335

tomwhite · 2020-10-19T16:48:15Z

I've addressed all the feedback, except for the comment about adding more tests for the window module. I'll try to write some more, but otherwise this is probably ready to be merged.

jeromekelleher

This looks good, but I worry that we're painting ourselves into a bit of a corner with the names of things here, given that this is so tied to variant indexes. How will we deal with windowing in physical distance? I think we should have a good idea how this is going to work and how we're going to name things before finalising this API.

sgkit/tests/test_popgen.py

jeromekelleher · 2020-10-20T09:59:36Z

sgkit/tests/test_window.py

+    values_sa = arr
+    stat_sa = allel.moving_statistic(values_sa, sum_cols, size=size, step=step)
+
+    np.testing.assert_equal(stat, stat_sa)


I agree, it'd be good to get some tests in here that are just for the window function itself.

jeromekelleher · 2020-10-20T10:35:55Z

sgkit/window.py

+    ----------
+    ds
+        Genotype call dataset.
+    size


Hmm, I'm not so sure about this API, it feels a bit inflexible. People are going to want to window and plot in terms of physical distance and not just in terms of the number of variants (see the proposal in #229). I see that using the number of variants aligns much more nicely with chunking. Is this a fundamental restriction - will we always need windows defined in this way to have equal numbers of variants? If we did want to do things like "compute Fst in 100Kb windows", is this incompatible with the way we're doing things in Dask?

It's not obvious what the utility of the step parameter is either - I'd imagine the vast majority of people will want contiguous windows (unless I'm misunderstanding what this does?). It should at least have a default such that users do get contiguous windows and don't need to specify this value all the time.

I agree with you. I think we do want something along the lines sketched out in #229, but this was a first step, and I had imagined that it would evolve. Note that the underlying idea of having arrays of window start and stop positions in the implementation is very flexible (since the windows can be arbitrary sizes and at arbitrary positions), it's just that the way users specify them with this API is very constrained.

Regarding the step parameter, I was following scikit-allel, and implementing just enough to be able to run PBS and H statistics (since sometimes step != size for those). Defaulting step to size is a good idea.

Perhaps we could have a single windows argument that is overloaded to accept either a size, a string (for physical distance), or a windows specification object that could have things like step and size in it. Something like:

def window(ds: Dataset, windows: Union[int, str, WindowsSpec], merge: bool = True) -> Dataset: ...

But I was hoping we could do this in another PR.

I really like this idea @tomwhite - how about we create a follow up issue for this so, and get this much merged?

jeromekelleher · 2020-10-20T17:12:02Z

@ravwojdyla - have the changes you requested been made?

jeromekelleher · 2020-10-20T18:58:22Z

Looks great @tomwhite - please add the auto-merge flag when you're happy to merge this (I guess with an issue flagging the current limitations as discussed above)

tomwhite mentioned this pull request Oct 5, 2020

Port function for calculating genomic windows #281

Closed

tomwhite requested a review from ravwojdyla October 12, 2020 09:20

tomwhite force-pushed the fst-window branch from 45e8328 to 51b50ed Compare October 13, 2020 09:18

tomwhite force-pushed the fst-window branch from 51b50ed to ebfec5e Compare October 13, 2020 13:36

tomwhite marked this pull request as ready for review October 15, 2020 08:11

ravwojdyla requested changes Oct 15, 2020

View reviewed changes

eric-czech reviewed Oct 15, 2020

View reviewed changes

This was referenced Oct 19, 2020

Use dask.array.utils.assert_eq in tests #332

Open

Add function to remove windowing from a dataset #333

Open

tomwhite added 7 commits October 19, 2020 17:08

Windowing functionality

420da80

Diversity using dataset windows

72358d2

Divergence using dataset windows

f23ac1d

Fst using dataset windows

a0dd4c3

Tajimas_D using dataset windows

47d042f

Fix naming

7aa3513

Only test chunking where chunk size is greater than window size (i.e.…

2fda1fd

… remove for single-window tests).

tomwhite mentioned this pull request Oct 19, 2020

Take contigs into account for fixed-size windowing #335

Closed

tomwhite added 7 commits October 19, 2020 17:42

Replace Any with more restrictive types

7e22bb5

Rename 'size' to 'sample_size'

5be490f

Mark failing divergence test (comparison w skallel) as xfail

718c9a3

Simplify rechunking call

7eb61e5

Add documentation to window function

3773658

Add reference to 'Take contigs into account for fixed-size windowing' h…

7f99d15

…ttps://github.com/pystatgen/sgkit/issues/335

Test moving_statistic preserves dtype

1d2d878

Simplify specification of depth

f6a5b09

tomwhite force-pushed the fst-window branch from ebfec5e to f6a5b09 Compare October 19, 2020 16:46

eric-czech approved these changes Oct 19, 2020

View reviewed changes

Use sgkit.variables for window variable names

231328a

jeromekelleher reviewed Oct 20, 2020

View reviewed changes

tomwhite added 2 commits October 20, 2020 12:58

Add tests for windowing

d57aefd

Get positions from tskit tables directly (in tests)

0c5a3f4

tomwhite mentioned this pull request Oct 20, 2020

Windowing functions should be lazy #340

Closed

ravwojdyla approved these changes Oct 20, 2020

View reviewed changes

jeromekelleher approved these changes Oct 20, 2020

View reviewed changes

tomwhite mentioned this pull request Oct 21, 2020

Flexible specification of windows #341

Closed

tomwhite added the auto-merge Auto merge label for mergify test flight label Oct 21, 2020

Merge branch 'master' into fst-window

e28d0c6

mergify bot merged commit 8467a4d into sgkit-dev:master Oct 21, 2020

tomwhite mentioned this pull request Nov 5, 2020

Update popgen stats to compute for all variants #238

Closed

tomwhite mentioned this pull request Nov 17, 2020

Make window step default to window size #389

Merged

tomwhite mentioned this pull request Nov 30, 2020

Windowing along the genome #229

Closed

		return conditional_merge_datasets(ds, new_ds, merge)


		def _get_windows(start: int, stop: int, size: int, step: int) -> Any:

		stat_sa = allel.moving_statistic(values_sa, np.sum, size=size, step=step)

		np.testing.assert_equal(stat, stat_sa)

Fst windowing #303

Fst windowing #303

Conversation

tomwhite commented Oct 5, 2020

tomwhite commented Oct 5, 2020

alimanfoo commented Oct 8, 2020

tomwhite commented Oct 13, 2020

codecov-io commented Oct 13, 2020 • edited Loading

Codecov Report

ravwojdyla left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ravwojdyla Oct 15, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tomwhite commented Oct 15, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eric-czech Oct 15, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tomwhite commented Oct 19, 2020

jeromekelleher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeromekelleher commented Oct 20, 2020

jeromekelleher commented Oct 20, 2020

codecov-io commented Oct 13, 2020 •

edited

Loading

ravwojdyla Oct 15, 2020 •

edited

Loading

eric-czech Oct 15, 2020 •

edited

Loading