Skip to content

Fst windowing #303

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Oct 21, 2020
Merged

Fst windowing #303

merged 19 commits into from
Oct 21, 2020

Conversation

tomwhite
Copy link
Collaborator

@tomwhite tomwhite commented Oct 5, 2020

This is a proposal for #229, and is also related to #281.

  • Adds a new window function that adds windowing information to a dataset (a windows dimension). This just adds fixed-size windows at the moment, and doesn't take contigs into account - both could be improved in the future.
  • Adds a window_statistic function (similar scikit-allel's moving_statistic) that computes a statistic for each window (using Dask's map_overlap function) for use by method functions (i.e. this is not a user-facing function).
  • Updates diversity, divergence, and Fst to use windows if present.

Sample usage:

>>> ds = window(ds, size=25, step=25)
>>> fst_ds = Fst(ds, estimator="Nei")
>>> fst_ds
<xarray.Dataset>
Dimensions:              (alleles: 2, cohorts: 2, cohorts_0: 2, cohorts_1: 2, ploidy: 1, samples: 10, variants: 98, windows: 4)
Coordinates:
  * cohorts_0            (cohorts_0) object 'co_0' 'co_1'
  * cohorts_1            (cohorts_1) object 'co_0' 'co_1'
Dimensions without coordinates: alleles, cohorts, ploidy, samples, variants, windows
Data variables:
    stat_Fst             (windows, cohorts_0, cohorts_1) float64 dask.array<chunksize=(4, 2, 2), meta=np.ndarray>
    cohort_allele_count  (variants, cohorts, alleles) int64 dask.array<chunksize=(98, 2, 2), meta=np.ndarray>
    call_allele_count    (variants, samples, alleles) uint8 dask.array<chunksize=(98, 10, 2), meta=np.ndarray>
    window_start         (windows) int64 0 25 50 75
    window_stop          (windows) int64 25 50 75 98
    variant_contig       (variants) int64 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    variant_position     (variants) int64 0 2 6 8 12 19 ... 193 193 194 196 197
    variant_allele       (variants, alleles) |S1 b'0' b'1' b'0' ... b'0' b'1'
    sample_id            (samples) <U5 'tsk_0' 'tsk_1' ... 'tsk_8' 'tsk_9'
    call_genotype        (variants, samples, ploidy) int8 0 0 0 1 1 ... 1 0 0 1
    call_genotype_mask   (variants, samples, ploidy) bool False False ... False
    sample_cohort        (samples) int32 0 0 0 0 0 1 1 1 1 1
>>> fst = fst_ds["stat_Fst"].values
>>> fst
[[[       nan 0.04802022]
  [0.04802022        nan]]

 [[       nan 0.05591201]
  [0.05591201        nan]]

 [[       nan 0.03881701]
  [0.03881701        nan]]

 [[       nan 0.02708124]
  [0.02708124        nan]]]

Note the window_start and window_stop variables of dimension windows, and the stat_Fst variable whose first dimension is windows.

@tomwhite
Copy link
Collaborator Author

tomwhite commented Oct 5, 2020

BTW this is based on #302, which should be merged first.

@alimanfoo
Copy link
Collaborator

Hi @tomwhite, one small thought is that I'll often try computing some statistic with a variety of different window sizes. So we might need some way to specify a name or prefix for each set of windows being added to the same dataset.

@tomwhite
Copy link
Collaborator Author

@alimanfoo that's good to know. Another way of achieving this would be to set merge=False to get a dataset for each set of windows.

@codecov-io
Copy link

codecov-io commented Oct 13, 2020

Codecov Report

Merging #303 into master will decrease coverage by 0.06%.
The diff coverage is 97.53%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #303      +/-   ##
==========================================
- Coverage   96.53%   96.46%   -0.07%     
==========================================
  Files          28       29       +1     
  Lines        1962     2039      +77     
==========================================
+ Hits         1894     1967      +73     
- Misses         68       72       +4     
Impacted Files Coverage Δ
sgkit/window.py 97.01% <97.01%> (ø)
sgkit/stats/popgen.py 66.40% <100.00%> (+0.80%) ⬆️
sgkit/variables.py 96.36% <100.00%> (+0.06%) ⬆️
sgkit/stats/hwe.py 97.50% <0.00%> (-2.50%) ⬇️
sgkit/testing.py 95.00% <0.00%> (ø)
sgkit/io/vcfzarr_reader.py 100.00% <0.00%> (ø)
sgkit/stats/aggregation.py 89.15% <0.00%> (ø)
sgkit/io/plink/plink_reader.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6160f06...0c5a3f4. Read the comment docs.

@tomwhite tomwhite marked this pull request as ready for review October 15, 2020 08:11
Copy link
Collaborator

@ravwojdyla ravwojdyla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of storing the window information in the Dataset!

sgkit/window.py Outdated
return conditional_merge_datasets(ds, new_ds, merge)


def _get_windows(start: int, stop: int, size: int, step: int) -> Any:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this Any?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to ArrayType.

)


def window_statistic(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any better way to type this to avoid some many Anys? (this is also overall comment, there is a lot of types avoided via Any, I understand this is "internal" code).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was from when I was prototyping, and I forgot to fix them before submitting. I have now eliminated them all except kwargs and block_info, which are harder to provide hints for.

Comment on lines +77 to +78
ds.window_start.values,
ds.window_stop.values,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ds.window_{start, stop} is never lazy right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not at the moment at least.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be? The loose guideline I have in my head is that anything that doesn't run interactively (<1s or so) across a whole genome dataset (~100M variants) should probably be lazy. Is it worth adding an issue to make this lazy or test that the latter is true?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #340

ts_div = ts.diversity(span_normalise=False)
np.testing.assert_allclose(div, ts_div)


@pytest.mark.parametrize("size", [10])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size is rather generic, can be so many different things in this context, maybe a better name would be sample_size? (also below)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - especially as we now have sample size and window size in the same test.

positions = [variant.site.position for variant in ts.variants()]
windows = [0] + positions[::25][1:] + [ts.sequence_length]
ts_div = ts.diversity(windows=windows, span_normalise=False)
np.testing.assert_allclose(div, ts_div)
Copy link
Collaborator

@ravwojdyla ravwojdyla Oct 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a bit overarching question, not necessary this PR, but since we heavily depend on dask, I wonder if we should use more of their dask.array.utils.assert_eq (which under the hood calls allclose), which afaiu would handle things like calling compute etc which would potentially make this more "array" idiomatic/agnostic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I opened #332

sgkit/window.py Outdated
statistic: Callable[..., Any],
size: int,
step: int,
dtype: int,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably use use typing.DType?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

sgkit/window.py Outdated
Comment on lines 102 to 104
# None means don't rechunk along that axis
new_chunks = tuple(list([new_chunk0] + ([None] * (len(chunk0) - 1)))) # type: ignore
values = values.rechunk(new_chunks)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaiu you could probably just say: values = values.rechunk({0: new_chunk0}) or sth akin to that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - done

sgkit/window.py Outdated
new_chunks = (tuple(windows_per_chunk),)
else:
# depth is 0 except in first axis
depth = tuple([depth] + ([0] * (values.ndim - 1)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if map_overlap would work in this case with depth = {0: depth} (without the need to specify the depth on the other dimensions) 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should work: https://github.com/dask/dask/blob/master/dask/array/overlap.py#L732 is the relevant bit for that (all others default to 0).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking - fixed

values_sa = arr
stat_sa = allel.moving_statistic(values_sa, sum_cols, size=size, step=step)

np.testing.assert_equal(stat, stat_sa)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to add more tests here, there are some interesting functions in the window package, like _get_chunked_windows? Plus test for corner cases like empty arrays etc. And maybe also a test for what happens when you window multiple times with merge=True. And maybe a test case for https://github.com/pystatgen/sgkit/pull/303#issuecomment-705682385

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it'd be good to get some tests in here that are just for the window function itself.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests for various window functions in pystatgen/sgkit@d57aefd

# Window definition (user code)


def window(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about the API, I wonder if we should have a simple function to "remove" the windowing from a dataset? Otherwise we might expect people to pop window_start/window_stop vars?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would be useful. Opened #333

@tomwhite
Copy link
Collaborator Author

Thanks for the review @ravwojdyla! I will fix your suggestions given that the overall approach seems to be OK.

sgkit/window.py Outdated

def _get_windows(start: int, stop: int, size: int, step: int) -> Any:
# Find the indexes for the start positions of all windows
# TODO: take contigs into account
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth elevating as an issue? I'm sure it will come up again in the context of LD estimation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #335

stat_sa = allel.moving_statistic(values_sa, np.sum, size=size, step=step)

np.testing.assert_equal(stat, stat_sa)

Copy link
Collaborator

@eric-czech eric-czech Oct 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a test for dtype preservation when provided float32 inputs would be good for these kinds of functions. I wouldn't say it's critical but PCA and the pairwise metrics will have that behavior so it would be good here too for consistency.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an assert to check for dtype preservation here.

@tomwhite
Copy link
Collaborator Author

I've addressed all the feedback, except for the comment about adding more tests for the window module. I'll try to write some more, but otherwise this is probably ready to be merged.

Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but I worry that we're painting ourselves into a bit of a corner with the names of things here, given that this is so tied to variant indexes. How will we deal with windowing in physical distance? I think we should have a good idea how this is going to work and how we're going to name things before finalising this API.

values_sa = arr
stat_sa = allel.moving_statistic(values_sa, sum_cols, size=size, step=step)

np.testing.assert_equal(stat, stat_sa)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it'd be good to get some tests in here that are just for the window function itself.

----------
ds
Genotype call dataset.
size
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not so sure about this API, it feels a bit inflexible. People are going to want to window and plot in terms of physical distance and not just in terms of the number of variants (see the proposal in #229). I see that using the number of variants aligns much more nicely with chunking. Is this a fundamental restriction - will we always need windows defined in this way to have equal numbers of variants? If we did want to do things like "compute Fst in 100Kb windows", is this incompatible with the way we're doing things in Dask?

It's not obvious what the utility of the step parameter is either - I'd imagine the vast majority of people will want contiguous windows (unless I'm misunderstanding what this does?). It should at least have a default such that users do get contiguous windows and don't need to specify this value all the time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you. I think we do want something along the lines sketched out in #229, but this was a first step, and I had imagined that it would evolve. Note that the underlying idea of having arrays of window start and stop positions in the implementation is very flexible (since the windows can be arbitrary sizes and at arbitrary positions), it's just that the way users specify them with this API is very constrained.

Regarding the step parameter, I was following scikit-allel, and implementing just enough to be able to run PBS and H statistics (since sometimes step != size for those). Defaulting step to size is a good idea.

Perhaps we could have a single windows argument that is overloaded to accept either a size, a string (for physical distance), or a windows specification object that could have things like step and size in it. Something like:

def window(ds: Dataset, windows: Union[int, str, WindowsSpec], merge: bool = True) -> Dataset:
  ...

But I was hoping we could do this in another PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this idea @tomwhite - how about we create a follow up issue for this so, and get this much merged?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #341

@jeromekelleher
Copy link
Collaborator

@ravwojdyla - have the changes you requested been made?

@jeromekelleher
Copy link
Collaborator

Looks great @tomwhite - please add the auto-merge flag when you're happy to merge this (I guess with an issue flagging the current limitations as discussed above)

@tomwhite tomwhite added the auto-merge Auto merge label for mergify test flight label Oct 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants