Summary stats #102
Conversation
@daletovar, @jerowe I think this would be a great place to start getting some real code merged. Would either of you like to work with @tomwhite on getting this fleshed out?
I've actually been struggling with that a good bit, @tomwhite. I keep trying to think of a good verb for things like
@jeromekelleher: @daletovar is out of town until Thursday but I think he'd be interested in this.
To address more of @tomwhite's original questions:
From an efficiency perspective, I think it would be ok to define upstream variables if not provided and then not merge/return them. I had thought Dask would be smart enough to know when the same computation is defined twice, and in some experiments I haven't been able to find a counterexample, meaning that if we were to throw away the
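For reference, here is a minimal sketch (using plain Dask arrays rather than the sgkit code) of the deduplication behaviour described above: Dask derives task names deterministically from each operation and its inputs, so defining the same computation twice yields identical task keys, and the duplicated work runs only once when both results are computed together.

```python
# Minimal, hypothetical sketch (plain Dask, not sgkit code): defining the
# same computation twice produces identical task keys, so the work is only
# done once when both results are computed together.
import dask
import dask.array as da
import numpy as np

x = da.from_array(np.arange(10), chunks=5)

a = (x + 1).sum()
b = (x + 1).sum()  # same expression, built a second time

print(a.name == b.name)    # True: identical graphs share task names
print(dask.compute(a, b))  # shared tasks are evaluated only once
```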
I opened https://github.com/pystatgen/sgkit/issues/103 with some thoughts on that so we can keep that discussion going.
@tomwhite, I'm happy to help with this. Would you prefer we keep this PR open and I fork your repo to collaborate?
@daletovar thanks for picking this up. I don't have a strong opinion on whether you fork this or start from scratch.
def genotype_count(ds: Dataset, dim: Dimension) -> Dataset:
FYI this should be https://github.com/pystatgen/sgkit/issues/29#issuecomment-656691069 instead.
This is now fixed.
@tomwhite, I see. Well, do we know which summary stats we want to have?
Force-pushed from 690c4bf to 0f17c53
After discussion with @daletovar I've picked this one up again. I've rebased on top of #217 (which is still a draft) so that the allele counting functions return datasets, and to use the code to merge datasets. I've also added docs, and a test for the case where
Force-pushed from 6d62c8a to 5e9abae
Sorry it took so long to approve this, but LGTM.
@Mergifyio rebase
Force-pushed from 9134a72 to 79bea64
@Mergifyio update
@Mergifyio update
The updates weren't working because we hadn't merged #244. Hopefully this will sort it out now.
This is a draft implementation of #29 (based on @eric-czech's code there) for discussion. I'm quite happy for someone else to take it over if interested (e.g. @daletovar, @jerowe).
API docs still need to be added, as do sample summary stats.
A couple of points for discussion:

- The functions so far are named with verbs (e.g. `count_alleles`), but some of the new methods don't easily fit into that pattern: e.g. `call_rate`, `variant_stats`. Perhaps insisting that the function name for each method is a verb is going too far, and we should just allow the name that we think sounds best (this seems to be the approach taken by Hail and Glow, which both have a mixture of both styles).
- The `allele_frequency` function calls `count_alleles`, but perhaps it should check first if it has already been computed (I think @eric-czech does something similar in LD prune, for example). In that case, should it return the `variant_allele_count` variable in the returned dataset? I'm wondering whether we should actually merge the new variables with the original dataset as a general rule. The original dataset would be unchanged, and this would mean the user no longer had to call merge themselves, which might work better in pipelines, for example. (A rough sketch of this pattern follows below.)
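To make the second point concrete, here is a hypothetical sketch of the compute-if-missing and merge-by-default pattern. It is not the actual sgkit implementation: the behaviour of `count_alleles` (returning a Dataset containing `variant_allele_count`), the `alleles` dimension name, and the `variant_allele_frequency` output name are all assumptions made for illustration.

```python
# Hypothetical sketch of the compute-if-missing / merge-by-default pattern
# discussed above. Not sgkit's actual implementation: the shape of the
# count_alleles() result, the "alleles" dimension, and the output variable
# name are assumptions.
import xarray as xr


def allele_frequency(ds: xr.Dataset, merge: bool = True) -> xr.Dataset:
    # Reuse allele counts if the caller has already computed them,
    # otherwise compute them here and discard them afterwards.
    if "variant_allele_count" in ds:
        ac = ds["variant_allele_count"]
    else:
        # count_alleles is the PR's allele-counting function, assumed here
        # to return a Dataset containing "variant_allele_count".
        ac = count_alleles(ds)["variant_allele_count"]
    af = ac / ac.sum(dim="alleles")
    new = xr.Dataset({"variant_allele_frequency": af})
    # Merging with the (unchanged) input dataset saves the caller a manual
    # merge, which reads more naturally in pipelines.
    return xr.merge([ds, new]) if merge else new
```

With merging as the default, the original variables are preserved and downstream steps can keep chaining on the same dataset; a `merge=False` option would still let callers get only the new variables.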