-
Notifications
You must be signed in to change notification settings - Fork 35
Cohort statistics without numba #885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I'd be inclined to take a "if it ain't broke don't fix it" view on this @timothymillar - after all, we depend pretty much entirely on numba, and I'm not sure the library will work in any meaningful way without it? The approach does look quite straightforward though, I agree. |
It would be interesting to see differences in timing, although as @jeromekelleher says numba is used throughout the codebase. Also, #803 is a long-term thing that will get fixed, and is not really a priority use case for us, just something to keep an eye on. And there is progress happening on that front too, see emscripten-forge/recipes#168. |
@tomwhite how does numba interact with |
Numba should work with Cubed by using That said, I'm not against this proposal if the performance is acceptable. |
Numba does cause issues with package compatibility - for example we can't use the latest NumPy version (1.23), which has implications for the version of cyvcf2 we can use... |
Some crude timings using # current implimentation in numba
%%timeit
cohort_sum(sample_values, sample_cohort, n_cohorts).compute()
# via one-hot using numpy
%%timeit
(sample_values[...,None] * cohort_membership[None,...]).sum(axis=1)
# via one-hot using xarray
%%timeit
(ds["sample_values"] * ds["cohort_membership"]).sum(dim="samples") Times in seconds:
So probably not an acceptable performance hit for anymore than a few cohorts. I thought it was at least worth documenting this approach as it may be useful elsewhere. Thanks for the discussion all! |
Very nice, thanks for looking into this @timothymillar! Numba is awesome 😍 |
I've been giving a bit more thought to how we can calculate cohort statistics (see #730) without using numba and realized that we could potentially use the same solution mentioned in https://github.com/pystatgen/sgkit/pull/114#issuecomment-678921281. That approach only uses generic array operations making it very portable (e.g. for #803). However, it does involve passing the data though a higher dimension array (
(..., samples)
->(..., samples, cohorts)
->(..., cohorts)
).A simple example:
The additional memory usage of this approach may be acceptable when the number of cohorts is small. It could also be mitigated by using a sparse encoding for "cohort_membership" and its derivatives.
I guess this comes down to, where do we want to strike the balance between performance and simplicity?
The text was updated successfully, but these errors were encountered: