WIP: Pairwise distance dask cuda blog post #50
Conversation
Thanks for doing this @alimanfoo and @jakirkham. From our brief encounter on this last month, I also found the narrative from your `numba.cuda.jit` exploration motivating.
Just noting this: @kkraus14 had suggested using `forall` as a way to avoid setting the threads to use. @alimanfoo found that this worked very nicely, both from a user perspective and a performance perspective.

One thing this use case highlights, though, is that users will want to loop over several dimensions, which `forall` doesn't currently support (unless we are missing something?). As a result, the coordinates get raveled and then unraveled in the kernel.

If this pattern of raveling/unraveling with `forall` is reasonable, it would be handy to have it handled for the user (for example, calling `.forall((n, n-1))`). That would make it a bit easier to write Numba CUDA kernels. Thoughts @seibert? Happy to raise an issue to discuss this further if it is of interest.
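For readers following along, here is a minimal, self-contained sketch of the ravel/unravel pattern being discussed (the kernel and array names are illustrative, not taken from the post): the 2D iteration space is flattened for `forall`, and each thread recovers its 2D coordinates inside the kernel.

```python
import numpy as np
from numba import cuda

@cuda.jit
def fill_2d(out):
    flat = cuda.grid(1)          # flat index handed out by the forall launch
    n0, n1 = out.shape
    if flat < n0 * n1:           # guard against any excess threads
        i = flat // n1           # unravel the flat index back to 2D coordinates
        j = flat % n1
        out[i, j] = i + j

out = cuda.device_array((4, 5), dtype=np.float32)
fill_2d.forall(out.size)(out)    # forall only accepts the flattened task count
```

The excerpt from the notebook below shows the same pattern in context.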
" out = cuda.device_array(n_pairs, dtype=np.float32)\n", | ||
"\n", | ||
" # Let numba decide number of threads and blocks.\n", | ||
" kernel_spec = kernel_cityblock_cuda.forall(n_pairs)\n", |
Here we use the raveled index to pass to `forall` (`n_pairs` is computed a few lines earlier).
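For context, `n_pairs` is presumably the number of distinct column pairs, i.e. n choose 2, computed along these lines (an assumption based on the rest of the thread):

```python
# Assumed relationship: n columns give n choose 2 distinct (j, k) pairs with j < k,
# matching the length of a condensed distance output.
n = x.shape[1]               # x is the input array from the post
n_pairs = n * (n - 1) // 2
```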
" pair_index = cuda.grid(1)\n", | ||
" if pair_index < n_pairs:\n", | ||
" # Unpack the pair index to column indices.\n", | ||
" j, k = square_coords_cuda(pair_index, n)\n", |
Here is where the unraveling is occurring for context.
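For reference, one standard way to invert a condensed upper-triangle pair index back into `(j, k)` is sketched below; this illustrates the idea and is not necessarily the exact `square_coords_cuda` from the post.

```python
import math

def square_coords(pair_index, n):
    # Invert the condensed ordering idx = j*n - j*(j+1)//2 + (k - j - 1), with j < k.
    j = n - 2 - int(math.floor(math.sqrt(-8 * pair_index + 4 * n * (n - 1) - 7) / 2.0 - 0.5))
    k = pair_index + j + 1 - n * (n - 1) // 2 + (n - j) * (n - j - 1) // 2
    return j, k

assert square_coords(0, 4) == (0, 1)
assert square_coords(5, 4) == (2, 3)
```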
"source": [ | ||
"def pairwise_cityblock_cpu(x):\n", | ||
" assert x.ndim == 2\n", | ||
" out = spd.pdist(x.T, metric='cityblock')\n", |
To record a point from earlier for follow-up: `pdist` is likely using `float64`, whereas the kernels written below are `float32`. It would be good to standardize these a bit to have a fair comparison. That said, this will likely only be a 2x change relative to what we have here (unless I'm missing things).

One option would be to write a Numba kernel for this case as well. Though after talking with @alimanfoo, it seems SciPy's implementation is a bit more performant than a naive Numba kernel. Also SciPy seems to handle C and Fortran order arrays with identical performance. So there may be some other tricks buried in the SciPy implementation that would be worth learning from.
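A rough sketch of what such a Numba CPU kernel could look like, as a naive `float32` baseline for the comparison (illustrative code, not from the post):

```python
import numpy as np
from numba import njit

@njit
def pairwise_cityblock_numba_cpu(x):
    # x has shape (m, n); compare columns pairwise, accumulating in float32.
    # The (j, k) loop order matches the condensed ordering used by pdist.
    m, n = x.shape
    out = np.zeros(n * (n - 1) // 2, dtype=np.float32)
    p = 0
    for j in range(n):
        for k in range(j + 1, n):
            d = np.float32(0.0)
            for i in range(m):
                d += abs(x[i, j] - x[i, k])
            out[p] = d
            p += 1
    return out
```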
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# launch a local cuda cluster?" |
It seems that just running with multiple Dask workers using large data on a single GPU doesn't yield much here as the GPU is basically saturated with work. Though we suspect having multiple GPUs will be more useful as one can load chunks of data to compute on each GPU to get a speed up from parallelization.
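For the placeholder above, a minimal multi-GPU setup with dask-cuda would look roughly like this sketch (assuming `dask_cuda` is installed):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per visible GPU; chunks of the array can then be
# computed on different GPUs to get a speed-up from parallelization.
cluster = LocalCUDACluster()
client = Client(cluster)
```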
Thanks @jakirkham. Just to xref numba/numba#4309, in which I asked about another possible way to simplify running CUDA kernels, including over multiple dimensions. As a user I would generally be happy with any approach that means I don't have to hard-code the number of threads-per-block. I realise that's maybe hard to do in all cases, but if there were a reasonable approximation it could remove an entry barrier to programming CUDA kernels.

An interesting point of detail here as well: if the pairwise distance was run as a 2D grid, that would mean `n**2` threads get scheduled, but we only want to run n choose 2 threads. It would be easy to add some logic so that a thread only runs if it is in the upper triangle, for example, but then is there any cost to scheduling around twice as many threads as are needed? Just a point of curiosity.
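To make the 2D-grid idea concrete, a hypothetical variant (not the kernel from the post) might look like the sketch below: roughly `n**2` threads get scheduled and a guard keeps only the upper-triangle ones doing work, which is exactly the scheduling-overhead question raised above.

```python
import numpy as np
from numba import cuda

@cuda.jit
def kernel_cityblock_2d(x, out_square):
    j, k = cuda.grid(2)                  # 2D thread coordinates
    n = x.shape[1]
    if j < n and k < n and j < k:        # only upper-triangle threads do work
        d = 0.0
        for i in range(x.shape[0]):
            d += abs(x[i, j] - x[i, k])
        out_square[j, k] = d

n = 1000
x_cuda = cuda.to_device(np.random.rand(200, n).astype(np.float32))
out_square = cuda.device_array((n, n), dtype=np.float32)
threads = (16, 16)
blocks = ((n + 15) // 16, (n + 15) // 16)  # enough blocks to cover the n x n grid
kernel_cityblock_2d[blocks, threads](x_cuda, out_square)
```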
I wonder if guvectorize can help here when parallelizing across additional dimensions? It seems like that might also handle some of the ambiguity around threads.
Yeah we chatted about this as well. I think it would be a good thing to try. It's worth noting that pairwise distance is raveling the results a bit. So there may be a trick that we didn't quite wrap our heads around. Certainly it makes sense for the non-pairwise distance case.
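For the non-pairwise case mentioned here, a gufunc sketch could look something like the following; the signature and names are assumptions, not code from the post.

```python
import numpy as np
from numba import guvectorize

@guvectorize(['void(float32[:], float32[:], float32[:])'], '(m),(m)->()',
             target='cuda')
def cityblock_gufunc(u, v, out):
    # Cityblock distance between two vectors; broadcasting supplies the
    # batching over any leading dimensions.
    acc = 0.0
    for i in range(u.shape[0]):
        acc += abs(u[i] - v[i])
    out[0] = acc
```

Calling `cityblock_gufunc(a, b)` on two arrays of shape `(batch, m)` would then give `batch` distances without writing an explicit launch configuration.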
This is a good question. I'm not sure of the answer. Hopefully one of the other CUDA experts cc'd can weigh in.

Thanks for working on this a bit with me, @alimanfoo, and sharing your experience thus far. Tried to capture our conversation from earlier in the comments. Please feel free to fill in or correct anything as needed.

CC @seibert ^ assuming that's a typo and Stan is meant?

Yep, thanks for catching and correcting that, Stuart! 😄
History preserving measuring before converting the notebook to markdown.
I took the liberty of turning this into markdown and filling in some text. Hope that is ok @alimanfoo. Happy to revert, move around, and work through stuff here as you see fit. Also please feel free to overwrite changes I've made if you see this going a different direction or if I've missed things. The genomics bits in particular are pretty rough as I'm definitely not the expert there. 😄

Also @mrocklin if you have other suggestions here, they would be welcome. 🙂

Thanks @jakirkham. I don't think I'll get to this before going on leave, but if you don't mind hanging on to this for a couple of weeks I'd be happy to help finish it up when back. Also happy to add a few notes on things that I tried in earlier experiments that didn't work so well, as suggested by @mrocklin, if you think it would be worth it (basically it was trying the kernel parallelising over rows instead of pairs, which then required an atomic add, which I think slows it down).
Yep, no worries. Just wanted to fill in some things so we have an idea of what is here and what to do when you are back. Figured it might make things more approachable when we return. That could be interesting. As I filled this out, there were some points in the narrative that could benefit from these earlier parts of the story. Like how did you implement the distance function before using `forall`?
Yep, could add that too. I wondered if it might be good to add some of these things in as a kind of post-script, something like "other things we tried", so the main body of the post is relatively concise, but there are some further details for the interested reader. Happy to do it either way.
Yeah there was a sizable bit near the end that I condensed down to a small bit of code and a few lines of text. There is a balance to be struck between info and length. Not sure what the right balance is in this case, but think we will be able to figure it out. My hope is we are kind of close with the content here.
Thanks for writing this, I've read through and think it's a good self contained use case for demonstrating how to go from a CPU alg to a potentially distributed GPU alg. The narrative does a good job of explaining a method to perform such changes.
I've made a mixture of suggestions (acceptance is entirely optional :)) categorised approximately as follows:
- stylistic/wording
- ways to reinforce the narrative/motivation
- additional items to help develop the CPU->GPU strategy described within
There's also the question of determining whether a kernel and the launch config is making good use of the hardware. Perhaps that should/could be a follow-up and could contain more code tweaks and exploration of performance analysis tooling?
> First we need to move our data to the GPU. There are several ways we could do this. Though one reasonably approachable way is to use Numba. We could write the following to handle our in-memory case.
In memory with respect to what? Machine RAM vs. GPU RAM?
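For context, the transfer step the excerpt refers to is presumably something along these lines (the array shape and dtype here are made up):

```python
import numpy as np
from numba import cuda

x = np.random.randint(0, 3, size=(10_000, 100)).astype(np.float32)  # in machine RAM
x_cuda = cuda.to_device(x)                                          # copied to GPU RAM
```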
```python
@cuda.jit
def kernel_cityblock_cuda(x, out):
```
I feel it would be useful to have a version of the CPU-equivalent algorithm for comparison, as it's not obvious what level of effort is required to convert from CPU to CUDA when the current baseline CPU version is a SciPy function call (`scipy.spatial.distance.pdist`, via the `spd` alias)? A useful side effect of this would be a `nopython`-mode JIT-compiled variant, which gives a hint at CPU/parallel-CPU performance vs. CUDA.

A further side effect would be that it might help in working out a strategy for conversion of such algs: `njit(cpu) -> guvectorize(cpu) -> guvectorize(cuda) -> cuda kernels -> optimising cuda kernels` (mixed in with dask?)?
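As a sketch of the kind of parallel-CPU baseline suggested here (assumed code, not from the post), a `prange` loop over pairs gives a nopython-mode variant that hints at multi-core CPU performance:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def pairwise_cityblock_parallel_cpu(x, out):
    # x: (m, n); out: condensed distances of length n*(n-1)//2.
    m, n = x.shape
    for j in prange(n):
        for k in range(j + 1, n):
            p = j * n - j * (j + 1) // 2 + (k - j - 1)  # condensed index, unique per pair
            d = np.float32(0.0)
            for i in range(m):
                d += abs(x[i, j] - x[i, k])
            out[p] = d

# Usage sketch:
# out = np.zeros(n * (n - 1) // 2, dtype=np.float32)
# pairwise_cityblock_parallel_cpu(x, out)
```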
```python
@cuda.jit(device=True)
def square_coords_cuda(pair_index, n):
```
Some LaTeX spelling out the maths and/or a wiki link to a canonical form of this alg would be beneficial.
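For example, the condensed upper-triangle indexing could be spelled out roughly as follows (my notation; worth checking against the post's actual convention), with `square_coords_cuda` computing the inverse map from index back to `(j, k)`:

```latex
\text{For } 0 \le j < k < n:\qquad
\operatorname{idx}(j, k) = j\,n - \frac{j(j+1)}{2} + (k - j - 1),
\qquad 0 \le \operatorname{idx}(j, k) < \binom{n}{2}.
```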
```python
dist_cuda = pairwise_cityblock_cuda(x_cuda)
cuda.synchronize()
```
Might want to explain why this is needed; it's a common stumbling block in assessing the performance of CUDA functions (Numba core devs answer this question regularly)?
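For example, a short timed snippet might make the point (reusing the names from the excerpt; the timing wrapper is just for illustration). Kernel launches return as soon as the work is queued, so without a synchronize the timer only measures launch overhead:

```python
import time
from numba import cuda

start = time.perf_counter()
dist_cuda = pairwise_cityblock_cuda(x_cuda)  # returns once the launch is queued
cuda.synchronize()                           # block until the GPU actually finishes
print(f"kernel time: {time.perf_counter() - start:.3f} s")
```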
```python
dist_big_cuda = pairwise_cityblock_dask(x_big_dask_cuda, f=pairwise_cityblock_cuda).compute()
```
It'd probably be worth making this into a notebook so that it can be downloaded and executed; it'd also shake out any bugs in the code blocks in the document? Having some canned/expected output would definitely help with ensuring that algorithm translations etc. are done correctly.
Yeah this started as a notebook originally. Then we moved it into this blogpost form so we could publish and share it a bit more broadly.
@alimanfoo, would you have a chance to go through this and fill in any scientific background I may have lost? Also feel free to change any of the text as you see fit. 🙂

@jakirkham apologies, I've been totally buried. Will try to get back to this asap. ccing @jeromekelleher for interest.
Co-authored-by: stuartarchibald <[email protected]>
It has been a couple of years since there has been activity here. Thanks for all the effort that went into this, but I'm going to close it out for now in an effort to tidy things here a little. It would be great to see more submissions like this though.
Initial work towards a pairwise distance blog post, collaborating with @jakirkham.