
WIP: Pairwise distance dask cuda blog post #50


Closed

Conversation

alimanfoo
Contributor

Initial work towards a pairwise distance blog post, collaborating with @jakirkham.

@mrocklin
Member

mrocklin commented Aug 9, 2019

Thanks for doing this @alimanfoo and @jakirkham

From our brief encounter on this last month, I also found the narrative from your numba.cuda.jit exploration motivating:

  1. Wrote something on the CPU
  2. Tried numba.cuda.jit on a low-end GPU, wasn't any faster
  3. Wrote it again, got a modest speedup
  4. Tried it on a V100, got 100x speedup
  5. Did a bunch of things (what were they?), and got up to a 10x speedup on the laptop


@jakirkham left a comment


Just noting this: @kkraus14 had suggested using forall as a way to avoid setting the threads to use. @alimanfoo found that this worked very nicely, both from a user perspective and a performance perspective.

One thing this use case highlights, though, is that users will want to loop over several dimensions, which forall doesn't currently support (unless we are missing something?). As a result, the coordinates get raveled and then unraveled in the kernel.

If this pattern of raveling/unraveling with forall is reasonable, it would be handy to have it handled for the user (for example, by calling .forall((n, n-1))), making it a bit easier to write Numba CUDA kernels. Thoughts @seibert? Happy to raise an issue to discuss this further if it is of interest.
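To make the proposal concrete, here is a rough sketch of how such a shape-aware launch could be approximated today; `forall_2d` and `kernel_2d_flat` are hypothetical names, not an existing Numba API:

```python
import numpy as np
from numba import cuda

@cuda.jit
def kernel_2d_flat(out):
    # Each thread gets a flat index and unravels it to 2D coordinates.
    n_rows, n_cols = out.shape
    i = cuda.grid(1)
    if i < n_rows * n_cols:
        r = i // n_cols
        c = i % n_cols
        out[r, c] = r + c

def forall_2d(kernel, n_rows, n_cols):
    # Approximates a hypothetical kernel.forall((n_rows, n_cols)) by
    # raveling the index space and letting forall pick the launch config.
    return kernel.forall(n_rows * n_cols)

out = cuda.device_array((4, 5), dtype=np.float32)
forall_2d(kernel_2d_flat, 4, 5)(out)
```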

" out = cuda.device_array(n_pairs, dtype=np.float32)\n",
"\n",
" # Let numba decide number of threads and blocks.\n",
" kernel_spec = kernel_cityblock_cuda.forall(n_pairs)\n",

Here we use the raveled index to pass to forall (n_pairs is computed a few lines earlier).
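For context, the object returned by forall is a configured launcher that is then called like the kernel itself. A minimal sketch, with names assumed from the snippets here:

```python
# n_pairs counts the distinct column pairs, i.e. n choose 2.
n_pairs = (n * (n - 1)) // 2
kernel_spec = kernel_cityblock_cuda.forall(n_pairs)
kernel_spec(x_cuda, out)  # launch; forall chose blocks/threads for us
```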

" pair_index = cuda.grid(1)\n",
" if pair_index < n_pairs:\n",
" # Unpack the pair index to column indices.\n",
" j, k = square_coords_cuda(pair_index, n)\n",

Here is where the unraveling is occurring for context.
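The body of square_coords_cuda isn't shown in this hunk. For readers following along, a minimal sketch of what such a device function might look like, assuming SciPy's condensed ordering of pairs (j, k) with j < k:

```python
import math
from numba import cuda

@cuda.jit(device=True)
def square_coords_cuda(pair_index, n):
    # Invert the condensed (upper-triangle) layout, where
    #   pair_index = j*n - j*(j+1)//2 + (k - j - 1)  for j < k,
    # using the closed-form quadratic inverse for j.
    # (For very large n, floating-point rounding here would need care.)
    j = int((2 * n - 1 - math.sqrt((2 * n - 1) ** 2 - 8 * pair_index)) // 2)
    k = pair_index - j * n + (j * (j + 1)) // 2 + j + 1
    return j, k
```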

"source": [
"def pairwise_cityblock_cpu(x):\n",
" assert x.ndim == 2\n",
" out = spd.pdist(x.T, metric='cityblock')\n",

To record a point from earlier for follow-up: pdist is likely using float64, whereas the kernels written below use float32. It would be good to standardize these to make the comparison fair. That said, this is likely only about a 2x difference relative to what we have here (unless I'm missing something).

One option would be to write a Numba kernel for this case as well. Though after talking with @alimanfoo, it seems SciPy's implementation is a bit more performant than a naive Numba kernel. Also, SciPy seems to handle C- and Fortran-order arrays with identical performance. So there may be some other tricks buried in the SciPy implementation that would be worth learning from.
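To make that option concrete, a sketch of what a float32 Numba CPU variant might look like (an illustration, not code from this PR), matching the condensed layout of pdist(x.T):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def pairwise_cityblock_cpu_numba(x):
    # x has shape (m, n); distances are between columns, as with pdist(x.T).
    m, n = x.shape
    out = np.empty((n * (n - 1)) // 2, dtype=np.float32)
    for j in prange(n):
        for k in range(j + 1, n):
            d = np.float32(0)
            for i in range(m):
                d += abs(x[i, j] - x[i, k])
            # Condensed index matching SciPy's pdist layout.
            out[j * n - (j * (j + 1)) // 2 + (k - j - 1)] = d
    return out
```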

"metadata": {},
"outputs": [],
"source": [
"# launch a local cuda cluster?"

It seems that just running with multiple Dask workers on large data with a single GPU doesn't yield much here, as the GPU is basically saturated with work. We suspect multiple GPUs will be more useful, though, since chunks of data can be loaded onto and computed on each GPU, giving a speedup from parallelization.
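For reference, a sketch of what that multi-GPU setup could look like, assuming the dask-cuda package is installed:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker is started per visible GPU; chunks of a Dask array can
# then be loaded onto and computed on each GPU in parallel.
cluster = LocalCUDACluster()
client = Client(cluster)
```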

@alimanfoo
Contributor Author

Thanks @jakirkham. Just to xref numba/numba#4309, in which I asked about another possible way to simplify running CUDA kernels, including over multiple dimensions. As a user I would generally be happy with any approach that means I don't have to hard-code the number of threads per block. I realise that's maybe hard to do in all cases, but if there were a reasonable approximation it could remove an entry barrier to programming CUDA kernels.

An interesting point of detail here as well: if the pairwise distance was run as a 2D grid, n**2 threads would get scheduled, but we only want to run n choose 2 threads. It would be easy to add some logic so that a thread only runs if it is in the upper triangle, for example, but then is there any cost to scheduling around twice as many threads as are needed? Just a point of curiosity.
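For illustration, the 2D-grid variant being discussed might look like the sketch below (not the kernel in this PR); roughly half the scheduled threads exit immediately at the triangle guard:

```python
from numba import cuda, float32

@cuda.jit
def kernel_cityblock_cuda_2d(x, out):
    # Schedule an n-by-n grid, but only upper-triangle threads (j < k)
    # compute a distance; the rest return at the guard.
    j, k = cuda.grid(2)
    m, n = x.shape
    if j < n and k < n and j < k:
        d = float32(0)
        for i in range(m):
            d += abs(x[i, j] - x[i, k])
        # Condensed index matching SciPy's pdist layout.
        out[j * n - (j * (j + 1)) // 2 + (k - j - 1)] = d
```

Launching this needs an explicit 2D configuration, e.g. kernel_cityblock_cuda_2d[(bx, by), (16, 16)](x_cuda, out), which is exactly the threads-per-block bookkeeping that forall avoids.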

@mrocklin
Member

mrocklin commented Aug 9, 2019 via email

@jakirkham
Member

> I wonder if guvectorize can help here when parallelizing across additional dimensions?

Yeah we chatted about this as well. I think it would be a good thing to try.

It's worth noting that pairwise distance is raveling the results a bit. So there may be a trick that we didn't quite wrap our heads around. Certainly it makes sense for the non-pairwise distance case.
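For the record, a sketch of what the non-pairwise case could look like with guvectorize (hypothetical and untested against this PR's data; accumulation here happens in float64 before the float32 store):

```python
from numba import guvectorize

@guvectorize(['void(float32[:], float32[:], float32[:])'], '(m),(m)->()',
             target='cuda')
def cityblock_gufunc(u, v, out):
    # Cityblock distance between two vectors; broadcasting supplies the
    # looping over extra dimensions that the CUDA kernel hand-rolls.
    d = 0.0
    for i in range(u.shape[0]):
        d += abs(u[i] - v[i])
    out[0] = d
```

Called as cityblock_gufunc(x.T[:, None, :], x.T[None, :, :]) this would broadcast to a full n-by-n distance matrix, computing each pair twice, which is where the condensed/raveled pairwise layout stops fitting the gufunc model.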

@jakirkham
Member

> An interesting point of detail here as well, if the pairwise distance was run as a 2D grid, that would mean n**2 threads get scheduled, but we only want to run n choose 2 threads. It would be easy to add some logic so that a thread only runs if it is in the upper triangle, for example, but then is there any cost to scheduling around twice as many threads as are needed? Just a point of curiosity.

This is a good question. I'm not sure of the answer. Hopefully one of the other CUDA experts cc'd can weigh in.

@jakirkham
Member

Thanks for working on this a bit with me, @alimanfoo, and sharing your experience thus far. Tried to capture our conversation from earlier in the comments. Please feel free to fill in or correct anything as needed.

@stuartarchibald

> Just noting this, @kkraus14 had suggested using forall as a way to avoid setting the threads to use. @alimanfoo found that this worked very nicely both from a user perspective and a performance perspective.
>
> Though one thing that this use case highlights here is users will want to loop over several dimensions, which forall doesn't support currently (unless we are missing something?). As a result, the coordinates get raveled and then unraveled in the kernel.
>
> If this pattern of raveling/unraveling with forall is reasonable, it would be handy to have this handled for the user (for example calling .forall((n, n-1))). Thus making it a bit easier to write Numba CUDA kernels. Thoughts @siebert? Happy to raise an issue to discuss this further if it is of interest.

CC @seibert ^ assuming that's a typo and Stan is meant?

@jakirkham
Member

> CC @seibert ^ assuming that's a typo and Stan is meant?

Yep, thanks for catching and correcting that Stuart! 😄

@jakirkham
Member

I took the liberty of turning this into markdown and filling in some text. Hope that is ok @alimanfoo. Happy to revert, move around, and work through stuff here as you see fit. Also please feel free to overwrite changes I've made if you see this going a different direction or if I've missed things. The genomics bits in particular are pretty rough as I'm definitely not the expert there. 😄

@jakirkham
Member

Also @mrocklin if you have other suggestions here, they would be welcome. 🙂

@alimanfoo
Contributor Author

Thanks @jakirkham. I don't think I'll get to this before going on leave, but if you don't mind hanging on to this for a couple of weeks I'd be happy to help finish it up when back.

Also happy to add a few notes on things I tried in earlier experiments that didn't work so well, as suggested by @mrocklin, if you think it would be worth it (basically it was trying a kernel that parallelised over rows instead of pairs, which then required an atomic add, which I think slowed it down).
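For completeness, a rough reconstruction of that rows-parallel experiment from the description above (not the original code), reusing a square_coords_cuda-style unravel and assuming out is a zero-initialized float32 array:

```python
from numba import cuda

@cuda.jit
def kernel_cityblock_cuda_rows(x, out):
    # Parallelize over rows as well as pairs: many threads contribute to
    # the same output slot, so accumulation must be an atomic add.
    i, pair_index = cuda.grid(2)
    m, n = x.shape
    if i < m and pair_index < out.shape[0]:
        j, k = square_coords_cuda(pair_index, n)
        cuda.atomic.add(out, pair_index, abs(x[i, j] - x[i, k]))
```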

@jakirkham
Member

jakirkham commented Aug 13, 2019

Yep, no worries. Just wanted to fill in some things so we have an idea of what is here and what to do when you are back. Figured it might make things more approachable when we return.

That could be interesting. As I filled this out, there were some points in the narrative that could benefit from these earlier parts of the story. Like how did you implement the distance function before using forall? Also how well did that work? Were there any tricks that others should be aware of?

@alimanfoo
Contributor Author

alimanfoo commented Aug 13, 2019 via email

@jakirkham
Member

Yeah there was a sizable bit near the end that I condensed down to a small bit of code and a few lines of text. There is a balance to be struck between info and length. Not sure what the right balance is in this case, but think we will be able to figure it out. My hope is we are kind of close with the content here.


@stuartarchibald left a comment


Thanks for writing this. I've read through and think it's a good, self-contained use case demonstrating how to go from a CPU algorithm to a potentially distributed GPU algorithm. The narrative does a good job of explaining a method for making such changes.

I've made a mixture of suggestions (acceptance is entirely optional :)), categorised approximately as follows:

  • stylistic/wording
  • ways to reinforce the narrative/motivation
  • additional items to help develop the CPU->GPU strategy described within

There's also the question of determining whether a kernel and its launch config are making good use of the hardware. Perhaps that should/could be a follow-up, and could contain more code tweaks and an exploration of performance analysis tooling?


First we need to move our data to the GPU. There are several ways we could do
this. Though one reasonably approachable way is to use Numba. We could write
the following to handle our in-memory case.


In memory with respect to what: in machine RAM, or in GPU RAM?
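That distinction could be made explicit in the text; with Numba, the copy from host (machine) RAM to GPU RAM is roughly as follows (x standing in for the post's array):

```python
from numba import cuda

# Copy the NumPy array from host RAM into GPU RAM; x_cuda is a Numba
# device array that the kernels can consume directly.
x_cuda = cuda.to_device(x)
```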



    @cuda.jit
    def kernel_cityblock_cuda(x, out):


I feel it would be useful to have a version of the CPU-equivalent algorithm for comparison, as it's not obvious what level of effort is required to convert from CPU to CUDA when the current baseline CPU version is a SciPy function call (scipy.spatial.distance.pdist). A useful side effect of this would be a nopython-mode JIT-compiled variant, which gives a hint at CPU/parallel-CPU performance vs. CUDA.

A further side effect would be that it might help in working out a strategy for converting such algorithms: njit(cpu) -> guvectorize(cpu) -> guvectorize(cuda) -> cuda kernels -> optimised cuda kernels (mixed in with dask?)?



    @cuda.jit(device=True)
    def square_coords_cuda(pair_index, n):


Some LaTeX spelling out the maths and/or a wiki link to a canonical form of this algorithm would be beneficial.
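Something like the following, assuming SciPy's condensed ordering of pairs (j, k) with j < k among n columns:

```latex
i(j, k) = jn - \frac{j(j+1)}{2} + (k - j - 1), \qquad 0 \le j < k < n
```

with the inverse used by the kernel:

```latex
j = \left\lfloor \frac{2n - 1 - \sqrt{(2n - 1)^2 - 8i}}{2} \right\rfloor,
\qquad
k = i - jn + \frac{j(j+1)}{2} + j + 1
```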


```python
dist_cuda = pairwise_cityblock_cuda(x_cuda)
cuda.synchronize()
```


Might want to explain why this is needed; it's a common stumbling block in assessing the performance of CUDA functions (Numba core devs answer this question regularly)?
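The pitfall only shows up when timing; a sketch, with pairwise_cityblock_cuda and x_cuda as defined in the post:

```python
import time
from numba import cuda

start = time.perf_counter()
dist_cuda = pairwise_cityblock_cuda(x_cuda)  # returns once kernels are queued
cuda.synchronize()  # wait for the GPU to actually finish
elapsed = time.perf_counter() - start
# Without synchronize(), elapsed measures only launch overhead, not the
# kernel's run time.
```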


```python
dist_big_cuda = pairwise_cityblock_dask(x_big_dask_cuda, f=pairwise_cityblock_cuda).compute()
```


It'd probably be worth making this into a notebook so that it can be downloaded and executed; it'd also shake out any bugs in the code blocks in the document. Having some canned/expected output would definitely help with ensuring that algorithm translations etc. are done correctly.


Yeah this started as a notebook originally. Then we moved it into this blogpost form so we could publish and share it a bit more broadly.

@jakirkham
Member

@alimanfoo , would you have a chance to go through this and fill in any scientific background I may have lost? Also feel free to change any of the text as you see fit. 🙂

@alimanfoo
Contributor Author

@jakirkham apologies, I've been totally buried. Will try to get back to this ASAP. cc'ing @jeromekelleher for interest.

Co-authored-by: stuartarchibald <[email protected]>
@jacobtomlinson
Member

It has been a couple of years since there has been activity here. Thanks for all the effort that went into this, but I'm going to close it out for now in an effort to tidy things up a little. It would be great to see more submissions like this though.
