
WIP: Pairwise distance dask cuda blog post #50


Closed

Conversation

alimanfoo
Contributor

Initial work towards a pairwise distance blog post, collaborating with @jakirkham.

@mrocklin
Member

mrocklin commented Aug 9, 2019

Thanks for doing this @alimanfoo and @jakirkham

From our brief encounter on this last month, I also found the narrative from your numba.cuda.jit exploration motivating:

  1. Wrote something on the CPU
  2. Tried numba.cuda.jit on a low-end GPU, wasn't any faster
  3. Wrote it again, got a modest speedup
  4. Tried it on a V100, got 100x speedup
  5. Did a bunch of things (what were they?), and got up to a 10x speedup on the laptop


@jakirkham left a comment


Just noting this: @kkraus14 had suggested using forall as a way to avoid setting the threads to use. @alimanfoo found that this worked very nicely, both from a user perspective and a performance perspective.

One thing this use case highlights, though, is that users will want to loop over several dimensions, which forall doesn't currently support (unless we are missing something?). As a result, the coordinates get raveled and then unraveled in the kernel.

If this pattern of raveling/unraveling with forall is reasonable, it would be handy to have it handled for the user (for example, by calling .forall((n, n-1))), making it a bit easier to write Numba CUDA kernels. Thoughts @seibert? Happy to raise an issue to discuss this further if it is of interest.
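To make the proposal concrete, here is a rough sketch of how such a shape-aware launch could be approximated today; `forall_2d` and `kernel_2d_flat` are hypothetical names, not an existing Numba API:

```python
import numpy as np
from numba import cuda

@cuda.jit
def kernel_2d_flat(out):
    # Each thread gets a flat index and unravels it to 2D coordinates.
    n_rows, n_cols = out.shape
    i = cuda.grid(1)
    if i < n_rows * n_cols:
        r = i // n_cols
        c = i % n_cols
        out[r, c] = r + c

def forall_2d(kernel, n_rows, n_cols):
    # Approximates a hypothetical kernel.forall((n_rows, n_cols)) by
    # raveling the index space and letting forall pick the launch config.
    return kernel.forall(n_rows * n_cols)

out = cuda.device_array((4, 5), dtype=np.float32)
forall_2d(kernel_2d_flat, 4, 5)(out)
```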

" out = cuda.device_array(n_pairs, dtype=np.float32)\n",
"\n",
" # Let numba decide number of threads and blocks.\n",
" kernel_spec = kernel_cityblock_cuda.forall(n_pairs)\n",

Here we use the raveled index to pass to forall (n_pairs is computed a few lines earlier).
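For context, the object returned by forall is a configured launcher that is then called like the kernel itself. A minimal sketch, with names assumed from the snippets here:

```python
# n_pairs counts the distinct column pairs, i.e. n choose 2.
n_pairs = (n * (n - 1)) // 2
kernel_spec = kernel_cityblock_cuda.forall(n_pairs)
kernel_spec(x_cuda, out)  # launch; forall chose blocks/threads for us
```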

" pair_index = cuda.grid(1)\n",
" if pair_index < n_pairs:\n",
" # Unpack the pair index to column indices.\n",
" j, k = square_coords_cuda(pair_index, n)\n",

Here is where the unraveling is occurring for context.
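The body of square_coords_cuda isn't shown in this hunk. For readers following along, a minimal sketch of what such a device function might look like, assuming SciPy's condensed ordering of pairs (j, k) with j < k:

```python
import math
from numba import cuda

@cuda.jit(device=True)
def square_coords_cuda(pair_index, n):
    # Invert the condensed (upper-triangle) layout, where
    #   pair_index = j*n - j*(j+1)//2 + (k - j - 1)  for j < k,
    # using the closed-form quadratic inverse for j.
    # (For very large n, floating-point rounding here would need care.)
    j = int((2 * n - 1 - math.sqrt((2 * n - 1) ** 2 - 8 * pair_index)) // 2)
    k = pair_index - j * n + (j * (j + 1)) // 2 + j + 1
    return j, k
```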

"source": [
"def pairwise_cityblock_cpu(x):\n",
" assert x.ndim == 2\n",
" out = spd.pdist(x.T, metric='cityblock')\n",

To record a point from earlier for follow-up: pdist is likely using float64, whereas the kernels written below use float32. It would be good to standardize these to make the comparison fair. That said, this is likely only about a 2x difference relative to what we have here (unless I'm missing something).

One option would be to write a Numba kernel for this case as well. Though after talking with @alimanfoo, it seems SciPy's implementation is a bit more performant than a naive Numba kernel. Also, SciPy seems to handle C- and Fortran-order arrays with identical performance. So there may be some other tricks buried in the SciPy implementation that would be worth learning from.
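To make that option concrete, a sketch of what a float32 Numba CPU variant might look like (an illustration, not code from this PR), matching the condensed layout of pdist(x.T):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def pairwise_cityblock_cpu_numba(x):
    # x has shape (m, n); distances are between columns, as with pdist(x.T).
    m, n = x.shape
    out = np.empty((n * (n - 1)) // 2, dtype=np.float32)
    for j in prange(n):
        for k in range(j + 1, n):
            d = np.float32(0)
            for i in range(m):
                d += abs(x[i, j] - x[i, k])
            # Condensed index matching SciPy's pdist layout.
            out[j * n - (j * (j + 1)) // 2 + (k - j - 1)] = d
    return out
```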

"metadata": {},
"outputs": [],
"source": [
"# launch a local cuda cluster?"

It seems that just running with multiple Dask workers on large data with a single GPU doesn't yield much here, as the GPU is basically saturated with work. We suspect multiple GPUs will be more useful, though, since chunks of data can be loaded onto and computed on each GPU, giving a speedup from parallelization.
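For reference, a sketch of what that multi-GPU setup could look like, assuming the dask-cuda package is installed:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker is started per visible GPU; chunks of a Dask array can
# then be loaded onto and computed on each GPU in parallel.
cluster = LocalCUDACluster()
client = Client(cluster)
```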

@alimanfoo
Contributor Author

Thanks @jakirkham. Just to xref numba/numba#4309, in which I asked about another possible way to simplify running CUDA kernels, including over multiple dimensions. As a user I would generally be happy with any approach that means I don't have to hard-code the number of threads per block. I realise that's maybe hard to do in all cases, but if there were a reasonable approximation it could remove an entry barrier to programming CUDA kernels.

An interesting point of detail here as well: if the pairwise distance was run as a 2D grid, n**2 threads would get scheduled, but we only want to run n choose 2 threads. It would be easy to add some logic so that a thread only runs if it is in the upper triangle, for example, but then is there any cost to scheduling around twice as many threads as are needed? Just a point of curiosity.
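For illustration, the 2D-grid variant being discussed might look like the sketch below (not the kernel in this PR); roughly half the scheduled threads exit immediately at the triangle guard:

```python
from numba import cuda, float32

@cuda.jit
def kernel_cityblock_cuda_2d(x, out):
    # Schedule an n-by-n grid, but only upper-triangle threads (j < k)
    # compute a distance; the rest return at the guard.
    j, k = cuda.grid(2)
    m, n = x.shape
    if j < n and k < n and j < k:
        d = float32(0)
        for i in range(m):
            d += abs(x[i, j] - x[i, k])
        # Condensed index matching SciPy's pdist layout.
        out[j * n - (j * (j + 1)) // 2 + (k - j - 1)] = d
```

Launching this needs an explicit 2D configuration, e.g. kernel_cityblock_cuda_2d[(bx, by), (16, 16)](x_cuda, out), which is exactly the threads-per-block bookkeeping that forall avoids.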

@mrocklin
Member

mrocklin commented Aug 9, 2019 via email

@jakirkham
Member

> I wonder if guvectorize can help here when parallelizing across additional dimensions?

Yeah we chatted about this as well. I think it would be a good thing to try.

It's worth noting that pairwise distance is raveling the results a bit. So there may be a trick that we didn't quite wrap our heads around. Certainly it makes sense for the non-pairwise distance case.
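For the record, a sketch of what the non-pairwise case could look like with guvectorize (hypothetical and untested against this PR's data; accumulation here happens in float64 before the float32 store):

```python
from numba import guvectorize

@guvectorize(['void(float32[:], float32[:], float32[:])'], '(m),(m)->()',
             target='cuda')
def cityblock_gufunc(u, v, out):
    # Cityblock distance between two vectors; broadcasting supplies the
    # looping over extra dimensions that the CUDA kernel hand-rolls.
    d = 0.0
    for i in range(u.shape[0]):
        d += abs(u[i] - v[i])
    out[0] = d
```

Called as cityblock_gufunc(x.T[:, None, :], x.T[None, :, :]) this would broadcast to a full n-by-n distance matrix, computing each pair twice, which is where the condensed/raveled pairwise layout stops fitting the gufunc model.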

@jakirkham
Member

> An interesting point of detail here as well, if the pairwise distance was run as a 2D grid, that would mean n**2 threads get scheduled, but we only want to run n choose 2 threads. It would be easy to add some logic so that a thread only runs if it is in the upper triangle, for example, but then is there any cost to scheduling around twice as many threads as are needed? Just a point of curiosity.

This is a good question. I'm not sure of the answer. Hopefully one of the other CUDA experts cc'd can weigh in.

@jakirkham
Member

Thanks for working on this a bit with me, @alimanfoo, and sharing your experience thus far. Tried to capture our conversation from earlier in the comments. Please feel free to fill in or correct anything as needed.

@stuartarchibald

> Just noting this, @kkraus14 had suggested using forall as a way to avoid setting the threads to use. @alimanfoo found that this worked very nicely both from a user perspective and a performance perspective.
>
> Though one thing that this use case highlights here is users will want to loop over several dimensions, which forall doesn't support currently (unless we are missing something?). As a result, the coordinates get raveled and then unraveled in the kernel.
>
> If this pattern of raveling/unraveling with forall is reasonable, it would be handy to have this handled for the user (for example calling .forall((n, n-1))). Thus making it a bit easier to write Numba CUDA kernels. Thoughts @siebert? Happy to raise an issue to discuss this further if it is of interest.

CC @seibert ^ assuming that's a typo and Stan is meant?

@jakirkham
Member

> CC @seibert ^ assuming that's a typo and Stan is meant?

Yep, thanks for catching and correcting that Stuart! 😄

@jakirkham
Member

I took the liberty of turning this into markdown and filling in some text. Hope that is ok @alimanfoo. Happy to revert, move around, and work through stuff here as you see fit. Also please feel free to overwrite changes I've made if you see this going a different direction or if I've missed things. The genomics bits in particular are pretty rough as I'm definitely not the expert there. 😄

@jakirkham
Member

Also @mrocklin if you have other suggestions here, they would be welcome. 🙂

@alimanfoo
Contributor Author

Thanks @jakirkham. I don't think I'll get to this before going on leave, but if you don't mind hanging on to this for a couple of weeks I'd be happy to help finish it up when back.

Also happy to add a few notes on things I tried in earlier experiments that didn't work so well, as suggested by @mrocklin, if you think it would be worth it (basically it was trying a kernel that parallelised over rows instead of pairs, which then required an atomic add, which I think slowed it down).
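For completeness, a rough reconstruction of that rows-parallel experiment from the description above (not the original code), reusing a square_coords_cuda-style unravel and assuming out is a zero-initialized float32 array:

```python
from numba import cuda

@cuda.jit
def kernel_cityblock_cuda_rows(x, out):
    # Parallelize over rows as well as pairs: many threads contribute to
    # the same output slot, so accumulation must be an atomic add.
    i, pair_index = cuda.grid(2)
    m, n = x.shape
    if i < m and pair_index < out.shape[0]:
        j, k = square_coords_cuda(pair_index, n)
        cuda.atomic.add(out, pair_index, abs(x[i, j] - x[i, k]))
```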

@jakirkham
Member

jakirkham commented Aug 13, 2019

Yep, no worries. Just wanted to fill in some things so we have an idea of what is here and what to do when you are back. Figured it might make things more approachable when we return.

That could be interesting. As I filled this out, there were some points in the narrative that could benefit from these earlier parts of the story. Like how did you implement the distance function before using forall? Also how well did that work? Were there any tricks that others should be aware of?

@alimanfoo
Contributor Author

alimanfoo commented Aug 13, 2019 via email

@jakirkham
Member

Yeah there was a sizable bit near the end that I condensed down to a small bit of code and a few lines of text. There is a balance to be struck between info and length. Not sure what the right balance is in this case, but think we will be able to figure it out. My hope is we are kind of close with the content here.


@stuartarchibald left a comment


Thanks for writing this. I've read through and think it's a good, self-contained use case demonstrating how to go from a CPU algorithm to a potentially distributed GPU algorithm. The narrative does a good job of explaining a method for making such changes.

I've made a mixture of suggestions (acceptance is entirely optional :)), categorised approximately as follows:

  • stylistic/wording
  • ways to reinforce the narrative/motivation
  • additional items to help develop the CPU->GPU strategy described within

There's also the question of determining whether a kernel and its launch config are making good use of the hardware. Perhaps that should/could be a follow-up, and could contain more code tweaks and an exploration of performance analysis tooling?


First we need to move our data to the GPU. There are several ways we could do
this. Though one reasonably approachable way is to use Numba. We could write
the following to handle our in-memory case.


In memory with respect to what: in machine RAM, or in GPU RAM?
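That distinction could be made explicit in the text; with Numba, the copy from host (machine) RAM to GPU RAM is roughly as follows (x standing in for the post's array):

```python
from numba import cuda

# Copy the NumPy array from host RAM into GPU RAM; x_cuda is a Numba
# device array that the kernels can consume directly.
x_cuda = cuda.to_device(x)
```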



    @cuda.jit
    def kernel_cityblock_cuda(x, out):


I feel it would be useful to have a version of the CPU-equivalent algorithm for comparison, as it's not obvious what level of effort is required to convert from CPU to CUDA when the current baseline CPU version is a SciPy function call (scipy.spatial.distance.pdist). A useful side effect of this would be a nopython-mode JIT-compiled variant, which gives a hint at CPU/parallel-CPU performance vs. CUDA.

A further side effect would be that it might help in working out a strategy for converting such algorithms: njit(cpu) -> guvectorize(cpu) -> guvectorize(cuda) -> cuda kernels -> optimised cuda kernels (mixed in with dask?)?



    @cuda.jit(device=True)
    def square_coords_cuda(pair_index, n):


Some LaTeX spelling out the maths and/or a wiki link to a canonical form of this algorithm would be beneficial.
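Something like the following, assuming SciPy's condensed ordering of pairs (j, k) with j < k among n columns:

```latex
i(j, k) = jn - \frac{j(j+1)}{2} + (k - j - 1), \qquad 0 \le j < k < n
```

with the inverse used by the kernel:

```latex
j = \left\lfloor \frac{2n - 1 - \sqrt{(2n - 1)^2 - 8i}}{2} \right\rfloor,
\qquad
k = i - jn + \frac{j(j+1)}{2} + j + 1
```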


```python
dist_cuda = pairwise_cityblock_cuda(x_cuda)
cuda.synchronize()
```


Might want to explain why this is needed; it's a common stumbling block in assessing the performance of CUDA functions (Numba core devs answer this question regularly)?
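The pitfall only shows up when timing; a sketch, with pairwise_cityblock_cuda and x_cuda as defined in the post:

```python
import time
from numba import cuda

start = time.perf_counter()
dist_cuda = pairwise_cityblock_cuda(x_cuda)  # returns once kernels are queued
cuda.synchronize()  # wait for the GPU to actually finish
elapsed = time.perf_counter() - start
# Without synchronize(), elapsed measures only launch overhead, not the
# kernel's run time.
```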


```python
dist_big_cuda = pairwise_cityblock_dask(x_big_dask_cuda, f=pairwise_cityblock_cuda).compute()
```


It'd probably be worth making this into a notebook so that it can be downloaded and executed; it'd also shake out any bugs in the code blocks in the document. Having some canned/expected output would definitely help with ensuring that algorithm translations etc. are done correctly.


Yeah this started as a notebook originally. Then we moved it into this blogpost form so we could publish and share it a bit more broadly.

@jakirkham
Member

@alimanfoo , would you have a chance to go through this and fill in any scientific background I may have lost? Also feel free to change any of the text as you see fit. 🙂

@alimanfoo
Contributor Author

@jakirkham apologies, I've been totally buried. Will try to get back to this ASAP. cc'ing @jeromekelleher for interest.

Co-authored-by: stuartarchibald <[email protected]>
@jacobtomlinson
Member

It has been a couple of years since there has been activity here. Thanks for all the effort that went into this, but I'm going to close it out for now in an effort to tidy things up a little. It would be great to see more submissions like this though.
