[CUDA] Multi-device context support in CUDA backend #4381
Comments
As you mention, being explicit about the devices targeted by specific operations, such as memory operations, may require both big runtime and PI changes. However, there may be a way to have multiple CUDA contexts per PI context without requiring too much code that would later be thrown away, and with minimal or no changes to PI. To do this, the CUDA backend would need to create multiple platforms, each with a collection of contexts where all the devices can access each other's memory. This ability can be queried and enabled through the CUDA driver API. When creating a PI context with multiple devices, memory allocations could either be on a central CUDA context or be distributed between the CUDA contexts (randomly, round-robin, most-available-memory-first, etc.). Whichever CUDA context you then launch a kernel on would be able to access the memory, albeit potentially slowly. The alternative above is obviously not optimal, but it has the benefit of relying on minimal or no changes to the runtime and PI. It may also make it clearer which operations in particular will need the device arguments for the CUDA backend to make smarter decisions when migrating to the approach you mention, @AerialMantis. Additionally, it would introduce the changes to interop, which is important to get out ASAP, as it is user-facing and the sooner it is changed the less user code will be affected.
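For illustration, a minimal sketch of how peer access could be queried and enabled through the CUDA driver API (error handling omitted; this is not intended as the actual plugin implementation):

```cpp
// Minimal sketch (error handling omitted): query whether two devices can
// access each other's memory and enable peer access between their contexts.
#include <cuda.h>

bool enable_peer_access(CUdevice dev_a, CUdevice dev_b,
                        CUcontext ctx_a, CUcontext ctx_b) {
  int a_to_b = 0, b_to_a = 0;
  cuDeviceCanAccessPeer(&a_to_b, dev_a, dev_b);
  cuDeviceCanAccessPeer(&b_to_a, dev_b, dev_a);
  if (!a_to_b || !b_to_a)
    return false;

  // Peer access is directional, so enable it in both directions.
  cuCtxSetCurrent(ctx_a);
  cuCtxEnablePeerAccess(ctx_b, /*Flags=*/0);  // flags must currently be 0
  cuCtxSetCurrent(ctx_b);
  cuCtxEnablePeerAccess(ctx_a, /*Flags=*/0);
  return true;
}
```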
Could the CUDA plugin keep a map from memory allocation to device, so that each piEnqueue* API checks whether the target device (associated with a queue) has the required memory allocation by looking it up in the map, and if not, implicitly schedules a P2P copy and updates the map? It sounds like a partial duplication of work between the SYCL RT and the CUDA plugin, though. So, introducing a clEnqueueMigrateMemObjects-like PI API and making the SYCL RT call it for moving memory between devices from the same context sounds like a good option/optimization. Also, I believe we need to take a look at/prototype these solutions for the Level Zero plugin.
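As a rough illustration of the kind of bookkeeping this would imply inside the plugin (all names here are hypothetical, not existing PI CUDA code):

```cpp
// Hypothetical bookkeeping inside the CUDA plugin (illustrative only):
// remember which device currently owns each allocation, and migrate lazily
// before an enqueue that targets another device.
#include <cuda.h>
#include <cstddef>
#include <unordered_map>

struct allocation_record {
  CUdeviceptr ptr;   // current location of the data
  size_t size;       // allocation size in bytes
  int owner_device;  // index into the PI context's device list
};

static std::unordered_map<void *, allocation_record> allocation_map;

// Would be called from a piEnqueue* entry point before work is submitted to
// `target_device` (freeing the old copy and event handling are omitted).
void migrate_if_needed(void *mem, int target_device,
                       CUcontext src_ctx, CUcontext dst_ctx) {
  auto it = allocation_map.find(mem);
  if (it == allocation_map.end() || it->second.owner_device == target_device)
    return;  // untracked, or already resident on the target device

  CUdeviceptr dst = 0;
  cuCtxSetCurrent(dst_ctx);  // allocate on the target device
  cuMemAlloc(&dst, it->second.size);
  cuMemcpyPeer(dst, dst_ctx, it->second.ptr, src_ctx, it->second.size);

  it->second.ptr = dst;      // update the map
  it->second.owner_device = target_device;
}
```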
I don't see why not. However, if it is also useful for Level Zero (and ROCm, I suspect), then I fear there would be a lot of duplication in the backends. Arguably it would be better for the runtime to handle this for all backends that aren't able to handle it themselves, which I suppose is what the addition of device parameters in the corresponding PI operations would be.
Hi @steffenlarsen @romanovvlad, apologies for the delay in getting back to this; we haven't been focusing on it, but we're going to start looking at it again.
In general I'm not keen on having multiple platforms for the CUDA backend, as this would create a divergence in the topology mapping between the CUDA backend and other backends, which could lead to users having to special-case their applications when targeting CUDA. That said, considering this approach, I'm not sure I fully understand it. If we were to have multiple platforms, each with a collection of devices/contexts, I'm not sure how this would be mapped internally. Would this mean having a single pool of devices/contexts accessible by all platforms, in which case all platforms would reflect the same devices, or would each platform have its own copy of the same set of devices/contexts? In both cases I worry this could lead to an inaccurate representation of the topology, and in the latter case it would mean duplicated context allocations. You do mention that this would be sub-optimal and more of a stepping stone towards a full solution, and I could see that being useful, though I'm tempted to go directly to the fully integrated solution, even if we have to do that in several incremental stages.
This was my thinking as well; I would prefer to have peer-to-peer data movement invoked by the SYCL runtime rather than implicitly by PI CUDA, as that could lead to the SYCL runtime not having an accurate picture of the current location of data and possibly performing additional data movement. So if we were to implement this, I was thinking we could break it up into the following stages, which I hope could be introduced incrementally. This is just a draft, so please let me know what you think.
It sounds like the SYCL context : multiple CUDA contexts mapping described above maps to the Level Zero equivalent. #6104 is relevant for this (although buffers aren't mentioned yet).
@smaslov-intel, can you comment please?
The direction we took is to "migrate" memory in the plugins without explicit SYCL RT calls. The reason for that is to avoid redundant copies in the OpenCL RT, which already performs buffer migration under the hood. The Level Zero plugin migration was initially added in #5966. Currently, migration means a copy from an up-to-date location to the device where the memory is going to be used, but in the future we'd optimize this and enable P2P access where applicable and profitable.
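Conceptually, such a migration step boils down to appending a copy from the up-to-date location onto a command list for the target device; a minimal sketch (names and setup are illustrative, not the actual plugin code):

```cpp
// Conceptual sketch only (not the actual plugin code): before a device uses
// a buffer, copy it from its up-to-date location to that device.
#include <level_zero/ze_api.h>
#include <cstddef>

void migrate_to_target(ze_command_list_handle_t target_cmd_list,
                       void *dst_on_target_device,
                       const void *src_up_to_date_copy, size_t size) {
  // Appends the copy to a command list that executes on the target device;
  // events/synchronization are omitted for brevity.
  zeCommandListAppendMemoryCopy(target_cmd_list, dst_on_target_device,
                                src_up_to_date_copy, size,
                                /*hSignalEvent=*/nullptr,
                                /*numWaitEvents=*/0,
                                /*phWaitEvents=*/nullptr);
}
```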
Closed by #13616 |
Describe the bug
A SYCL context can be constructed with either a single device, or multiple devices as long as all of those devices are of the same platform. However, the CUDA backend currently doesn't support the multi-device option.
This is due to a limitation in the implementation of the context in the PI plugin for CUDA, which derives from a limitation in the CUDA programming model, where a CUDA context can only be associated with a single CUDA device, and from a decision in the initial implementation of the CUDA backend to map a SYCL context 1:1 to a CUDA context.
This limits the multi-device context use case that is supported by other DPC++ backends, which could potentially lead to users configuring contexts differently depending on the backend.
To Reproduce
You can reproduce this by constructing a context from multiple devices of the same platform, when targeting the CUDA backend.
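A minimal reproducer sketch along these lines (assuming a platform exposing at least two GPU devices; the header and device selection may differ between DPC++ versions):

```cpp
// Minimal reproducer sketch: build a context from all GPU devices of a
// platform; other backends accept this, but the CUDA backend currently
// does not support multi-device contexts.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  for (const auto &platform : sycl::platform::get_platforms()) {
    auto devices = platform.get_devices(sycl::info::device_type::gpu);
    if (devices.size() < 2)
      continue;  // need at least two devices on one platform
    sycl::context multi_device_ctx(devices);
    std::cout << "Created a context with " << devices.size()
              << " devices on "
              << platform.get_info<sycl::info::platform::name>() << "\n";
  }
  return 0;
}
```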
Proposed solution
Note this idea is still a work in progress, but I wanted to share what I had so far to get some feedback on it.
The proposed solution here would be to alter the implementation of the PI CUDA context such that it contains multiple CUDA contexts, where each one corresponds to a CUDA device. This would allow the SYCL context to represent multiple devices as is expected.
However, the caveat to this is that the PI CUDA context would now hold multiple CUDA contexts and devices, which means that at any point in the DPC++ SYCL runtime where a context-specific operation needs to be performed, it would be necessary to determine which CUDA context should be used, which requires knowledge of the target device.
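To make the idea concrete, an illustrative sketch of what such a multi-context PI context could look like (structure and names are hypothetical, not the actual plugin code):

```cpp
// Illustrative sketch only: a PI context that owns one native CUDA context
// per device and selects the right one based on the target device.
#include <cuda.h>
#include <vector>

struct _pi_device;  // existing PI device wrapper holding a CUdevice

struct _pi_context {
  struct device_context {
    _pi_device *device;
    CUcontext cuda_ctx;
  };
  std::vector<device_context> device_contexts;

  // Any context-scoped operation now needs the target device in order to
  // know which native context to make current.
  CUcontext get_native_context(const _pi_device *dev) const {
    for (const auto &dc : device_contexts)
      if (dc.device == dev)
        return dc.cuda_ctx;
    return nullptr;  // device is not part of this context
  }
};
```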
This means that certain parts of the DPC++ SYCL runtime may need to be altered in order to ensure that when a context is needed the device is also accessible. I am still investigating this further in order to identify what specific changes would need to be made and whether this would cause any significant problems, but I have an initial high-level assessment of potential problem areas.
The `malloc_*` functions are associated with a context and a device, either directly or via a queue; however, the `free` function only takes a context, so the device on which the memory was allocated may not be known (see the sketch below).

There may be other areas to consider, but this is what I have identified so far. Some of these problem areas may also require minor modifications to the SYCL specification; I suspect and hope that won't be necessary, though it's something to consider.
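To illustrate the `free` problem area mentioned above, a small sketch using the SYCL 2020 USM free functions:

```cpp
// Sketch of the asymmetry: allocation names a device (directly or via a
// queue), but sycl::free only receives the context.
#include <sycl/sycl.hpp>

void usm_example(const sycl::device &dev, const sycl::context &ctx) {
  // The allocation is tied to a specific device within the context...
  int *ptr = sycl::malloc_device<int>(1024, dev, ctx);

  // ...but deallocation only gets the context, so a multi-CUDA-context
  // implementation would have to look up which native context owns `ptr`.
  sycl::free(ptr, ctx);
}
```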
Another potential problem is that the changes described above may break an underlying assumption in the DPC++ SYCL runtime (as I understand it, please correct me if I'm wrong) that if a memory object is in the same context, no explicit memory movement is required. A possible solution to this is to introduce a PI plugin API for moving data between two devices in the same context, which for most backends could be a no-op; for OpenCL this could be an opportunity to use `clEnqueueMigrateMemObjects`, while for the CUDA backend it would perform peer-to-peer copies between the contexts (as implemented in #4332).

cc @alexey-bataev @steffenlarsen @Ruyk @JackAKirk
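For reference, a hypothetical sketch of what such a PI migration entry point could look like (the name and signature are purely illustrative and do not exist in PI today; the handle types come from the PI headers in the DPC++ runtime):

```cpp
// Hypothetical PI extension (illustrative only): migrate a memory object
// between two devices of the same context. Most backends could implement it
// as a no-op; the OpenCL plugin could forward to clEnqueueMigrateMemObjects,
// and the CUDA plugin could issue a peer-to-peer copy between its
// per-device contexts.
pi_result piextEnqueueMemMigrate(
    pi_queue dst_queue,        // queue on the destination device
    pi_mem mem,                // memory object to migrate
    pi_device src_device,      // device currently holding the data
    pi_device dst_device,      // device the data should move to
    pi_uint32 num_events_in_wait_list,
    const pi_event *event_wait_list,
    pi_event *event);
```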