[CUDA] Multi-device context support in CUDA backend #4381

Closed
AerialMantis opened this issue Aug 20, 2021 · 8 comments
Labels: cuda (CUDA back-end), enhancement (New feature or request)

Comments

@AerialMantis
Contributor

Describe the bug

A SYCL context can be constructed with either a single device or multiple devices, as long as all of those devices belong to the same platform. However, the CUDA backend currently doesn't support the multi-device option.

This is due to a limitation in the implementation of the context in the PI plugin for CUDA. It derives from a limitation in the CUDA programming model, where a CUDA context can only be associated with a single CUDA device, and from a decision in the initial implementation of the CUDA backend to map a SYCL context 1:1 with a CUDA context.

This limits the multi-device context use case that is supported by other DPC++ backends, and could lead to users having to configure contexts differently depending on the backend.

To Reproduce

You can reproduce this by constructing a context from multiple devices of the same platform, when targeting the CUDA backend.

auto platforms = cl::sycl::platform::get_platforms();
if (platforms.size() > 0) {
  auto devices = platforms[0].get_devices();
  if (devices.size() > 1) {
    // If this constructor is reached, it will fail.
    auto multiDeviceContext = cl::sycl::context(devices);
  }
}

Proposed solution

Note this idea is still a work in progress, but I wanted to share what I had so far to get some feedback on it.

The proposed solution here would be to alter the implementation of the PI CUDA context such that it contains multiple CUDA contexts, where each one corresponds to a CUDA device. This would allow the SYCL context to represent multiple devices as is expected.
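
As a rough sketch of this mapping (the type and member names below are hypothetical, not the actual PI CUDA plugin code), the PI context could own one CUDA context per device and look roughly like this:

#include <cuda.h>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of a PI CUDA context owning one CUDA context per
// device of the SYCL context; illustrative only.
struct multi_device_pi_context {
  std::vector<CUdevice> devices;                    // devices of the SYCL context
  std::unordered_map<CUdevice, CUcontext> contexts; // one CUcontext per device

  // Context-specific operations now need the target device in order to
  // select the matching CUDA context.
  CUcontext context_for(CUdevice dev) const { return contexts.at(dev); }
};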

However, the caveat is that the PI CUDA context would now hold multiple CUDA contexts and devices. This means that at any point in the DPC++ SYCL runtime where a context-specific operation needs to be performed, it would be necessary to determine which CUDA context should be used, which requires knowledge of the target device.

This means that certain parts of the DPC++ SYCL runtime may need to be altered in order to ensure that when a context is needed the device is also accessible. I am still investigating this further in order to identify what specific changes would need to be made and whether this would cause any significant problems, but I have an initial high-level assessment of potential problem areas.

  • Buffers - generally memory allocations and copy operations are derived from command groups, which have knowledge of the device; however, there may be cases where the device associated with a dependency is not available.
  • USM - generally USM operations such as malloc_* are associated with a context and a device, either directly or via a queue; however, the free function only takes a context, so the device the memory was allocated on may not be known (see the snippet after this list).
  • Events - generally an event is associated with a context and doesn't have any knowledge of the device it originated from.
  • Kernel bundles - kernel bundles are also associated with a context, though they are additionally associated with a set of devices, so this might require some changes, but I suspect it shouldn't cause an issue.
  • Interop - interop with the CUDA backend would need to change in that the native object for a context would become a vector of CUDA contexts, and the implicit relationship with the devices they are associated with would need to be documented.
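
To illustrate the USM point with a minimal snippet (using the DPC++ USM API), the deallocation call only sees the context, not the device the memory was allocated on:

#include <CL/sycl.hpp>

int main() {
  cl::sycl::queue q;  // bound to a single device
  // The device is known at allocation time, via the queue.
  float *p = cl::sycl::malloc_device<float>(1024, q);
  // ... submit kernels to q that use p ...
  // Only the context is passed at deallocation, so a multi-CUDA-context
  // implementation would not know from this call alone which per-device
  // context owns the allocation.
  cl::sycl::free(p, q.get_context());
  return 0;
}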

There may be other areas to consider, but this is what I have identified so far. Some of these problem areas may also require minor modifications to the SYCL specification, I suspect and hope that won't be necessary, though it's something to consider.

Another potential problem is that the changes described above may break an underlying assumption in the DPC++ SYCL runtime (as I understand it, please correct me if I'm wrong) that if a memory object is in the same context no explicit memory movement is required. A possible solution is to introduce a PI plugin API for moving data between two devices in the same context, which for most backends could be a no-op; for OpenCL this could be an opportunity to use clEnqueueMigrateMemObjects, while for the CUDA backend it would perform peer-to-peer copies between the contexts (as implemented in #4332).
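
To sketch what the CUDA side of such a plugin API could boil down to (illustrative only, not the implementation referenced above), the copy between the two per-device contexts could use cuMemcpyPeer:

#include <cuda.h>
#include <cstddef>

// Illustrative only: copy `bytes` from an allocation in src_ctx to an
// allocation in dst_ctx using the CUDA driver API's peer copy.
CUresult migrate_between_contexts(CUdeviceptr dst, CUcontext dst_ctx,
                                  CUdeviceptr src, CUcontext src_ctx,
                                  std::size_t bytes) {
  // cuMemcpyPeer copies between allocations that live in different CUDA
  // contexts, using peer-to-peer transfers where the hardware supports them.
  return cuMemcpyPeer(dst, dst_ctx, src, src_ctx, bytes);
}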

cc @alexey-bataev @steffenlarsen @Ruyk @JackAKirk

@steffenlarsen
Contributor

As you mention, being explicit about the devices targeted by specific operations, such as memory operations, may require both big runtime and PI changes.

However, there may be a way to make it possible to have multiple CUDA contexts per PI context without requiring too much future-expendable code and with only minimal or no changes to PI. To do this, the CUDA backend would need to create multiple platforms, each with a collection of contexts where all the devices can access each other's memory. This ability can be queried and enabled through the CUDA driver API. When creating a PI context with multiple devices, memory allocations could either be on a central CUDA context or be distributed between the CUDA contexts (randomly, round-robin, most-available-memory-first, etc.). Whichever CUDA context you then launch a kernel on would be able to access the memory, albeit potentially slow.
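
For reference, a minimal sketch of that query/enable step with the CUDA driver API (error handling omitted; assumes the two contexts were created for the two devices):

#include <cuda.h>

// Query whether two devices can access each other's memory and, if so,
// enable peer access in both directions between their contexts.
bool enable_peer_access(CUdevice dev_a, CUcontext ctx_a,
                        CUdevice dev_b, CUcontext ctx_b) {
  int a_to_b = 0, b_to_a = 0;
  cuDeviceCanAccessPeer(&a_to_b, dev_a, dev_b);
  cuDeviceCanAccessPeer(&b_to_a, dev_b, dev_a);
  if (!a_to_b || !b_to_a)
    return false;
  // Peer access is enabled per direction, from the current context to the
  // peer context; the flags argument must be 0.
  cuCtxSetCurrent(ctx_a);
  cuCtxEnablePeerAccess(ctx_b, 0);
  cuCtxSetCurrent(ctx_b);
  cuCtxEnablePeerAccess(ctx_a, 0);
  return true;
}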

The alternative above is obviously not optimal, but it has the benefit of relying on minimal or no changes to the runtime and PI. This may also make it clearer which operations in particular will need the device arguments for the CUDA backend to make smarter decisions when migrating to the approach you mention, @AerialMantis . Additionally, it would introduce the changes to interop, which is important to get out ASAP as it is user-facing and the sooner it is changed the less user code will be affected.

@romanovvlad
Contributor

Whichever CUDA context you then launch a kernel on would be able to access the memory, albeit potentially slow.

Can the CUDA plugin keep a map "memory allocation" -> device, so that each piEnqueue* API checks whether the target device (associated with a queue) has the required memory allocation by looking at the map, and if not, implicitly schedules a P2P copy and updates the map? It sounds like a partial duplication of work between the SYCL RT and the CUDA plugin though.
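
As a rough illustration of the bookkeeping this would imply in the plugin (hypothetical names, not existing code):

#include <cuda.h>
#include <unordered_map>

// Hypothetical tracker mapping each allocation to the device that currently
// holds its up-to-date copy; a piEnqueue* implementation could consult it
// before launching work on a queue's device.
struct allocation_location_map {
  std::unordered_map<void *, CUdevice> location;

  // True if the allocation is known to live on a different device than the
  // one the enqueued work will run on, i.e. a P2P copy should be scheduled.
  bool needs_migration(void *allocation, CUdevice target) const {
    auto it = location.find(allocation);
    return it != location.end() && it->second != target;
  }

  void record(void *allocation, CUdevice device) { location[allocation] = device; }
};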

So, introducing clEnqueueMigrateMemObjects PI API and making SYCL RT call it for moving memory between devices from the same context sounds like a good option/optimization.

Also, I believe we need to have a look at / prototype these solutions for the Level Zero plugin.

@steffenlarsen
Contributor

Can the CUDA plugin keep a map "memory allocation" -> device, so that each piEnqueue* API checks whether the target device (associated with a queue) has the required memory allocation by looking at the map, and if not, implicitly schedules a P2P copy and updates the map? It sounds like a partial duplication of work between the SYCL RT and the CUDA plugin though.

I don't see why not. However, if it is also useful for Level Zero (and ROCm I suspect) then I fear a lot of duplication in the backends. Arguably it would be better for the runtime to handle this for all backends that aren't able to handle it themselves, which I suppose is what the addition of device parameters in the corresponding PI operations would achieve.

@AerialMantis added the enhancement (New feature or request) label and removed the bug (Something isn't working) label on Mar 1, 2022
@AerialMantis
Contributor Author

Hi @steffenlarsen @romanovvlad, apologies for the delay in getting back to this; we haven't been focusing on it, but we're going to start looking at it again.

However, there may be a way to make it possible to have multiple CUDA contexts per PI context without requiring too much future-expendable code and with only minimal or no changes to PI. To do this, the CUDA backend would need to create multiple platforms, each with a collection of contexts where all the devices can access each other's memory. This ability can be queried and enabled through the CUDA driver API. When creating a PI context with multiple devices, memory allocations could either be on a central CUDA context or be distributed between the CUDA contexts (randomly, round-robin, most-available-memory-first, etc.). Whichever CUDA context you then launch a kernel on would be able to access the memory, albeit potentially slow.

In general I'm not keen on having multiple platforms for the CUDA backend as this would create a divergence in the topology mapping between the CUDA backend and other backends, which could lead to users having to special case their applications for targeting CUDA.

That said, considering this approach I'm not sure I fully understand it. If we were to have multiple platforms, each with a collection of devices/contexts, I'm not sure how this would be mapped internally. Would this mean having a single pool of devices/contexts accessible by all platforms, in which case all platforms would reflect the same devices, or would it mean each platform holding its own copy of the same set of devices/contexts? In both cases I worry this could lead to an inaccurate representation of the topology, and in the latter case it would mean duplicated context allocations.

You do mention that this would be sub-optimal and more of a stepping stone to implementing a full solution, and I could see that being useful, though I'm tempted to go directly to the fully integrated solution, even if we have to do that in several incremental stages.

Can the CUDA plugin keep a map "memory allocation" -> device, so that each piEnqueue* API checks whether the target device (associated with a queue) has the required memory allocation by looking at the map, and if not, implicitly schedules a P2P copy and updates the map? It sounds like a partial duplication of work between the SYCL RT and the CUDA plugin though.

So, introducing clEnqueueMigrateMemObjects PI API and making SYCL RT call it for moving memory between devices from the same context sounds like a good option/optimization.

This was my thinking as well; I would prefer to have peer-to-peer data movement invoked by the SYCL runtime rather than implicitly by PI CUDA, as the latter could leave the SYCL runtime without an accurate picture of the current location of data and possibly lead to additional data movement.

So if we were to implement this, I was thinking we could break it up into the following stages, which I hope could be introduced incrementally. This is just a draft, please let me know what you think.

  • Implement a PI API for clEnqueueMigrateMemObjects, which for most backends is a no-op, but for the CUDA backend calls the peer-to-peer copies implemented in [CUDA] P2P buffer/image memory copy #4401.
  • Introduce an invocation of the PI API for clEnqueueMigrateMemObjects when resolving data dependencies for a command, initially doing nothing, as non-CUDA backends implement it as a no-op and the CUDA backend still represents each device within its own platform ([SYCL][CUDA][HIP] Report every device in its own platform #4571).
  • Introduce a mechanism in the SYCL RT to track the device a memory object is currently allocated on as well as the context; this would allow the calls to the PI API for clEnqueueMigrateMemObjects to be made conditional.
  • Update PI APIs which take a context to perform an operation to also take the device associated with it (this will require some further investigation); at first this would be ignored by all backends.
  • Implement the CUDA backend to represent a context by multiple CUDA devices and contexts, with a 1:1 device-to-context mapping.
  • Implement the PI APIs for the CUDA backend which now take a context and a device to index the correct CUDA context for the specified device (see the sketch after this list).
  • Revert the change in [SYCL][CUDA][HIP] Report every device in its own platform #4571 to have the CUDA backend represent all devices in the same platform again.
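
A rough sketch of what the last two steps could look like inside the CUDA plugin (hypothetical names, not the actual PI interface), where an entry point that now receives the device uses it to pick the matching CUDA context:

#include <cuda.h>
#include <cstddef>
#include <unordered_map>

// Hypothetical per-device context lookup held by the PI CUDA context.
using device_context_map = std::unordered_map<CUdevice, CUcontext>;

// Illustrative only: an allocation entry point that takes both the context's
// device-to-CUcontext mapping and the target device, makes the matching CUDA
// context current, and then allocates on that device.
CUresult alloc_on_device(const device_context_map &contexts, CUdevice device,
                         std::size_t bytes, CUdeviceptr *out) {
  CUresult err = cuCtxSetCurrent(contexts.at(device));
  if (err != CUDA_SUCCESS)
    return err;
  return cuMemAlloc(out, bytes);
}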

@JackAKirk
Contributor

JackAKirk commented May 10, 2022

It sounds like the SYCL context : multiple CUDA contexts mapping described above matches the Level Zero equivalent (SYCL context to ze_context_handle_t) from the Level Zero backend spec: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_oneapi_backend_level_zero.md
Ideally there would be a single model across at least these two backends for doing peer-to-peer copies, if possible, and I think it would be worth clarifying this before moving forward with the CUDA side.

#6104 is relevant for this (although buffers aren't mentioned yet).

@pvchupin
Contributor

@smaslov-intel, can you comment please?

@smaslov-intel
Contributor

So, introducing clEnqueueMigrateMemObjects PI API and making SYCL RT call it for moving memory between devices from the same context sounds like a good option/optimization.

The direction we took is to "migrate" memory in the plugins without explicit SYCL RT calls. The reason for that is to avoid redundant copies in the OpenCL RT, which already performs buffer migration under the hood. The Level Zero plugin migration was initially added in #5966. Currently, migration means a copy from an up-to-date location to the device where the memory is going to be used, but in the future we'd optimize this and enable P2P access where applicable and profitable.

@JackAKirk
Contributor

Closed by #13616
