Skip to content

Memcopy between devices sharing the same context doesn't work #3583

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TApplencourt opened this issue Apr 20, 2021 · 3 comments
Closed

Memcopy between devices sharing the same context doesn't work #3583

TApplencourt opened this issue Apr 20, 2021 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@TApplencourt
Copy link
Contributor

TApplencourt commented Apr 20, 2021

Describe the bug

Memcopy between devices sharing the same context doesn't work. This bug is only triggered by a platform that contains multiple devices.

This is not quite a common pattern, but this will be in a near future.
Bug confirm with OpenCL and LevelZero backend

$clinfo -l
Platform #0: Intel(R) OpenCL
 +-- Device #0: Intel(R) [0x4905]
 +-- Device #1: Intel(R)  [0x4905]
 +-- Device #2: Intel(R) [0x4905]
 `-- Device #3: Intel(R) [0x4905]

To Reproduce

$cat sycl_supercontext.cpp
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
#include <numeric>

/* Send data in a circular manner in all devices of a GPU platform */

constexpr int N=10;

int main() {
  sycl::gpu_selector device_selector;
  sycl::platform platform(device_selector);

  std::vector<sycl::device> gpu_devices = platform.get_devices() ;
  const int N_devices = gpu_devices.size() ;
  if ( gpu_devices.size() <= 1 ) {
      std::cerr << "Warning, only one device on this platform."
                << " This test is not testing P2P" << std::endl;
  }

  sycl::context super_ctxt = sycl::context(gpu_devices);
  std::vector<sycl::queue> sycl_queues;
  std::vector<float *> sycl_ptr;

  // Alocate memory to each device. Each queue share the same context
  for (int i = 0; i < N_devices; i++) {
    sycl_queues.push_back(sycl::queue(super_ctxt, gpu_devices[i]) );
    float * ptr = sycl::malloc_device<float>(N, sycl_queues[i]);
    sycl_ptr.push_back(ptr);
  }

  // Allocate host data, and set the value
  std::vector<float> host_ptr(N);
  std::iota(host_ptr.begin(), host_ptr.end(), 0);

  // Copy the data to the first device, and 0ed the host data
  sycl_queues[0].memcpy(sycl_ptr[0], host_ptr.data(), N*sizeof(float)).wait();
  std::memset(host_ptr.data(), 0, N*sizeof(float));

  // The circular exchange
  for (int i = 0; i < N_devices; i++) {
     float * src = sycl_ptr[i];
     float * dst = sycl_ptr[ ( i+1 ) % N_devices ];
     sycl_queues[i].memcpy(dst, src, sizeof(float)*N).wait();
  }

  // Copy back data and check for correctness
  sycl_queues[0].memcpy(host_ptr.data(), sycl_ptr[0], N*sizeof(float)).wait();
  for( int i = 0; i < N; i++) {
    assert( host_ptr[i] == i );
  }
  return 0;
}
$dpcpp sycl_supercontext.cpp
$./a.out
Segmentation fault (core dumped)

The OpenCL command who segfault is clEnqueueMemcpyINTEL or zeCommandListAppendMemoryCopy

Environment (please complete the following information):

  • OS: Linux
  • Target device and vendor: Intel GPUs
  • DPC++ version: Intel(R) oneAPI DPC++/C++ Compiler 2021.2.0 (2021.x.0.20210323)
  • Dependencies version: .igc/release/master-2021.03.29 / intel_compute_runtime/release/master-2021.03.29
@TApplencourt TApplencourt added the bug Something isn't working label Apr 20, 2021
@abagusetty
Copy link
Contributor

Thanks @TApplencourt, just commenting for tracking

@TApplencourt
Copy link
Contributor Author

TApplencourt commented Aug 2, 2021

In https://intel.github.io/llvm-docs/MultiTileCardWithLevelZero.html#memory, On the malloc_device you have this sentence: Allocation can only be accessed by the specified device but not by other devices in the context nor by host.

In malloc_shared I saw No explicit copy is necessary for synchronizing between the host and the device, but it is needed for other devices in the context.

This explains why the code I put below is not working, indeed it's using alloc_device. When using malloc_shared the code work.

It looks like the Specification is a little vague about the behavior of alloc_device in the case of super-context. If using a P2P access in property list will allow another device in the same context to use this memory?

EDIT: Table 99: Characteristics of the different kinds of USM allocation of the sycl 2020 specification, give more information about the behavior. malloc_device require P2P where malloc_shared doesn't.

@TApplencourt
Copy link
Contributor Author

TApplencourt commented Aug 2, 2021

It was an error in my code. I will re-open when the P2P property will be implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants