[CUDA] P2P buffer/image memory copy #4401
@@ -1637,6 +1637,15 @@ pi_result cuda_piDeviceGetInfo(pi_device device, pi_device_info param_name,
     }
     return getInfo(param_value_size, param_value, param_value_size_ret, value);
   }
+  case PI_DEVICE_INFO_P2P_READ_DEVICES: {
+    std::vector<pi_device> devs;
+    for (const auto &dev : device->get_platform()->devices_) {
+      devs.emplace_back(dev.get());
+    }
+    return getInfo(param_value_size, param_value, param_value_size_ret, devs);
+  }
   // TODO: Investigate if this information is available on CUDA.
   case PI_DEVICE_INFO_PCI_ADDRESS:

Review comment: After #4571 this will only report itself as supporting P2P, as each device has its own platform, right? If the intention is still to report all devices, since they can do explicit P2P memory copies between them even though some go through the host, I wonder if a more descriptive name would be appropriate.

Reply: Yes. After #4571 this PR no longer acts as a short-term fix for enabling direct P2P copies via the cross-context route while the CUDA PI is adapted to allow multiple CUDA contexts per SYCL context, which would enable the preferred intra-context P2P route. #4571 can only be reverted once it becomes possible to do the P2P copy between devices sharing a SYCL context. The PR still has value in the cross-context optimization you described below: if for whatever reason a user decides to split backend devices/contexts into separate SYCL contexts, then we can also optimize that case. Until the change introduced in #4571 is reverted, this PR will simply do nothing.

Reply: Yes, I can try to pick a better name. PI_DEVICE_INFO_P2P_READ_DEVICES returns a set of devices that is a superset of the set of devices capable of true P2P, since it can include cases that revert to the device-to-host-to-device route. It is also a query that is not exposed to the user, so I think a slightly longer, uglier, but more descriptive name would be OK, particularly since we don't want to confuse it with future queries: PI_DEVICE_INFO_P2P_READ_DEVICES is probably more appropriate to reserve for a hypothetical future user-facing query that tells the user whether a device can access a peer's memory directly.
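The distinction drawn in the reply above — devices a peer copy can target versus devices that support direct P2P access — maps onto the driver API's cuDeviceCanAccessPeer query. A minimal sketch of how a hypothetical helper (not part of this PR; the name supportsDirectP2P is illustrative) could filter the superset down to true P2P pairs:

    #include <cuda.h>

    // Hypothetical helper: true only when `src` can read `dst`'s memory
    // directly over P2P (e.g. NVLink or PCIe peer mappings). When this
    // returns false, a peer copy between the two devices still works,
    // but is staged through host memory.
    static bool supportsDirectP2P(CUdevice src, CUdevice dst) {
      int canAccess = 0;
      if (cuDeviceCanAccessPeer(&canAccess, src, dst) != CUDA_SUCCESS)
        return false;
      return canAccess != 0;
    }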
@@ -3648,6 +3657,19 @@ pi_result cuda_piSamplerRelease(pi_sampler sampler) {
   return PI_SUCCESS;
 }
 
+void copyRectAsserts(const pi_buff_rect_region &region,
+                     const pi_buff_rect_offset &src_offset,
+                     const pi_buff_rect_offset &dst_offset,
+                     const CUmemorytype_enum &src_type,
+                     const CUmemorytype_enum &dst_type) {
+  assert(region != nullptr);
+  assert(src_offset != nullptr);
+  assert(dst_offset != nullptr);
+
+  assert(src_type == CU_MEMORYTYPE_DEVICE || src_type == CU_MEMORYTYPE_HOST);
+  assert(dst_type == CU_MEMORYTYPE_DEVICE || dst_type == CU_MEMORYTYPE_HOST);
+}
+
 /// General 3D memory copy operation.
 /// This function requires the corresponding CUDA context to be at the top of
 /// the context stack
@@ -3660,12 +3682,7 @@ static pi_result commonEnqueueMemBufferCopyRect(
     const CUmemorytype_enum dst_type, pi_buff_rect_offset dst_offset,
     size_t dst_row_pitch, size_t dst_slice_pitch) {
 
-  assert(region != nullptr);
-  assert(src_offset != nullptr);
-  assert(dst_offset != nullptr);
-
-  assert(src_type == CU_MEMORYTYPE_DEVICE || src_type == CU_MEMORYTYPE_HOST);
-  assert(dst_type == CU_MEMORYTYPE_DEVICE || dst_type == CU_MEMORYTYPE_HOST);
+  copyRectAsserts(region, src_offset, dst_offset, src_type, dst_type);
 
   src_row_pitch = (!src_row_pitch) ? region->width_bytes + src_offset->x_bytes
                                    : src_row_pitch;
@@ -3711,6 +3728,60 @@ static pi_result commonEnqueueMemBufferCopyRect(
   return PI_CHECK_ERROR(cuMemcpy3DAsync(&params, cu_stream));
 }
 
+/// General 3D memory Peer copy operation.
+/// Similar to commonEnqueueMemBufferCopyRect with the addition that two
+/// contexts must be specified.
+static pi_result commonEnqueueMemBufferCopyRectPeer(
+    CUstream cu_stream, pi_buff_rect_region region, const void *src_ptr,
+    const CUmemorytype_enum src_type, pi_buff_rect_offset src_offset,
+    size_t src_row_pitch, size_t src_slice_pitch, void *dst_ptr,
+    const CUmemorytype_enum dst_type, pi_buff_rect_offset dst_offset,
+    size_t dst_row_pitch, size_t dst_slice_pitch, CUcontext dst_context,
+    CUcontext src_context) {
+
+  copyRectAsserts(region, src_offset, dst_offset, src_type, dst_type);
+
+  src_row_pitch = (!src_row_pitch) ? region->width_bytes : src_row_pitch;
+  src_slice_pitch = (!src_slice_pitch) ? (region->height_scalar * src_row_pitch)
+                                       : src_slice_pitch;
+  dst_row_pitch = (!dst_row_pitch) ? region->width_bytes : dst_row_pitch;
+  dst_slice_pitch = (!dst_slice_pitch) ? (region->height_scalar * dst_row_pitch)
+                                       : dst_slice_pitch;
+
+  CUDA_MEMCPY3D_PEER params = {};
+
+  params.WidthInBytes = region->width_bytes;
+  params.Height = region->height_scalar;
+  params.Depth = region->depth_scalar;
+
+  params.srcMemoryType = src_type;
+  params.srcDevice = src_type == CU_MEMORYTYPE_DEVICE
+                         ? *static_cast<const CUdeviceptr *>(src_ptr)
+                         : 0;
+  params.srcHost = src_type == CU_MEMORYTYPE_HOST ? src_ptr : nullptr;
+  params.srcXInBytes = src_offset->x_bytes;
+  params.srcY = src_offset->y_scalar;
+  params.srcZ = src_offset->z_scalar;
+  params.srcPitch = src_row_pitch;
+  params.srcHeight = src_slice_pitch / src_row_pitch;
+
+  params.dstMemoryType = dst_type;
+  params.dstDevice = dst_type == CU_MEMORYTYPE_DEVICE
+                         ? *static_cast<CUdeviceptr *>(dst_ptr)
+                         : 0;
+  params.dstHost = dst_type == CU_MEMORYTYPE_HOST ? dst_ptr : nullptr;
+  params.dstXInBytes = dst_offset->x_bytes;
+  params.dstY = dst_offset->y_scalar;
+  params.dstZ = dst_offset->z_scalar;
+  params.dstPitch = dst_row_pitch;
+  params.dstHeight = dst_slice_pitch / dst_row_pitch;
+
+  params.dstContext = dst_context;
+  params.srcContext = src_context;
+
+  return PI_CHECK_ERROR(cuMemcpy3DPeerAsync(&params, cu_stream));
+}
+
 pi_result cuda_piEnqueueMemBufferReadRect(
     pi_queue command_queue, pi_mem buffer, pi_bool blocking_read,
     pi_buff_rect_offset buffer_offset, pi_buff_rect_offset host_offset,
@@ -3845,7 +3916,17 @@ pi_result cuda_piEnqueueMemBufferCopy(pi_queue command_queue, pi_mem src_buffer,
     auto src = src_buffer->mem_.buffer_mem_.get() + src_offset;
     auto dst = dst_buffer->mem_.buffer_mem_.get() + dst_offset;
 
-    result = PI_CHECK_ERROR(cuMemcpyDtoDAsync(dst, src, size, stream));
+    if (src_buffer->context_ == dst_buffer->context_) {
+      result = PI_CHECK_ERROR(cuMemcpyDtoDAsync(dst, src, size, stream));
+    } else {
+      auto dst_context = dst_buffer->context_->get();
+      auto src_context = src_buffer->context_->get();
+
+      cuCtxEnablePeerAccess(dst_context, 0);
+
+      result = PI_CHECK_ERROR(
+          cuMemcpyPeerAsync(dst, dst_context, src, src_context, size, stream));
+    }
 
     if (event) {
       result = retImplEv->record();
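A note on cuCtxEnablePeerAccess as used above: it enables access from the current context to the peer context passed as its first argument, and every call after the first fails with CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED, which is harmless here and presumably why the result is unchecked. A more defensive wrapper might look like this sketch (not part of the PR):

    #include <cuda.h>

    // Enable access from the current context to `peer`, treating the
    // "already enabled" status as success. Real failures (e.g. the two
    // devices cannot be peers) are propagated to the caller.
    static CUresult enablePeerAccessOnce(CUcontext peer) {
      CUresult err = cuCtxEnablePeerAccess(peer, 0 /* flags must be 0 */);
      return err == CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED ? CUDA_SUCCESS
                                                           : err;
    }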
@@ -3889,11 +3970,22 @@ pi_result cuda_piEnqueueMemBufferCopyRect(
           PI_COMMAND_TYPE_MEM_BUFFER_COPY_RECT, command_queue));
       retImplEv->start();
     }
-    retErr = commonEnqueueMemBufferCopyRect(
-        cuStream, region, &srcPtr, CU_MEMORYTYPE_DEVICE, src_origin,
-        src_row_pitch, src_slice_pitch, &dstPtr, CU_MEMORYTYPE_DEVICE,
-        dst_origin, dst_row_pitch, dst_slice_pitch);
+    if (src_buffer->context_ == dst_buffer->context_) {
+      retErr = commonEnqueueMemBufferCopyRect(
+          cuStream, region, &srcPtr, CU_MEMORYTYPE_DEVICE, src_origin,
+          src_row_pitch, src_slice_pitch, &dstPtr, CU_MEMORYTYPE_DEVICE,
+          dst_origin, dst_row_pitch, dst_slice_pitch);
+    } else {
+      auto dstContext = dst_buffer->context_->get();
+      auto srcContext = src_buffer->context_->get();
+
+      cuCtxEnablePeerAccess(dstContext, 0);
+
+      retErr = commonEnqueueMemBufferCopyRectPeer(
+          cuStream, region, &srcPtr, CU_MEMORYTYPE_DEVICE, src_origin,
+          src_row_pitch, src_slice_pitch, &dstPtr, CU_MEMORYTYPE_DEVICE,
+          dst_origin, dst_row_pitch, dst_slice_pitch, dstContext, srcContext);
+    }
 
     if (event) {
       retImplEv->record();
@@ -4095,6 +4187,54 @@ static pi_result commonEnqueueMemImageNDCopy(
   return PI_INVALID_VALUE;
 }
 
+/// Similar to commonEnqueueMemImageNDCopy for Peer to Peer copies.
+static pi_result commonEnqueueMemImageNDCopyPeer(
+    CUstream cu_stream, pi_mem_type img_type, const size_t *region,
+    const void *src_ptr, const CUmemorytype_enum src_type,
+    const size_t *src_offset, void *dst_ptr, const CUmemorytype_enum dst_type,
+    const size_t *dst_offset, CUcontext dst_context, CUcontext src_context) {
+  assert(region != nullptr);
+
+  assert(src_type == CU_MEMORYTYPE_ARRAY || src_type == CU_MEMORYTYPE_HOST);
+  assert(dst_type == CU_MEMORYTYPE_ARRAY || dst_type == CU_MEMORYTYPE_HOST);
+
+  CUDA_MEMCPY3D_PEER cpyDesc;
+  memset(&cpyDesc, 0, sizeof(cpyDesc));
+  cpyDesc.srcMemoryType = src_type;
+  if (src_type == CU_MEMORYTYPE_ARRAY) {
+    cpyDesc.srcArray = *static_cast<const CUarray *>(src_ptr);
+    cpyDesc.srcXInBytes = src_offset[0];
+    cpyDesc.srcY = src_offset[1];
+    cpyDesc.srcZ = src_offset[2];
+  } else {
+    cpyDesc.srcDevice = src_type == CU_MEMORYTYPE_DEVICE
+                            ? *static_cast<const CUdeviceptr *>(src_ptr)
+                            : 0;
+    cpyDesc.srcHost = src_type == CU_MEMORYTYPE_HOST ? src_ptr : nullptr;
+  }
+  cpyDesc.dstMemoryType = dst_type;
+  if (dst_type == CU_MEMORYTYPE_ARRAY) {
+    cpyDesc.dstArray = *static_cast<CUarray *>(dst_ptr);
+    cpyDesc.dstXInBytes = dst_offset[0];
+    cpyDesc.dstY = dst_offset[1];
+    cpyDesc.dstZ = dst_offset[2];
+  } else {
+    cpyDesc.dstDevice = dst_type == CU_MEMORYTYPE_DEVICE
+                            ? *static_cast<CUdeviceptr *>(dst_ptr)
+                            : 0;
+    cpyDesc.dstHost = dst_type == CU_MEMORYTYPE_HOST ? dst_ptr : nullptr;
+  }
+  cpyDesc.WidthInBytes = region[0];
+  cpyDesc.Height = region[1];
+  cpyDesc.Depth = region[2];
+  cpyDesc.dstContext = dst_context;
+  cpyDesc.srcContext = src_context;
+
+  return PI_CHECK_ERROR(cuMemcpy3DPeerAsync(&cpyDesc, cu_stream));
+}
+
 pi_result cuda_piEnqueueMemImageRead(
     pi_queue command_queue, pi_mem image, pi_bool blocking_read,
     const size_t *origin, const size_t *region, size_t row_pitch,
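One design note on commonEnqueueMemImageNDCopyPeer: the intra-context commonEnqueueMemImageNDCopy dispatches to different driver calls by image dimensionality, but the driver API has no dedicated 2D peer-copy entry point, so the peer version routes every shape through cuMemcpy3DPeerAsync, collapsing lower dimensions. A sketch of the 2D array case under that reading (names illustrative, not from this PR):

    #include <cuda.h>
    #include <cstring>

    // Peer copy of a width x height region between CUarrays owned by two
    // different contexts. Depth = 1 expresses the 2D copy through the 3D
    // peer descriptor; pitch fields are ignored for array endpoints.
    static CUresult peerCopyArray2D(CUarray dst, CUcontext dst_ctx,
                                    CUarray src, CUcontext src_ctx,
                                    size_t width_bytes, size_t height,
                                    CUstream stream) {
      CUDA_MEMCPY3D_PEER d;
      std::memset(&d, 0, sizeof(d));
      d.srcMemoryType = CU_MEMORYTYPE_ARRAY;
      d.srcArray = src;
      d.srcContext = src_ctx;
      d.dstMemoryType = CU_MEMORYTYPE_ARRAY;
      d.dstArray = dst;
      d.dstContext = dst_ctx;
      d.WidthInBytes = width_bytes;
      d.Height = height;
      d.Depth = 1;
      return cuMemcpy3DPeerAsync(&d, stream);
    }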
@@ -4277,17 +4417,37 @@ pi_result cuda_piEnqueueMemImageCopy(pi_queue command_queue, pi_mem src_image,
     size_t bytesToCopy = elementByteSize * srcArrayDesc.NumChannels * region[0];
 
     pi_mem_type imgType = src_image->mem_.surface_mem_.get_image_type();
-    if (imgType == PI_MEM_TYPE_IMAGE1D) {
-      retErr = PI_CHECK_ERROR(cuMemcpyAtoA(dstArray, dstByteOffsetX, srcArray,
-                                           srcByteOffsetX, bytesToCopy));
+    if (src_image->context_ == dst_image->context_) {
+      if (imgType == PI_MEM_TYPE_IMAGE1D) {
+        retErr = PI_CHECK_ERROR(cuMemcpyAtoA(dstArray, dstByteOffsetX,
+                                             srcArray, srcByteOffsetX,
+                                             bytesToCopy));
+      } else {
+        size_t adjustedRegion[3] = {bytesToCopy, region[1], region[2]};
+        size_t srcOffset[3] = {srcByteOffsetX, src_origin[1], src_origin[2]};
+        size_t dstOffset[3] = {dstByteOffsetX, dst_origin[1], dst_origin[2]};
+
+        retErr = commonEnqueueMemImageNDCopy(
+            cuStream, imgType, adjustedRegion, &srcArray, CU_MEMORYTYPE_ARRAY,
+            srcOffset, &dstArray, CU_MEMORYTYPE_ARRAY, dstOffset);
+
+        if (retErr != PI_SUCCESS) {
+          return retErr;
+        }
+      }
     } else {
       size_t adjustedRegion[3] = {bytesToCopy, region[1], region[2]};
       size_t srcOffset[3] = {srcByteOffsetX, src_origin[1], src_origin[2]};
       size_t dstOffset[3] = {dstByteOffsetX, dst_origin[1], dst_origin[2]};
 
-      retErr = commonEnqueueMemImageNDCopy(
+      auto dstContext = dst_image->context_->get();
+      auto srcContext = src_image->context_->get();
+
+      cuCtxEnablePeerAccess(dstContext, 0);
+
+      retErr = commonEnqueueMemImageNDCopyPeer(
           cuStream, imgType, adjustedRegion, &srcArray, CU_MEMORYTYPE_ARRAY,
-          srcOffset, &dstArray, CU_MEMORYTYPE_ARRAY, dstOffset);
+          srcOffset, &dstArray, CU_MEMORYTYPE_ARRAY, dstOffset, dstContext,
+          srcContext);
 
       if (retErr != PI_SUCCESS) {
         return retErr;
@@ -513,6 +513,9 @@ pi_result piDeviceGetInfo(pi_device Device, pi_device_info ParamName,
     // cl_khr_fp64, cl_khr_int64_base_atomics,
     // cl_khr_int64_extended_atomics
     return ReturnValue("");
+  // P2P is currently unsupported in level zero
+  case PI_DEVICE_INFO_P2P_READ_DEVICES:
+    return ReturnValue(std::vector<pi_device>{});
 
 #define UNSUPPORTED_INFO(info) \
   case info: \

Review comment on lines +516 to +517 (with a suggested change): Makes the association clearer and more consistent with other cases. Similar applies to sycl/plugins/hip/pi_hip.cpp and sycl/plugins/level_zero/pi_level_zero.cpp.
@@ -517,7 +517,6 @@ void Command::makeTraceEventEpilog() {
 
 Command *Command::processDepEvent(EventImplPtr DepEvent, const DepDesc &Dep) {
   const QueueImplPtr &WorkerQueue = getWorkerQueue();
-  const ContextImplPtr &WorkerContext = WorkerQueue->getContextImplPtr();
 
   // 1. Async work is not supported for host device.
   // 2. Some types of commands do not produce PI events after they are enqueued
 
@@ -543,13 +542,7 @@ Command *Command::processDepEvent(EventImplPtr DepEvent, const DepDesc &Dep) {
       getType() != CommandType::HOST_TASK)
     return nullptr;
 
-  ContextImplPtr DepEventContext = DepEvent->getContextImpl();
-  // If contexts don't match we'll connect them using host task
-  if (DepEventContext != WorkerContext && !WorkerContext->is_host()) {
-    Scheduler::GraphBuilder &GB = Scheduler::getInstance().MGraphBuilder;
-    ConnectionCmd = GB.connectDepEvent(this, DepEvent, Dep);
-  } else
-    MPreparedDepsEvents.push_back(std::move(DepEvent));
+  MPreparedDepsEvents.push_back(std::move(DepEvent));
 
   return ConnectionCmd;
 }

Review comment: Sorry, it didn't click in my head the last time I reviewed this.

Reply: I wanted to check with you whether ConnectionCmd here has any use cases that I am not aware of before making further changes in this WIP PR. Making ConnectionCmd always nullptr doesn't break any tests using the CUDA backend. If ConnectionCmd has no purpose, then I can change the function to return void, but this requires quite a few cascading changes to other functions that would then also return void.

Reply: I've investigated what happens when I also make … I do not know whether the test coverage is simply insufficient (presumably there should be cases checking that …). Currently the optimizations do not do anything anyway due to #4571, but #4571 is a temporary fix and is expected to be reverted eventually. It would still be useful to merge this PR, so that once #4571 is reverted the optimisation will be immediately switched on.
Review comment: Could you please add a description for this here?