[CUDA] P2P buffer/image memory copy #4401

Closed
wants to merge 26 commits into from
Changes from 9 commits
26 commits
4a71379
Implemented P2P copies for the cuda backend using buffers.
Aug 12, 2021
c771093
Switched to using vendor name in P2P info query.
Aug 13, 2021
b4a99ff
Merge branch 'intel:sycl' into P2P
JackAKirk Aug 13, 2021
b38d5e2
Merge branch 'intel:sycl' into P2P
JackAKirk Aug 23, 2021
6abe9fb
Corrected the scoped context in guessLocalWorkSize to prevent stale c…
Aug 23, 2021
a3c251e
Added binary device query for P2P memcpy instead of platform query.
JackAKirk Aug 25, 2021
3cd6911
Corrected formatting.
Aug 25, 2021
c384fbe
Corrected Formatting.
Aug 25, 2021
27c073d
Placed new PI API's after piTearDown.
Aug 25, 2021
60f276d
Renamed piextP2P as piextDevicesSupportP2P.
Aug 25, 2021
835e5c4
Made check that devices backends match before P2P query.
Aug 25, 2021
9e3ef85
Corrected formating in graph_builder.cpp.
Aug 25, 2021
0f36b94
Merge branch 'intel:sycl' into P2P
JackAKirk Aug 26, 2021
0d819c0
Replaced binary device query with device_info call returning a vector…
Aug 26, 2021
43f970c
Removed piext Peer functions, replaced them with existing PI copy calls.
Aug 31, 2021
9b9a21f
Fixed formatting issues following previous commit.
Aug 31, 2021
52cae9f
P2P copies made for 1D image arrays again.
Sep 1, 2021
24b240a
Return retError from call to commonEnqueueMemImageNDCopyPeer.
Sep 1, 2021
99e58f1
Removed all changes to memory_manager.cpp:
Sep 2, 2021
fd15910
Superficial change to make the memory_manager.cpp diff empty.
Sep 2, 2021
ccba446
Applied stylistic/general improvements.
Sep 21, 2021
0f95aa7
Merge branch 'sycl' into P2P
JackAKirk Sep 23, 2021
7057849
Reverted change to guessLocalWorkSize: unnecessary since #4606.
Sep 23, 2021
9d5a84a
Merge branch 'intel:sycl' into P2P
JackAKirk Sep 30, 2021
b64484d
Implemented PI_DEVICE_INFO_P2P_READ_DEVICES piDeviceGetInfo case in o…
JackAKirk Oct 1, 2021
ebfdb2a
Merge branch 'intel:sycl' into P2P
JackAKirk Oct 1, 2021
5 changes: 5 additions & 0 deletions sycl/include/CL/sycl/detail/pi.def
@@ -137,4 +137,9 @@ _PI_API(piextPluginGetOpaqueData)

_PI_API(piTearDown)

_PI_API(piextEnqueueMemBufferCopyPeer)
_PI_API(piextEnqueueMemBufferCopyRectPeer)
_PI_API(piextEnqueueMemImageCopyPeer)
Contributor

Why do we need these new APIs? Why wouldn't the regular copy API perform P2P copies transparently under the hood?

Contributor Author

The regular API takes a single pi_queue whereas the Peer API requires a second queue as an argument (principally so that the second context is known). The regular API is an OpenCL interface so cannot be changed. I think that a single API could be used if the new piext*** API was used in the runtime to replace the regular copy API.

Contributor

> the Peer API requires a second queue as an argument (principally so that the second context is known).

The pi_mem src & dst are created with interfaces that have context, e.g. piextUSMDeviceAlloc or piMemBufferCreate, so backends already know the context of both src and dst, and can act on that.
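To make the point above concrete, here is a minimal sketch of how a backend could decide between a regular copy and a peer copy from the memory objects alone. The `Context`, `Mem`, and `chooseCopyPath` names are hypothetical stand-ins invented for illustration; they are not the real PI or pi_cuda definitions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the PI handle types (assumption: each memory
// object remembers the context it was created in, as described above).
struct Context {
  int device_id;
};

struct Mem {
  const Context *context;
  std::size_t size;
};

enum class CopyPath { SameContext, Peer };

// Pick the copy path by comparing the contexts recorded in src and dst.
// No second queue is needed to discover the destination context.
inline CopyPath chooseCopyPath(const Mem &src, const Mem &dst) {
  assert(src.context != nullptr && dst.context != nullptr);
  return src.context == dst.context ? CopyPath::SameContext : CopyPath::Peer;
}
```

With this shape, a single-queue entry point such as the regular copy API could branch internally on the result instead of exposing a separate Peer entry point.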

Contributor Author

OK, I missed that when I looked at pi_mem. There is another reason for providing both queues, illustrated by this snippet from
cuda_piextEnqueueMemBufferCopyRectPeer in pi_cuda.cpp (line 4054):

  try {
    ScopedContext active(dst_queue->get_context());
    if (event_wait_list) {
      retErr = cuda_piEnqueueEventsWait(src_queue, num_events_in_wait_list,
                                        event_wait_list, nullptr);
    }

    if (event) {
      retImplEv = std::unique_ptr<_pi_event>(_pi_event::make_native(
          PI_COMMAND_TYPE_MEM_BUFFER_COPY_RECT, dst_queue));
      retImplEv->start();
    }

We wait on events associated with the source queue and return the event associated with the destination queue.
Returning the event associated with the src queue caused problems. Since the contexts can be found from pi_mem, I will look at this again and see if there is a way of doing things without the second queue argument.
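The event handling described here can be modelled in isolation. The following is only a sketch with hypothetical `Queue`, `Event`, and `enqueuePeerCopySketch` types; it mirrors the ordering in the snippet above (dependencies are waited on through the source queue, the returned event belongs to the destination queue) and does not call any real PI or CUDA function.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Event;

// Hypothetical queue: records which events it has been asked to wait on.
struct Queue {
  std::vector<const Event *> waited_on;
};

// Hypothetical event: remembers the queue it was created for.
struct Event {
  const Queue *owner;
};

// Model of the two-queue pattern: wait on the dependencies via src_queue,
// then hand back an event that is tied to dst_queue.
inline std::unique_ptr<Event>
enqueuePeerCopySketch(Queue &src_queue, Queue &dst_queue,
                      const std::vector<const Event *> &wait_list) {
  for (const Event *dep : wait_list) {
    assert(dep != nullptr);
    src_queue.waited_on.push_back(dep); // stand-in for the wait on src_queue
  }
  // The actual copy would be issued here.
  return std::unique_ptr<Event>(new Event{&dst_queue});
}
```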

Contributor Author

@JackAKirk Aug 27, 2021

It turns out that passing a single queue to the PI call, acting as the command_queue, is sufficient, and either the src_queue or the dst_queue can play that role.

My current understanding is that the PI should handle all implementation details of implicit peer-to-peer copies of buffer memory between devices that share a SYCL context. The only implicit peer-to-peer copy case that the runtime should handle (via memory_manager) is therefore the cross-context case.

I will implement the peer-to-peer copy via a call to piEnqueueMemBufferCopy from memory_manager, as suggested.
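As an illustration of the single-queue idea, the sketch below copies between two mock buffers that live in different contexts, using whichever queue the caller supplies as the command queue. All names (`Context`, `Buffer`, `Queue`, `copyBufferSingleQueue`) are hypothetical, and the rule that the command queue must belong to the source or destination context is an assumption made for this example, not something stated in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct Context {
  int device_id;
};

// Hypothetical buffer: host memory standing in for a device allocation.
struct Buffer {
  const Context *ctx;
  std::vector<char> bytes;
};

struct Queue {
  const Context *ctx;
  int commands_enqueued;
};

// Copy `size` bytes from src to dst using one command queue.
// Returns false on an out-of-range request or an unrelated queue.
inline bool copyBufferSingleQueue(Queue &command_queue, const Buffer &src,
                                  Buffer &dst, std::size_t src_offset,
                                  std::size_t dst_offset, std::size_t size) {
  if (src_offset > src.bytes.size() || size > src.bytes.size() - src_offset)
    return false;
  if (dst_offset > dst.bytes.size() || size > dst.bytes.size() - dst_offset)
    return false;
  // Assumption: either the source or the destination queue may issue the copy.
  if (command_queue.ctx != src.ctx && command_queue.ctx != dst.ctx)
    return false;
  std::memcpy(dst.bytes.data() + dst_offset, src.bytes.data() + src_offset,
              size);
  ++command_queue.commands_enqueued;
  return true;
}
```

The data that ends up in `dst` does not depend on which of the two queues issued the command, which is why the second queue argument can be dropped in this model.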

Contributor Author

@JackAKirk Sep 1, 2021

I have now made the changes that I described above, implementing the peer to peer copy via a call to piEnqueueMemBufferCopy from memory_manager as suggested.

_PI_API(piextP2P)

#undef _PI_API
27 changes: 27 additions & 0 deletions sycl/include/CL/sycl/detail/pi.h
@@ -1048,6 +1048,19 @@ __SYCL_EXPORT pi_result piQueueGetInfo(pi_queue command_queue,
void *param_value,
size_t *param_value_size_ret);

__SYCL_EXPORT pi_result piextEnqueueMemBufferCopyPeer(
pi_queue src_queue, pi_mem src_buffer, pi_queue dst_queue,
pi_mem dst_buffer, size_t src_offset, size_t dst_offset, size_t size,
pi_uint32 num_events_in_wait_list, const pi_event *event_wait_list,
pi_event *event);

/// p2p is set true if PI API's,
/// piextEnqueueMemBufferCopyPeer/piextEnqueueMemBufferCopyRectPeer/piextEnqueueMemImageCopyPeer,
/// for peer to peer memory copy may be called.
///
__SYCL_EXPORT pi_result piextP2P(pi_device src_device, pi_device dst_device,
bool *p2p);

__SYCL_EXPORT pi_result piQueueRetain(pi_queue command_queue);

__SYCL_EXPORT pi_result piQueueRelease(pi_queue command_queue);
@@ -1452,6 +1465,14 @@ __SYCL_EXPORT pi_result piEnqueueMemBufferCopyRect(
pi_uint32 num_events_in_wait_list, const pi_event *event_wait_list,
pi_event *event);

__SYCL_EXPORT pi_result piextEnqueueMemBufferCopyRectPeer(
pi_queue src_queue, pi_mem src_buffer, pi_queue dst_queue,
pi_mem dst_buffer, pi_buff_rect_offset src_origin,
pi_buff_rect_offset dst_origin, pi_buff_rect_region region,
size_t src_row_pitch, size_t src_slice_pitch, size_t dst_row_pitch,
size_t dst_slice_pitch, pi_uint32 num_events_in_wait_list,
const pi_event *event_wait_list, pi_event *event);

__SYCL_EXPORT pi_result
piEnqueueMemBufferFill(pi_queue command_queue, pi_mem buffer,
const void *pattern, size_t pattern_size, size_t offset,
@@ -1477,6 +1498,12 @@ __SYCL_EXPORT pi_result piEnqueueMemImageCopy(
pi_image_region region, pi_uint32 num_events_in_wait_list,
const pi_event *event_wait_list, pi_event *event);

__SYCL_EXPORT pi_result piextEnqueueMemImageCopyPeer(
pi_queue command_queue, pi_mem src_image, pi_queue dst_queue,
pi_mem dst_image, pi_image_offset src_origin, pi_image_offset dst_origin,
pi_image_region region, pi_uint32 num_events_in_wait_list,
const pi_event *event_wait_list, pi_event *event);

__SYCL_EXPORT pi_result
piEnqueueMemImageFill(pi_queue command_queue, pi_mem image,
const void *fill_color, const size_t *origin,