
[RFC] Enable Intel GPU support in torchcodec (pytorch xpu backend device) #559

Open
dvrogozh opened this issue Mar 13, 2025 · 6 comments · May be fixed by #558
dvrogozh commented Mar 13, 2025

This is an RFC issue for the following PR:

The PR enables Intel GPU support in torchcodec by adding ffmpeg-vaapi decoding and connecting it with the pytorch XPU device backend.

There are a few items I would like to bring up for discussion in this RFC issue.

Selection of ffmpeg backend. There are a few backends available for Intel GPU:

  • ffmpeg-vaapi
  • ffmpeg-dx11/dx12
  • ffmpeg-qsv (libvpl based one)
  • ffmpeg-vulkan

None of the above media backends is preferable in the sense of easier context/memory sharing. Intel media APIs, drivers and libraries do not directly intersect with the Intel compute stack. In particular, Unified Shared Memory pointers from Intel's compute stack are not recognized by the media APIs and cannot be accepted directly. This means that memory sharing between media and compute must (at the moment) be done via lower-level APIs such as DMA fds on Linux and NT handles on Windows, which introduces OS-specific dependencies.

I suggest considering ffmpeg-vaapi for Linux and ffmpeg-dx12 for Windows to enable Intel GPUs. These backends are based on Intel media driver APIs. ffmpeg-qsv is based on a higher-level library (libvpl) and does not let us avoid the vaapi/dx dependency, since we will in any case need those APIs to get to the underlying surface memory. I also think that ffmpeg-vaapi is used by AMD GPUs, so adding VAAPI support can help there as well. (See also #444)

ffmpeg-vulkan might be interesting due to eventual cross-vendor capabilities. However, 1) media support in Vulkan is relatively new and I am not sure that all required features are available (for example, color space conversion to RGBA), and 2) media support in the Vulkan driver is a community effort for Intel GPUs rather than an Intel effort, with a different implementation compared to the Intel media driver. Overall, I think ffmpeg-vulkan might be an interesting next step after enabling torchcodec with ffmpeg-vaapi.

Selection of color conversion algorithm. Under the current torchcodec architecture, color conversion of the decoded output (typically NV12) to RGB24 is needed. In the current implementation I chose to implement the color conversion directly on VAAPI, since that is fairly quick and trivial, and I believe it is good enough for a first implementation. I do suggest considering ffmpeg-vaapi for color space conversion going forward, but this will require additional effort on top of the PR I currently provide. A couple of notes on the conversion:

  • Current Intel media APIs do not support RGB24, since it is considered a suboptimal format (due to odd alignment); they support only RGB32 formats. For that reason I added slicing of the RGB32 output surface and copying it to the final output surface.
  • Different conversion algorithms may give different results. The current tests rely on per-pixel absolute/relative difference. To handle the bigger difference between the converted output and what CUDA produces, I used a PSNR metric through torcheval for the test checks (changed only for Intel GPU).
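The two notes above can be illustrated with a small host-side sketch. This is not code from the PR; it uses NumPy arrays as a stand-in for the actual surfaces, and the function names are hypothetical:

```python
import numpy as np

def rgb32_to_rgb24(rgb32: np.ndarray) -> np.ndarray:
    # RGB32 surfaces carry a padding/alpha byte per pixel (H x W x 4).
    # Slicing off the last channel and compacting the memory yields
    # the RGB24 (H x W x 3) layout the decoder is expected to return.
    return np.ascontiguousarray(rgb32[..., :3])

def psnr(a: np.ndarray, b: np.ndarray, peak: float = 255.0) -> float:
    # Peak signal-to-noise ratio in dB; higher means the frames are closer.
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0.0 else 10.0 * np.log10(peak * peak / mse)
```

A per-pixel absolute/relative difference check fails on small, broadly distributed conversion differences, whereas a PSNR threshold (e.g. requiring some minimum dB against the CUDA reference) tolerates them while still catching gross conversion errors.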

Planned changes.

CC: @scotts, @NicolasHug, @EikanWang

scotts commented Mar 22, 2025

@dvrogozh, thanks for the interest in TorchCodec! Supporting multiple accelerators is exciting and something that will benefit our users. At the moment, I think the best way forward is to host this XPU support in a separate repo as a dynamically loadable module.

What your PR clearly demonstrates is that TorchCodec is not quite ready for that - it's close, but not quite there. That is, you needed to change our DeviceInterface itself, change the CPUOnlyDevice, and modify the device dispatch logic inside of the VideoDecoder. I think the ideal situation is that an external XPU-only repo would be able to build a dynamically loadable shared library that just implements a new XpuDevice and does not need to modify any code in the current TorchCodec repo.

In order to get there, I think we need to take three steps:

  1. Make what is currently DeviceInterface truly generic. Its API should not have any single-device-specific APIs. This means that "CPU", "CUDA" and "XPU" should not appear in any of the names. The API should be sufficient to cover all device-specific things that need to happen. Since our N=3 now, we should have confidence we can define such an interface.
  2. Move the CPU code that is currently baked into VideoDecoder into CPUDevice.
  3. Modify the C++ VideoDecoder so that it dispatches to the appropriate implementation of DeviceInterface at runtime. That is, we should not have any more if device.type() == torch::kDEVICE checks in the decoder code. All places where we currently do that should look more like currentDevice->doDeviceSpecificThing().
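The dispatch pattern in step 3 can be sketched in a few lines. This is a Python stand-in for the C++ design, not TorchCodec code; the class, registry and method names are hypothetical:

```python
from abc import ABC, abstractmethod

class DeviceInterface(ABC):
    """Generic per-device hook; no device name appears in the API itself."""

    @abstractmethod
    def convert_frame(self, frame: str) -> str: ...

# Registry populated by each device implementation. A dynamically loaded
# module could register its own implementation here at load time.
_DEVICE_REGISTRY: dict = {}

def register_device(device_type: str, impl) -> None:
    _DEVICE_REGISTRY[device_type] = impl

def create_device(device_type: str) -> DeviceInterface:
    # Runtime dispatch: look up the implementation instead of branching
    # on the device type at every call site.
    try:
        return _DEVICE_REGISTRY[device_type]()
    except KeyError:
        raise ValueError(f"no DeviceInterface registered for {device_type!r}")

class CpuDevice(DeviceInterface):
    def convert_frame(self, frame: str) -> str:
        return f"cpu-converted:{frame}"

register_device("cpu", CpuDevice)
```

The decoder then holds a single DeviceInterface handle and calls `convert_frame()` with no per-device branching; adding XPU support reduces to registering one more implementation.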

I'm happy to work with you on the changes we'll need to make to the TorchCodec repo to enable a separate XPU device repo. Let me know if that's work you're interested in tackling!

@dvrogozh
Contributor Author

@scotts : thank you for the feedback. I will be glad to work with you on preparing TorchCodec to support multiple accelerators - the steps you highlight in 1-3) and others. Let's try to take the next steps next week, taking and refactoring pieces from the PR I provided.

However, I do disagree with the point that other accelerators' specific code should be placed in a separate repository. We have had this experience at Intel with a few other projects, including IPEX (a torch plugin for Intel GPU support), and it is clearly understood that this path has a number of issues. Such an external project goes out of sync with the upstream project, becomes difficult to maintain, and, most importantly, is not appreciated by end users, who consider the additional steps to install a plugin unnecessary complexity.

Ultimately we are looking for a way to bring Intel GPU support right into upstream TorchCodec itself, in a similar way to how it is now added into upstream pytorch via the XPU backend. We (Intel) will back this effort with the Intel-specific aspects of CI and packaging. I hope we can discuss what the best way towards achieving this goal would be. Can you please share your opinion on the feasibility of this approach? I believe the 1-3) steps you shared above would be needed anyway. The part that probably needs discussion is the choice of ffmpeg backend, since there are a few available, each with pros and cons. And there might be other technical requirements that you might wish to impose for multi-accelerator support.

scotts commented Mar 28, 2025

@dvrogozh, as your PR is quite large, I think it may be easiest to start fresh. I'd also like to create a new issue focused on making our device interface generic and dynamically loadable. Let me know if you feel you have enough of an understanding of what needs to happen to start making progress, or if it would be valuable for me to sketch things out more.

I agree that a separate repo adds friction for users who want to use TorchCodec and XPU acceleration. However, we don't have much signal right now how much our users want XPU acceleration. I think that a separate repo is the best way to get that signal: if a lot of users start complaining about that friction, then we know it's wanted! Once we have clear demand, we can work on the best way to upstream XPU support directly in TorchCodec.

@dvrogozh
Contributor Author

@scotts : I am working on the 1st item from your list (device-agnostic DeviceInterface) at the moment. I will open another PR dedicated to that (and only that, no XPU) once I am done. I hope to do that either today or, worst case, on Monday next week.

@dvrogozh
Contributor Author

I'd also like to create a new issue focused on making our device interface generic

Filed:

I will open another PR dedicated to that once I am done.

@scotts : PR filed. Please take a look:

dvrogozh commented Apr 4, 2025

@scotts : I am working on the 1st item from your list (device agnostic DeviceInterface) at the moment.

Finished with #606 (device-agnostic DeviceInterface). Next steps:
