Allow option to use the swscale library for color conversion instead of filtergraph #205

Merged: 33 commits into pytorch:main on Sep 26, 2024

Conversation

@ahmadsharif1 (Contributor) commented on Sep 9, 2024:

This PR does the following:

  1. Refactor the VideoDecoder.cpp code so we can decouple frame allocation from decoding and color-conversion.
  2. Allow the option of using sws_scale for color conversion instead of filtergraph. sws_scale is faster than filtergraph at color-converting smaller videos. Moreover, filtergraph's setup and teardown overhead is much larger than sws_scale's.
  3. benchmark_decoders.py: Added some code to the benchmark to print more detailed stats if a flag is turned on.
  4. benchmark_decoders.py: Added code to pass in options to torchcodec decoders.
  5. Use bilinear interpolation whenever we are scaling. I will test this against torchvision's bilinear scaling in a subsequent diff.
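The choice between the two libraries can be sketched as a simple heuristic. The helper below is hypothetical Python for illustration only (the real logic lives in VideoDecoder.cpp); the multiple-of-32 width check mirrors this PR's original title and the swscale alignment issue discussed in the review below:

```python
def choose_color_conversion_library(width: int) -> str:
    """Hypothetical helper: swscale mishandles widths that do not meet
    its alignment requirements, so fall back to filtergraph for those.
    The multiple-of-32 check mirrors the original PR title."""
    if width % 32 == 0:
        return "swscale"
    return "filtergraph"

# The 96x96 and 640x360 benchmark videos both qualify for swscale:
assert choose_color_conversion_library(96) == "swscale"
assert choose_color_conversion_library(640) == "swscale"
assert choose_color_conversion_library(100) == "filtergraph"
```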

Benchmark results show performance improvement:

python benchmarks/decoders/benchmark_decoders.py --bm_video_paths=/home/ahmads/personal/lerobot/benchmarks/video/outputs/video_benchmark/videos/libx264_yuv420p_None_None/lerobot_pusht_image.mp4,/home/ahmads/personal/lerobot/benchmarks/video/outputs/video_benchmark/videos/libx265_yuv420p_None_None/lerobot_pusht_image.mp4,/home/ahmads/jupyter/carmel1.mp4 --decoders=torchvision,tcoptions,tcoptions:color_conversion_library=swsscale+num_threads=1,tcoptions:color_conversion_library=filtergraph+num_threads=1,tcbatchoptions,tcbatchoptions:num_threads=1+color_conversion_library=swsscale,tcbatchoptions:num_threads=1+color_conversion_library=filtergraph --bm_video_speed_min_run_seconds=10
[ video=/home/ahmads/personal/lerobot/benchmarks/video/outputs/video_benchmark/videos/libx264_yuv420p_None_None/lerobot_pusht_image.mp4 h264 96x96, 16.1s 10.0fps ]
                                                                                     |  10 seek()+next()  |  1 next()  |  10 next()
1 threads: ------------------------------------------------------------------------------------------------------------------------
      TorchcodecNonCompiled:color_conversion_library=filtergraph+num_threads=1       |        12.5        |    4.5     |     5.2   
      TVNewAPIDecoderWithBackendVideoReader                                          |        66.8        |    4.4     |     5.8   
      TorchcodecNonCompiled:color_conversion_library=swsscale+num_threads=1          |         8.9        |    1.1     |     1.8   
      TorchCodecNonCompiledBatch:num_threads=1+color_conversion_library=filtergraph  |        12.6        |    4.9     |     5.8   
      TorchCodecNonCompiledBatch:num_threads=1+color_conversion_library=swsscale     |         8.7        |    1.5     |     2.1   

Times are in milliseconds (ms).

[ video=/home/ahmads/personal/lerobot/benchmarks/video/outputs/video_benchmark/videos/libx265_yuv420p_None_None/lerobot_pusht_image.mp4 hevc 96x96, 16.1s 10.0fps ]
                                                                                     |  10 seek()+next()  |  1 next()  |  10 next()
1 threads: ------------------------------------------------------------------------------------------------------------------------
      TorchcodecNonCompiled:color_conversion_library=filtergraph+num_threads=1       |        13.5        |    4.2     |     5.5   
      TVNewAPIDecoderWithBackendVideoReader                                          |        68.7        |    3.8     |     5.2   
      TorchcodecNonCompiled:color_conversion_library=swsscale+num_threads=1          |         9.7        |    1.0     |     1.8   
      TorchCodecNonCompiledBatch:num_threads=1+color_conversion_library=filtergraph  |        13.9        |    4.6     |     5.6   
      TorchCodecNonCompiledBatch:num_threads=1+color_conversion_library=swsscale     |         9.2        |    1.4     |     2.0   

Times are in milliseconds (ms).

[------------------------------- video=/home/ahmads/jupyter/carmel1.mp4 h264 640x360, 1.3s 30.0fps -------------------------------]
                                                                                     |  10 seek()+next()  |  1 next()  |  10 next()
1 threads: ------------------------------------------------------------------------------------------------------------------------
      TorchcodecNonCompiled:color_conversion_library=filtergraph+num_threads=1       |        44.6        |    9.2     |     27.5  
      TVNewAPIDecoderWithBackendVideoReader                                          |       255.2        |    9.7     |     28.3  
      TorchcodecNonCompiled:color_conversion_library=swsscale+num_threads=1          |        40.8        |    4.5     |     14.0  
      TorchCodecNonCompiledBatch:num_threads=1+color_conversion_library=filtergraph  |        37.4        |    8.7     |     17.9  
      TorchCodecNonCompiledBatch:num_threads=1+color_conversion_library=swsscale     |        31.7        |    4.0     |     12.5  

Times are in milliseconds (ms).

[ video=/home/ahmads/personal/lerobot/benchmarks/video/outputs/video_benchmark/videos/libx264_yuv420p_None_None/lerobot_pusht_image.mp4 h264 640x360, 1.3s 30.0fps ]
                             |  create()+next()
1 threads: ------------------------------------
      TorchcodecNonCompiled  |        11.6     

Times are in milliseconds (ms).
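The --decoders specs used in the benchmark command above pack per-decoder options into strings of the form `<decoder>[:<key>=<value>[+<key>=<value>...]]`. A hypothetical parser for that format (the function name is illustrative, not the benchmark's actual code):

```python
def parse_decoder_spec(spec: str):
    """Split a spec like 'tcoptions:color_conversion_library=swsscale+num_threads=1'
    into a decoder name and an options dict. Illustrative only."""
    decoder, _, opts = spec.partition(":")
    options = dict(pair.split("=", 1) for pair in opts.split("+")) if opts else {}
    return decoder, options

name, opts = parse_decoder_spec(
    "tcoptions:color_conversion_library=swsscale+num_threads=1")
# name == "tcoptions"
# opts == {"color_conversion_library": "swsscale", "num_threads": "1"}
```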

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Sep 9, 2024
@ahmadsharif1 changed the title from "Use sws_scale for color conversion when the width is a multiple of 32" to "Allow option to use sws_scale for color conversion" on Sep 10, 2024
@ahmadsharif1 marked this pull request as ready for review on September 10, 2024 at 20:44
@ahmadsharif1 requested a review from scotts on September 10, 2024 at 20:44
@NicolasHug (Member) left a comment:
Thanks for the PR, I made a quick review. I have further questions / discussion points that I'll take offline for now

auto allocateToConvertDone =
    std::chrono::duration_cast<std::chrono::microseconds>(
        convertDone - allocateDone);
auto total = std::chrono::duration_cast<std::chrono::microseconds>(
@NicolasHug (Member):

Do we want such benchmarking code within the decoder? I think it makes the code significantly harder to read.

@ahmadsharif1 (Contributor, Author):

I will refactor this in a different diff. It's probably not that bad; a self-profiling decoder is great, imo.

@NicolasHug (Member):

Things tend to look more obvious in the eyes of the code author :)

In this block there are 12 statements, and 5 of those are dedicated to the core logic. The 7 remaining ones are about benchmarking.

Do we expect such benchmark info to be forever useful, or is this more of a one-time thing?

Before leaving this for a follow-up diff, I would suggest protecting the benchmarking logic within #ifdef blocks, so that it can easily be collapsed in the IDE to ease code review and reading.

@ahmadsharif1 (Contributor, Author):

I think the benchmark would be useful forever.

I think #ifdefs may make it even harder to read.

Later on I plan to add a timing class that will have its own code #ifdefs commented out for non-developer builds.

Should I really add #ifdefs in this diff? Leaving this unresolved.

@scotts (Contributor):

I actually agree with @NicolasHug that this code is now hard to read. My reasons, some of which are the same as what Nicolas said:

  1. The timing code takes up more lines than the actual logic.
  2. Because we're using auto as the type, I have to actually read each timing line to realize it's a timing line. I can't just skim over them and focus on the actual logic. (Unfortunately, explicitly stating the return type would be even noisier.)
  3. We're not using any new lines to group code together. "Paragraphs" of code would help me figure out what I can ignore as timing, as I expect a paragraph of code to have a start and stop timing call.

This timing code is very granular. I understand that in this PR, the amount of time just a few statements took was key to getting a big perf win, but I don't think that necessarily means we should keep it all the time. I consider it similar to putting in print statements when debugging. Some print statements graduate to actual logging, but most don't, and we delete most debugging print statements at commit time.

I would prefer either:

  1. We remove the timing code in this PR, and wait until we have a readable abstraction to commit it. OR:
  2. We limit the timing code to function boundaries. Yes, that will not be enough to fully diagnose perf problems, but it's enough to know where to start digging.
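Option 2, timing at function boundaries only, could look like this minimal sketch. It is a hypothetical Python decorator (the decoder's timing code is C++); it only illustrates the granularity trade-off of timing whole calls instead of individual statements:

```python
import functools
import time

def timed(fn):
    """Hypothetical decorator: record how long whole calls take,
    instead of timing individual statements inside the function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    return wrapper

@timed
def decode_frame():
    time.sleep(0.001)  # stand-in for real decoding work
    return "frame"

decode_frame()
# decode_frame.last_elapsed now holds the call's wall-clock duration
```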

@ahmadsharif1 (Contributor, Author):

Disagree and commit. Removed those lines.

Comment on lines 422 to 425
// There is a bug in sws_scale where it doesn't handle non-multiple of 16
// widths.
// https://stackoverflow.com/questions/74351955/turn-off-sw-scale-conversion-to-planar-yuv-32-byte-alignment-requirements
// In that case we are forced to use a filtergraph to do the color conversion.
@NicolasHug (Member):

Did that bug affect the benchmarks in any way?
Also, at what level do we want the user to know: docs, warnings, error?

@ahmadsharif1 (Contributor, Author):

At the moment we won't warn anyone since there are no users. Ideally we don't warn and just do the right thing under the AUTO enum.

@ahmadsharif1 changed the title from "Allow option to use sws_scale for color conversion" to "Allow option to use the swscale library for color conversion instead of filtergraph" on Sep 11, 2024
@@ -325,7 +339,7 @@ void VideoDecoder::initializeFilterGraphForStream(
     width = *options.width;
     height = *options.height;
   }
-  std::snprintf(description, sizeof(description), "scale=%d:%d", width, height);
+  std::snprintf(description, sizeof(description), "scale=%d:%d:sws_flags=bilinear", width, height);
@scotts (Contributor):

Not a priority now, but we can do this in a safer way with std::stringstream; it won't require allocating a static sized buffer.

frames = []
for pts in pts_list:
    reader.seek(pts)
    frame = next(reader)
    frames.append(frame["data"].permute(1, 2, 0))
frames_done = timeit.default_timer()
if self._print_each_iteration_time:
    del reader
@scotts (Contributor) commented on Sep 23, 2024:

Are you intending to time how long it takes to do the deallocations of memory on the C++ side? Note that del reader on the Python side will only decrement a reference counter. You can make an explicit call to the garbage collector (see https://docs.python.org/3/library/gc.html#gc.collect), but that's going to do a full collection of all garbage. If you're trying to time how long it takes to deallocate the objects on the C++ side, I don't know if there's a way to do that reliably from Python.
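The distinction can be demonstrated with a short CPython-specific sketch (weakrefs let us observe when an object is actually reclaimed; the immediate-free behavior assumes CPython's reference counting and is illustrative, not part of the PR):

```python
import gc
import weakref

class Resource:
    pass

obj = Resource()
ref = weakref.ref(obj)
del obj                 # drops the last reference; CPython frees it immediately
assert ref() is None

# Objects caught in a reference cycle are only reclaimed by the
# cyclic garbage collector, not by `del` alone:
a, b = Resource(), Resource()
a.other, b.other = b, a
ref_a = weakref.ref(a)
del a, b                # names go away, but the cycle keeps refcounts nonzero
gc.collect()            # the cyclic collector reclaims the pair
assert ref_a() is None
```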

@ahmadsharif1 (Contributor, Author):

This code was for my own debugging/edification and is turned off by default. A private variable controls it.

I can delete it if you want. It's developer-only code -- not user-facing.

Let me know.

I timed the code, btw: when using PyTorch to time it, the aggregate profile doesn't change when I call del decoder. It's just that if you print the iterations, you can see a timing difference once every few iterations.

@scotts (Contributor):

I think the benchmark is easier to understand without it, honestly. If we keep it, then we need a comment explaining that we know this would happen anyway when the function exits, and that this does not actually cause the GC to happen, but doing it here before the function exits allows us to sometimes see the cost of GC.

@ahmadsharif1 (Contributor, Author):

I removed the del line now.


enum ColorConversionLibrary {
  // TODO: Add an AUTO option later.
  // Use the libswscale library for color conversion.
@scotts (Contributor):

Comment looks wrong; should say something about filtergraph.

@ahmadsharif1 (Contributor, Author):

Good catch. Done.

# Save tensor to disk
torch.save(img_tensor, base_filename + ".pt", _use_new_zipfile_serialization=True)
extension = os.path.splitext(img_file)[1]
if extension == ".pt":
@scotts (Contributor):

I'm confused - the stated purpose of this script is to convert images (stored as .bmps) to tensors (stored as .pts). This path seems like the opposite: going from tensors to images. Where do we use this behavior? Since this behavior switch is based only on the filename, I can see this biting us in the future. Can we separate this out into a separate script, or make this an explicit script option?

@ahmadsharif1 (Contributor, Author):

I can revert this, but I want this script to convert from image to tensor and from tensor to image. As coded, it chooses the action based on the extension automatically. It's not easy to get wrong; I find manual options are sometimes easier to get wrong. WDYT?

@ahmadsharif1 (Contributor, Author):

I reverted this change. Can add it back in a different diff.

def test_color_conversion_library(self, color_conversion_library):
    decoder = create_from_file(str(NASA_VIDEO.path))
    _add_video_stream(decoder, color_conversion_library=color_conversion_library)
    frame0, _, _ = get_next_frame(decoder)
@scotts (Contributor):

frame0, *_ = get_next_frame(decoder) will also work, and is more future-proof.
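Extended unpacking makes the call site robust to the return tuple growing; a small illustration with stand-in functions (get_next_frame_v1 and get_next_frame_v2 are hypothetical, not torchcodec APIs):

```python
def get_next_frame_v1():
    return "frame", 0.0, 0.1          # (data, pts, duration)

def get_next_frame_v2():
    return "frame", 0.0, 0.1, "meta"  # a field was added later

# `frame0, *_` ignores however many trailing values there are,
# so the same call site keeps working after the tuple grows:
frame0, *_ = get_next_frame_v1()
frame1, *_ = get_next_frame_v2()
assert frame0 == frame1 == "frame"
```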

@ahmadsharif1 (Contributor, Author):

Great idea. Done.

@facebook-github-bot: @ahmadsharif1 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot: @scotts has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ahmadsharif1 merged commit 1fffd02 into pytorch:main on Sep 26, 2024
22 checks passed