Allow option to use the swscale library for color conversion instead of filtergraph #205

Merged: 33 commits into pytorch:main on Sep 26, 2024

Conversation

@ahmadsharif1 (Contributor) commented on Sep 9, 2024:

This PR does the following:

  1. Refactor the VideoDecoder.cpp code so we can decouple frame allocation from decoding and color-conversion.
  2. Allow the option of using sws_scale for color conversion instead of filtergraph. sws_scale is faster than filtergraph at color-converting smaller videos. Moreover, filtergraph's setup and teardown overhead is much larger than sws_scale's.
  3. benchmark_decoders.py: Added some code to the benchmark to print more detailed stats if a flag is turned on.
  4. benchmark_decoders.py: Added code to pass in options to torchcodec decoders.
  5. Use bilinear interpolation whenever we are scaling. I will test this against torchvision's bilinear scaling in a subsequent diff.
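The choice between the two libraries can be sketched as a simple heuristic. The helper below is hypothetical Python for illustration only (the real logic lives in VideoDecoder.cpp); the multiple-of-32 width check mirrors this PR's original title and the swscale alignment issue discussed in the review below:

```python
def choose_color_conversion_library(width: int) -> str:
    """Hypothetical helper: swscale mishandles widths that do not meet
    its alignment requirements, so fall back to filtergraph for those.
    The multiple-of-32 check mirrors the original PR title."""
    if width % 32 == 0:
        return "swscale"
    return "filtergraph"

# The 96x96 and 640x360 benchmark videos both qualify for swscale:
assert choose_color_conversion_library(96) == "swscale"
assert choose_color_conversion_library(640) == "swscale"
assert choose_color_conversion_library(100) == "filtergraph"
```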

Benchmark results show performance improvement:

python benchmarks/decoders/benchmark_decoders.py --bm_video_paths=/home/ahmads/personal/lerobot/benchmarks/video/outputs/video_benchmark/videos/libx264_yuv420p_None_None/lerobot_pusht_image.mp4,/home/ahmads/personal/lerobot/benchmarks/video/outputs/video_benchmark/videos/libx265_yuv420p_None_None/lerobot_pusht_image.mp4,/home/ahmads/jupyter/carmel1.mp4 --decoders=torchvision,tcoptions,tcoptions:color_conversion_library=swsscale+num_threads=1,tcoptions:color_conversion_library=filtergraph+num_threads=1,tcbatchoptions,tcbatchoptions:num_threads=1+color_conversion_library=swsscale,tcbatchoptions:num_threads=1+color_conversion_library=filtergraph --bm_video_speed_min_run_seconds=10
[ video=/home/ahmads/personal/lerobot/benchmarks/video/outputs/video_benchmark/videos/libx264_yuv420p_None_None/lerobot_pusht_image.mp4 h264 96x96, 16.1s 10.0fps ]
                                                                                     |  10 seek()+next()  |  1 next()  |  10 next()
1 threads: ------------------------------------------------------------------------------------------------------------------------
      TorchcodecNonCompiled:color_conversion_library=filtergraph+num_threads=1       |        12.5        |    4.5     |     5.2   
      TVNewAPIDecoderWithBackendVideoReader                                          |        66.8        |    4.4     |     5.8   
      TorchcodecNonCompiled:color_conversion_library=swsscale+num_threads=1          |         8.9        |    1.1     |     1.8   
      TorchCodecNonCompiledBatch:num_threads=1+color_conversion_library=filtergraph  |        12.6        |    4.9     |     5.8   
      TorchCodecNonCompiledBatch:num_threads=1+color_conversion_library=swsscale     |         8.7        |    1.5     |     2.1   

Times are in milliseconds (ms).

[ video=/home/ahmads/personal/lerobot/benchmarks/video/outputs/video_benchmark/videos/libx265_yuv420p_None_None/lerobot_pusht_image.mp4 hevc 96x96, 16.1s 10.0fps ]
                                                                                     |  10 seek()+next()  |  1 next()  |  10 next()
1 threads: ------------------------------------------------------------------------------------------------------------------------
      TorchcodecNonCompiled:color_conversion_library=filtergraph+num_threads=1       |        13.5        |    4.2     |     5.5   
      TVNewAPIDecoderWithBackendVideoReader                                          |        68.7        |    3.8     |     5.2   
      TorchcodecNonCompiled:color_conversion_library=swsscale+num_threads=1          |         9.7        |    1.0     |     1.8   
      TorchCodecNonCompiledBatch:num_threads=1+color_conversion_library=filtergraph  |        13.9        |    4.6     |     5.6   
      TorchCodecNonCompiledBatch:num_threads=1+color_conversion_library=swsscale     |         9.2        |    1.4     |     2.0   

Times are in milliseconds (ms).

[------------------------------- video=/home/ahmads/jupyter/carmel1.mp4 h264 640x360, 1.3s 30.0fps -------------------------------]
                                                                                     |  10 seek()+next()  |  1 next()  |  10 next()
1 threads: ------------------------------------------------------------------------------------------------------------------------
      TorchcodecNonCompiled:color_conversion_library=filtergraph+num_threads=1       |        44.6        |    9.2     |     27.5  
      TVNewAPIDecoderWithBackendVideoReader                                          |       255.2        |    9.7     |     28.3  
      TorchcodecNonCompiled:color_conversion_library=swsscale+num_threads=1          |        40.8        |    4.5     |     14.0  
      TorchCodecNonCompiledBatch:num_threads=1+color_conversion_library=filtergraph  |        37.4        |    8.7     |     17.9  
      TorchCodecNonCompiledBatch:num_threads=1+color_conversion_library=swsscale     |        31.7        |    4.0     |     12.5  

Times are in milliseconds (ms).

[ video=/home/ahmads/personal/lerobot/benchmarks/video/outputs/video_benchmark/videos/libx264_yuv420p_None_None/lerobot_pusht_image.mp4 h264 640x360, 1.3s 30.0fps ]
                             |  create()+next()
1 threads: ------------------------------------
      TorchcodecNonCompiled  |        11.6     

Times are in milliseconds (ms).
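The --decoders specs used in the benchmark command above pack per-decoder options into strings of the form `<decoder>[:<key>=<value>[+<key>=<value>...]]`. A hypothetical parser for that format (the function name is illustrative, not the benchmark's actual code):

```python
def parse_decoder_spec(spec: str):
    """Split a spec like 'tcoptions:color_conversion_library=swsscale+num_threads=1'
    into a decoder name and an options dict. Illustrative only."""
    decoder, _, opts = spec.partition(":")
    options = dict(pair.split("=", 1) for pair in opts.split("+")) if opts else {}
    return decoder, options

name, opts = parse_decoder_spec(
    "tcoptions:color_conversion_library=swsscale+num_threads=1")
# name == "tcoptions"
# opts == {"color_conversion_library": "swsscale", "num_threads": "1"}
```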

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Sep 9, 2024
@ahmadsharif1 changed the title from "Use sws_scale for color conversion when the width is a multiple of 32" to "Allow option to use sws_scale for color conversion" on Sep 10, 2024
@ahmadsharif1 marked this pull request as ready for review on September 10, 2024 at 20:44
@ahmadsharif1 requested a review from scotts on September 10, 2024 at 20:44
@NicolasHug (Member) left a comment:
Thanks for the PR, I made a quick review. I have further questions / discussion points that I'll take offline for now

auto allocateToConvertDone =
    std::chrono::duration_cast<std::chrono::microseconds>(
        convertDone - allocateDone);
auto total = std::chrono::duration_cast<std::chrono::microseconds>(
@NicolasHug (Member):

Do we want such benchmarking code within the decoder? I think it makes the code significantly harder to read.

@ahmadsharif1 (Contributor, Author):

I will refactor this in a different diff. It's probably not that bad; a self-profiling decoder is great, imo.

@NicolasHug (Member):

Things tend to look more obvious in the eyes of the code author :)

In this block there are 12 statements, and 5 of those are dedicated to the core logic. The 7 remaining ones are about benchmarking.

Do we expect such benchmark info to be forever useful, or is this more of a one-time thing?

Before leaving this for a follow-up diff, I would suggest protecting the benchmarking logic within #ifdef blocks, so that it can easily be collapsed in the IDE to ease code review and reading.

@ahmadsharif1 (Contributor, Author):

I think the benchmark would be useful forever.

I think #ifdefs may make it even harder to read.

Later on I plan to add a timing class that will have its own code #ifdefs commented out for non-developer builds.

Should I really add #ifdefs in this diff? Leaving this unresolved.

@scotts (Contributor):

I actually agree with @NicolasHug that this code is now hard to read. My reasons, some of which are the same as what Nicolas said:

  1. The timing code takes up more lines than the actual logic.
  2. Because we're using auto as the type, I have to actually read each timing line to realize it's a timing line. I can't just skim over them and focus on the actual logic. (Unfortunately, explicitly stating the return type would be even noisier.)
  3. We're not using any new lines to group code together. "Paragraphs" of code would help me figure out what I can ignore as timing, as I expect a paragraph of code to have a start and stop timing call.

This timing code is very granular. I understand that in this PR, the amount of time just a few statements took was key to getting a big perf win, but I don't think that necessarily means we should keep it all the time. I consider it similar to putting in print statements when debugging. Some print statements graduate to actual logging, but most don't, and we delete most debugging print statements at commit time.

I would prefer either:

  1. We remove the timing code in this PR, and wait until we have a readable abstraction to commit it. OR:
  2. We limit the timing code to function boundaries. Yes, that will not be enough to fully diagnose perf problems, but it's enough to know where to start digging.
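Option 2, timing at function boundaries only, could look like this minimal sketch. It is a hypothetical Python decorator (the decoder's timing code is C++); it only illustrates the granularity trade-off of timing whole calls instead of individual statements:

```python
import functools
import time

def timed(fn):
    """Hypothetical decorator: record how long whole calls take,
    instead of timing individual statements inside the function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    return wrapper

@timed
def decode_frame():
    time.sleep(0.001)  # stand-in for real decoding work
    return "frame"

decode_frame()
# decode_frame.last_elapsed now holds the call's wall-clock duration
```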

@ahmadsharif1 (Contributor, Author):

Disagree and commit. Removed those lines.

Comment on lines 422 to 425
// There is a bug in sws_scale where it doesn't handle non-multiple of 16
// widths.
// https://stackoverflow.com/questions/74351955/turn-off-sw-scale-conversion-to-planar-yuv-32-byte-alignment-requirements
// In that case we are forced to use a filtergraph to do the color conversion.
@NicolasHug (Member):

Did that bug affect the benchmarks in any way?
Also, at what level do we want the user to know: docs, warnings, error?

@ahmadsharif1 (Contributor, Author):

At the moment we won't warn anyone since there are no users. Ideally we don't warn and just do the right thing under the AUTO enum.

@ahmadsharif1 changed the title from "Allow option to use sws_scale for color conversion" to "Allow option to use the swscale library for color conversion instead of filtergraph" on Sep 11, 2024
@@ -325,7 +339,7 @@ void VideoDecoder::initializeFilterGraphForStream(
     width = *options.width;
     height = *options.height;
   }
-  std::snprintf(description, sizeof(description), "scale=%d:%d", width, height);
+  std::snprintf(description, sizeof(description), "scale=%d:%d:sws_flags=bilinear", width, height);
@scotts (Contributor):

Not a priority now, but we can do this in a safer way with std::stringstream; it won't require allocating a static sized buffer.

frames = []
for pts in pts_list:
    reader.seek(pts)
    frame = next(reader)
    frames.append(frame["data"].permute(1, 2, 0))
frames_done = timeit.default_timer()
if self._print_each_iteration_time:
    del reader
@scotts (Contributor) commented on Sep 23, 2024:

Are you intending to time how long it takes to do the deallocations of memory on the C++ side? Note that del reader on the Python side will only decrement a reference counter. You can make an explicit call to the garbage collector (see https://docs.python.org/3/library/gc.html#gc.collect), but that's going to do a full collection of all garbage. If you're trying to time how long it takes to deallocate the objects on the C++ side, I don't know if there's a way to do that reliably from Python.
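The distinction can be demonstrated with a short CPython-specific sketch (weakrefs let us observe when an object is actually reclaimed; the immediate-free behavior assumes CPython's reference counting and is illustrative, not part of the PR):

```python
import gc
import weakref

class Resource:
    pass

obj = Resource()
ref = weakref.ref(obj)
del obj                 # drops the last reference; CPython frees it immediately
assert ref() is None

# Objects caught in a reference cycle are only reclaimed by the
# cyclic garbage collector, not by `del` alone:
a, b = Resource(), Resource()
a.other, b.other = b, a
ref_a = weakref.ref(a)
del a, b                # names go away, but the cycle keeps refcounts nonzero
gc.collect()            # the cyclic collector reclaims the pair
assert ref_a() is None
```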

@ahmadsharif1 (Contributor, Author):

This code was for my own debugging/edification and is turned off by default. A private variable controls it.

I can delete it if you want. It's developer-only code -- not user-facing.

Let me know.

I timed the code, btw: when using PyTorch to time it, the aggregate profile doesn't change when I call del decoder. It's just that if you print the iterations, you can see a timing difference once every few iterations.

@scotts (Contributor):

I think the benchmark is easier to understand without it, honestly. If we keep it, then we need a comment explaining that we know this would happen anyway when the function exits, and that this does not actually cause the GC to happen, but doing it here before the function exits allows us to sometimes see the cost of GC.

@ahmadsharif1 (Contributor, Author):

I removed the del line now.


enum ColorConversionLibrary {
  // TODO: Add an AUTO option later.
  // Use the libswscale library for color conversion.
@scotts (Contributor):

Comment looks wrong; should say something about filtergraph.

@ahmadsharif1 (Contributor, Author):

Good catch. Done.

# Save tensor to disk
torch.save(img_tensor, base_filename + ".pt", _use_new_zipfile_serialization=True)
extension = os.path.splitext(img_file)[1]
if extension == ".pt":
@scotts (Contributor):

I'm confused - the stated purpose of this script is to convert images (stored as .bmps) to tensors (stored as .pts). This path seems like the opposite: going from tensors to images. Where do we use this behavior? Since this behavior switch is based only on the filename, I can see this biting us in the future. Can we separate this out into a separate script, or make this an explicit script option?

@ahmadsharif1 (Contributor, Author):

I can revert this, but I want this script to convert from image to tensor and from tensor to image. As coded, it chooses the action based on the extension automatically. It's not easy to get wrong; I find manual options are sometimes easier to get wrong. WDYT?

@ahmadsharif1 (Contributor, Author):

I reverted this change. Can add it back in a different diff.

def test_color_conversion_library(self, color_conversion_library):
    decoder = create_from_file(str(NASA_VIDEO.path))
    _add_video_stream(decoder, color_conversion_library=color_conversion_library)
    frame0, _, _ = get_next_frame(decoder)
@scotts (Contributor):

frame0, *_ = get_next_frame(decoder) will also work, and is more future-proof.
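Extended unpacking makes the call site robust to the return tuple growing; a small illustration with stand-in functions (get_next_frame_v1 and get_next_frame_v2 are hypothetical, not torchcodec APIs):

```python
def get_next_frame_v1():
    return "frame", 0.0, 0.1          # (data, pts, duration)

def get_next_frame_v2():
    return "frame", 0.0, 0.1, "meta"  # a field was added later

# `frame0, *_` ignores however many trailing values there are,
# so the same call site keeps working after the tuple grows:
frame0, *_ = get_next_frame_v1()
frame1, *_ = get_next_frame_v2()
assert frame0 == frame1 == "frame"
```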

@ahmadsharif1 (Contributor, Author):

Great idea. Done.

@facebook-github-bot: @ahmadsharif1 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot: @scotts has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ahmadsharif1 merged commit 1fffd02 into pytorch:main on Sep 26, 2024
22 checks passed