Allow option to use the swscale library for color conversion instead of filtergraph #205
Conversation
Thanks for the PR, I made a quick review. I have further questions / discussion points that I'll take offline for now
auto allocateToConvertDone =
    std::chrono::duration_cast<std::chrono::microseconds>(
        convertDone - allocateDone);
auto total = std::chrono::duration_cast<std::chrono::microseconds>(
Do we want such benchmarking code within the decoder? I think it makes the code significantly harder to read.
I will refactor this in a different diff. It's probably not that bad. A self-profiling decoder is great imo
Things tend to look more obvious in the eyes of the code author :)
In this block there are 12 statements, and 5 of those are dedicated to the core logic. The 7 remaining ones are about benchmarking.
Do we expect such benchmark info to be forever useful, or is this more of a one-time thing?
Before leaving this for a follow-up diff, I would suggest protecting the benchmarking logic within #ifdef blocks, so that it can easily be collapsed in the IDE to ease code review / reading.
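The #ifdef-guarded benchmarking being suggested might look something like the sketch below. This is illustrative only: DECODER_BENCHMARKS and decodeOneFrame are hypothetical names, not the project's actual macro or function.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical build flag; a non-developer build would simply not define it,
// and the benchmarking blocks below would compile away.
#define DECODER_BENCHMARKS 1

int decodeOneFrame() {
#ifdef DECODER_BENCHMARKS
  auto start = std::chrono::steady_clock::now();
#endif
  // Stand-in for the core decode logic the timing code surrounds.
  int framePixels = 640 * 480;
#ifdef DECODER_BENCHMARKS
  auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
      std::chrono::steady_clock::now() - start);
  std::printf("decode took %lld us\n",
              static_cast<long long>(elapsed.count()));
#endif
  return framePixels;
}
```

An IDE will fold each #ifdef block, so a reader reviewing the core logic can collapse all timing code at once.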
I think the benchmark would be useful forever.
I think #ifdefs may make it even harder to read.
Later on I plan to add a timing class whose code will itself be #ifdef'd out for non-developer builds.
Should I really add #ifdefs in this diff? Leaving this unresolved.
I actually agree with @NicolasHug that this code is now hard to read. My reasons, some of which are the same as what Nicolas said:
- The timing code takes up more lines than the actual logic.
- Because we're using auto as the type, I have to actually read each timing line to realize it's a timing line; I can't just skim over them and focus on the actual logic. (Unfortunately, explicitly stating the return type would be even noisier.)
- We're not using any blank lines to group code together. "Paragraphs" of code would help me figure out what I can ignore as timing, since I'd expect each paragraph to have a start and a stop timing call.
This timing code is very granular. I understand that in this PR, the amount of time just a few statements took was key to getting a big perf win, but I don't think that necessarily means we should keep it all the time. I consider it similar to putting in print statements when debugging. Some print statements graduate to actual logging, but most don't, and we delete most debugging print statements at commit time.
I would prefer either:
- We remove the timing code in this PR, and wait until we have a readable abstraction to commit it. OR:
- We limit the timing code to function boundaries. Yes, that will not be enough to fully diagnose perf problems, but it's enough to know where to start digging.
Disagree and commit. Removed those lines.
// There is a bug in sws_scale where it doesn't handle non-multiple-of-16
// widths.
// https://stackoverflow.com/questions/74351955/turn-off-sw-scale-conversion-to-planar-yuv-32-byte-alignment-requirements
// In that case we are forced to use a filtergraph to do the color conversion.
Did that bug affect the benchmarks in any way?
Also, at what level do we want the user to know: docs, warnings, error?
At the moment we won't warn anyone since there are no users. Ideally we don't warn and just do the right thing under the AUTO enum.
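The "do the right thing under AUTO" idea could be sketched as below. The function and enum layout here are assumptions for illustration (resolveLibrary is not the project's API); the width check reflects the sws_scale non-multiple-of-16 bug discussed above.

```cpp
enum class ColorConversionLibrary { AUTO, SWSCALE, FILTERGRAPH };

// Hypothetical AUTO resolution: honor an explicit choice, otherwise fall
// back to filtergraph when the width would trigger the sws_scale bug.
ColorConversionLibrary resolveLibrary(
    ColorConversionLibrary requested, int width) {
  if (requested != ColorConversionLibrary::AUTO) {
    return requested;
  }
  return (width % 16 == 0) ? ColorConversionLibrary::SWSCALE
                           : ColorConversionLibrary::FILTERGRAPH;
}
```

With this shape, callers never see a warning; AUTO silently picks a backend that is known to be correct for the frame width.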
@@ -325,7 +339,7 @@ void VideoDecoder::initializeFilterGraphForStream(
     width = *options.width;
     height = *options.height;
   }
-  std::snprintf(description, sizeof(description), "scale=%d:%d", width, height);
+  std::snprintf(description, sizeof(description), "scale=%d:%d:sws_flags=bilinear", width, height);
Not a priority now, but we can do this in a safer way with std::stringstream; it won't require allocating a statically sized buffer.
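A minimal sketch of the std::stringstream alternative, assuming width and height are the values computed earlier in the function (buildScaleDescription is a hypothetical helper name):

```cpp
#include <sstream>
#include <string>

// Builds the filtergraph description string without a fixed-size char
// buffer, so there is no length to get wrong.
std::string buildScaleDescription(int width, int height) {
  std::ostringstream description;
  description << "scale=" << width << ":" << height << ":sws_flags=bilinear";
  return description.str();
}
```

The resulting std::string can be passed to the filtergraph setup via .c_str() where a const char* is expected.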
frames = []
for pts in pts_list:
    reader.seek(pts)
    frame = next(reader)
    frames.append(frame["data"].permute(1, 2, 0))
frames_done = timeit.default_timer()
if self._print_each_iteration_time:
    del reader
Are you intending to time how long it takes to do the deallocations of memory on the C++ side? Note that del reader on the Python side will only decrement a reference counter. You can make an explicit call to the garbage collector (see https://docs.python.org/3/library/gc.html#gc.collect), but that's going to do a full collection of all garbage. If you're trying to time how long it takes to deallocate the objects on the C++ side, I don't know if there's a way to do that reliably from Python.
This code was for my own debugging/edification and is turned off by default. A private variable controls it.
I can delete it if you want. It's developer-only code -- not user-facing.
Let me know.
I timed the code btw -- when using pytorch to time it, the aggregate profile doesn't change when I call del decoder on it. It's just that if you print the iters, you can see a timing difference once in every few iters.
I think the benchmark is easier to understand without it, honestly. If we keep it, then we need a comment explaining that we know this would happen anyway when the function exits, that it does not actually cause the GC to happen, and that doing it here before the function exits allows us to sometimes see the cost of deallocation.
I removed the del line now
enum ColorConversionLibrary {
  // TODO: Add an AUTO option later.
  // Use the libswscale library for color conversion.
Comment looks wrong; should say something about filtergraph.
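The fix the reviewer is asking for might read as below; the value names and the second comment are assumptions based on the PR title, not the project's actual enum.

```cpp
// Sketch: each enumerator gets a comment describing its own backend, so the
// swscale comment no longer sits above a filtergraph value (or vice versa).
enum ColorConversionLibrary {
  // TODO: Add an AUTO option later.
  // Use the libswscale library for color conversion.
  SWSCALE,
  // Use the filtergraph API for color conversion.
  FILTERGRAPH,
};
```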
Good catch. Done.
test/convert_image_to_tensor.py
# Save tensor to disk
torch.save(img_tensor, base_filename + ".pt", _use_new_zipfile_serialization=True)
extension = os.path.splitext(img_file)[1]
if extension == ".pt":
I'm confused - the stated purpose of this script is to convert images (stored as .bmps) to tensors (stored as .pts). This path seems like the opposite: going from tensors to images. Where do we use this behavior? Since this behavior switch is based only on the filename, I can see this biting us in the future. Can we separate this out into a separate script, or make this an explicit script option?
I can revert this, but I want this script to convert from image to tensor and from tensor to image. As coded, it does the action based on the extension automatically. It's not easy to get wrong -- I find manual options are sometimes easier to get wrong. WDYT?
I reverted this change. Can add it back in a different diff
def test_color_conversion_library(self, color_conversion_library):
    decoder = create_from_file(str(NASA_VIDEO.path))
    _add_video_stream(decoder, color_conversion_library=color_conversion_library)
    frame0, _, _ = get_next_frame(decoder)
frame0, *_ = get_next_frame(decoder) will also work, and is more future-proof.
Great idea. Done.
@ahmadsharif1 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@scotts has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This PR does the following:
Benchmark results show performance improvement: