Audio decoding support: range-based core API #538
    tensors.push_back(frameOutput.data);
  } catch (const EndOfFileException& e) {
    reachedEOF = true;
  }
Q about C++ best practices: I realize we're already doing it in a few places (like custom ops), but is it a good practice to use exceptions for control flow? Maybe the reachedEOF flag from decodeAVFrame() could be a stateful attribute instead? (Not that I find statefulness appealing either!)
Since we shouldn't ordinarily decode past the end of a file, I think it makes sense for us to throw exceptions when we reach the end of a file. Here, we're not really using an exception for control flow per se. That is, we're not trying to read past the end of the file, we just have to handle the case that we might.
With that said, I do find it more natural when the "normal" stop conditions are explicitly part of the while loop's condition, as opposed to setting a boolean inside the loop. But since you're depending on the internal state of the decoder to know the last decoded frame info, I don't know if that's possible. When I implemented something similar, I ended up using a priming read to get around this problem.
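The priming-read pattern mentioned above can be sketched in plain Python. This is a toy illustration, not the actual decoder API: decode_next_frame is a hypothetical callable that returns None at end of file instead of throwing.

```python
def decode_all(decode_next_frame):
    """Collect frames using a priming read: fetch the first frame before
    the loop so the stop condition lives in the while clause itself,
    rather than in a boolean set inside the loop body."""
    frames = []
    frame = decode_next_frame()  # priming read
    while frame is not None:     # explicit "normal" stop condition
        frames.append(frame)
        frame = decode_next_frame()
    return frames

# Toy decoder: yields three "frames", then signals EOF with None.
data = iter([b"f0", b"f1", b"f2"])
decoder = lambda: next(data, None)
```

The trade-off is exactly the one described above: the priming read duplicates the decode call, but the loop condition reads as "while there is a frame".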
Also, nit: for local variables used in a small space with a clear purpose, I prefer shorter names. So even stop as the boolean would make this easier for me to read.
OK. I'll use finished instead of stop, because we already have local variables named stopPts and stopSeconds in this function.
    asset.duration_seconds
)
assert decoder.metadata.sample_rate == asset.sample_rate
assert decoder.metadata.num_channels == asset.num_channels
Changes here are mostly a drive-by. I'm extending this one because I removed test_audio_get_json_metadata, which was outdated.
return get_frames_by_pts_in_range_audio(
    decoder, start_seconds=start_seconds, stop_seconds=stop_seconds
)
This test is complex enough and I didn't want to obfuscate it further with pts-to-index conversions, so I created this stateless helper.
Shouldn't this helper get its reference from the test utils? Right now it's getting frames by decoding the file, which means we're not actually comparing against a reference - or am I missing something here?
You're correct. This test compares a stateless decoder (treated as the ref) with a stateful decoder (which is what users interact with). So it still asserts what we need to assert.
Relying on the references would mean converting all the timestamps into indices, and as mentioned in my comment just above, I wanted to avoid complicating this test further.
Eventually we will update this test (i.e. when we enable backwards seeking), at which point we could just rely on the reference frames.
I was hoping this would make reviewing easier, but apparently it's bringing more confusion :p. If you prefer converting to indices straight away, let me know.
Ohhhh, okay, I had misunderstood what you meant about avoiding the conversions. It's fine to leave as-is, but let's put in a comment explaining that we're comparing a decoder which only seeks once against a decoder which seeks multiple times, along with a TODO to convert it to loading the reference frames from indices.
if frame_info.pts_seconds
<= pts_seconds
< frame_info.pts_seconds + frame_info.duration_seconds
)
I have to admit I don't completely understand what's going on here. I get that we have a generator that does a linear walk through the frames, searching for the first frame that meets our conditions, but how does passing that generator to next() get us back the one index we want?
Also, it may be better to do this once, in __post_init__(), and set up a mapping rather than doing it on every call.
Unfortunately we can't build a mapping for this because we are mapping contiguous timestamps to integers. I'll add a comment that bisect might make things faster if needed (although for such small arrays, the tests run very fast).
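To make the bisect idea concrete, here is a sketch under stated assumptions: FrameInfo is a minimal stand-in for the real frame metadata, and the frames are sorted by pts. bisect turns the linear scan into a binary search on the frame start times.

```python
import bisect
from dataclasses import dataclass


@dataclass
class FrameInfo:  # minimal stand-in for the real frame metadata
    pts_seconds: float
    duration_seconds: float


frames = [FrameInfo(0.0, 0.5), FrameInfo(0.5, 0.5), FrameInfo(1.0, 0.5)]
starts = [f.pts_seconds for f in frames]  # sorted by construction


def index_for_pts(pts_seconds):
    # Rightmost frame whose start is <= pts_seconds: O(log n) vs O(n).
    i = bisect.bisect_right(starts, pts_seconds) - 1
    f = frames[i]
    assert f.pts_seconds <= pts_seconds < f.pts_seconds + f.duration_seconds
    return i
```

As noted above, for the small arrays in the tests the linear scan is perfectly fine; this only matters if the frame count grows.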
next(it) just returns the very first entry in it, whether it is a list, a tuple, a generator, etc. And then we build the generator in such a way that the first frame in the generator is the frame we want. What's going on is exactly the same as the snippet below:
>>> gen = (i for i in range(10) if i > 5)
>>> next(gen)
6
The values in [0, 5] were never part of gen, they were only part of the output from range(). The first entry in gen is 6.
how does passing that generator to next() get us back the one index we want?
We iterate over (frame_index, frame_info) tuples, filter those that don't meet our condition, and then only store frame_index within the generator:
>>> list(i for (i, j) in zip((1, 2, 3), ("a", "b", "c")))
[1, 2, 3]
>>> list(j for (i, j) in zip((1, 2, 3), ("a", "b", "c")))
['a', 'b', 'c']
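Putting the two pieces together, the pattern under discussion looks roughly like this. FrameInfo fields are assumed from the surrounding context; the frame values are made up for illustration.

```python
from collections import namedtuple

FrameInfo = namedtuple("FrameInfo", ["pts_seconds", "duration_seconds"])
all_frames = [FrameInfo(0.0, 0.5), FrameInfo(0.5, 0.5), FrameInfo(1.0, 0.5)]


def frame_index_for(pts_seconds):
    # The generator yields only the indices of frames whose time span
    # contains pts_seconds; next() pulls out the first such index,
    # which for non-overlapping frames is the one we want.
    return next(
        i
        for i, frame_info in enumerate(all_frames)
        if frame_info.pts_seconds
        <= pts_seconds
        < frame_info.pts_seconds + frame_info.duration_seconds
    )
```

Note that if no frame matches, next() raises StopIteration, so the real code presumably only calls this with timestamps known to be in range.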
Wow! I never thought about it that way, in that next() on a generator is effectively like lst[0] on a sequence.
This PR adds the get_frames_by_pts_in_range_audio(start_seconds, stop_seconds=None) -> Tensor core API.
- The output is of shape (num_channels, num_samples).
- The output's pts can be inferred from start_seconds and the sample rate, so I'm leaving it out.
- The output is not of shape (num_frames, num_channels, num_samples_per_frame), and it never will be, because audio frames generally contain a variable number of samples. I found out the hard way.
- stop_seconds is None by default, so that users can decode to the end of the file without knowing what the duration is. Setting stop_seconds=<some super high value> doesn't raise an error either. (Note that IMHO we should extend this to all range-based APIs: Extend SimpleVideoDecoder index-based APIs #150.)
We've discussed a lot offline already, so I won't be writing down everything that went into the design decisions here. But once everything is done, I'll make sure to write down a note in the code that documents why audio decoding is implemented the way it is.
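A toy sketch of why the output is (num_channels, num_samples) rather than a stacked per-frame tensor. This uses plain Python lists, not the actual decoder output: frames with differing sample counts cannot be stacked along a frame dimension, but they can always be concatenated along the sample dimension.

```python
# Hypothetical decoded audio frames: 2 channels each, but the number of
# samples per frame varies (3, 5, 2), so they cannot be stacked into a
# (num_frames, num_channels, num_samples_per_frame) array.
frames = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],  # 2 channels x 3 samples
    [[0.1] * 5, [0.2] * 5],              # 2 channels x 5 samples
    [[0.7, 0.8], [0.9, 1.0]],            # 2 channels x 2 samples
]


def concat_frames(frames):
    # Concatenate along the sample dimension -> (num_channels, num_samples).
    num_channels = len(frames[0])
    return [
        [sample for frame in frames for sample in frame[ch]]
        for ch in range(num_channels)
    ]


out = concat_frames(frames)  # 2 channels, 3 + 5 + 2 = 10 samples each
```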
Preliminary benchmarks against torchaudio (built from source) are very promising, even for this non-optimized first version:
Code:
There will be plenty of follow-ups, mainly:
- sample_rate
- AudioDecoder
- the normalize parameter of the torchaudio reader, which allows users to specify whether they want a float tensor in [-1, 1], or a tensor with the same dtype as the audio format. We'll figure that out later. We'll probably always return float tensors by default anyway.
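For context on that last bullet, here is a minimal sketch of the usual int16-to-float convention behind a normalize-style flag. This mirrors the common PCM convention, not torchaudio's actual implementation; real readers also handle other sample formats and dtypes.

```python
def normalize_int16(samples):
    # Map int16 PCM samples to floats in [-1.0, 1.0) by dividing by 2**15.
    # With normalization off, a reader would instead return the raw
    # integer samples unchanged.
    return [s / 32768.0 for s in samples]
```

The asymmetry (the most negative value maps exactly to -1.0, while the most positive maps to just under 1.0) is inherent to two's-complement int16 and is one of the details "we'll figure out later" alludes to.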