[Audio] Support Audio Datasets #1085

kylesayrs · 2025-01-20T20:53:16Z

Purpose

Support oneshot with audio datasets

Changes

Extend apply_pad_mask_to_batch to handle batches that have no input_ids and batches that may contain decoder_input_ids
Extend TextGenerationDataset to detect whether a dataset is already tokenized by checking processor.model_input_names rather than only input_ids
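For readers unfamiliar with the two behaviours described above, here is a minimal, dependency-free sketch. It is not the code from this PR: `mask_padded_ids` and `is_already_tokenized` are hypothetical names, plain nested lists stand in for tensors, the mask is assumed to have the same shape as the ids it is applied to, and the fallback to `["input_ids"]` when a processor lacks `model_input_names` is an assumption.

```python
from types import SimpleNamespace
from typing import Dict, Sequence


def mask_padded_ids(batch: Dict[str, list]) -> Dict[str, list]:
    """Zero out ids at padded positions (hypothetical stand-in, not the library code)."""
    mask = batch.get("attention_mask")
    if mask is None:
        # Nothing to mask with, e.g. an audio batch holding only input_features
        return batch

    masked = dict(batch)
    # Assumption: the same 0/1 mask applies to whichever id fields are present
    for key in ("input_ids", "decoder_input_ids"):
        if key in batch:
            masked[key] = [
                [token * keep for token, keep in zip(row, mask_row)]
                for row, mask_row in zip(batch[key], mask)
            ]
    return masked


def is_already_tokenized(column_names: Sequence[str], processor: object) -> bool:
    """Treat a dataset as tokenized if any column is a model input of the processor."""
    # Assumption: fall back to ["input_ids"] when the attribute is missing
    model_input_names = getattr(processor, "model_input_names", ["input_ids"])
    return any(name in model_input_names for name in column_names)


# Stand-in for a whisper-style processor; only the attribute under discussion is modelled
whisper_like = SimpleNamespace(model_input_names=["input_features"])
print(is_already_tokenized(["input_features", "labels"], whisper_like))
print(mask_padded_ids({"input_ids": [[5, 6, 7]], "attention_mask": [[1, 1, 0]]}))
```

Checking column names against `model_input_names` lets a pre-processed audio dataset, which has `input_features` but no `input_ids`, skip tokenization in the same way a pre-tokenized text dataset does.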

Testing

Ran test_processors.py to completion, which verifies that the model_input_names attribute is defined for most processors
Ran whisper to completion in [Audio] Qwen Audio Example #1082

test_processors.py

```python
import pytest
from transformers import AutoProcessor

@pytest.mark.parametrize(
    "model_id,expected",
    [
        ("meta-llama/Meta-Llama-3-8B-Instruct", ["input_ids", "attention_mask"]),
        ("mistralai/Mixtral-8x7B-Instruct-v0.1", ["input_ids", "attention_mask"]),
        (
            "Qwen/Qwen2-VL-2B-Instruct",
            [
                "input_ids",
                "attention_mask",
                "pixel_values",
                "image_grid_thw",
                "pixel_values_videos",
                "video_grid_thw",
            ],
        ),
        ("mgoin/pixtral-12b", ["input_ids", "attention_mask", "pixel_values"]),
        ("openai/whisper-large-v2", ["input_features"]),
        (
            "Qwen/Qwen2-Audio-7B-Instruct",
            ["input_ids", "attention_mask", "input_features", "feature_attention_mask"],
        ),
    ],
)
def test_processor_model_input_names(model_id, expected):
    """
    Tests the model_input_names attribute of common model processors
    """

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    assert processor.model_input_names == expected
```

Signed-off-by: Kyle Sayers <[email protected]>

github-actions · 2025-01-20T20:53:27Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

src/llmcompressor/transformers/finetune/data/base.py

Signed-off-by: Kyle Sayers <[email protected]>

src/llmcompressor/modifiers/utils/pytorch_helpers.py

Signed-off-by: Kyle Sayers <[email protected]>

## Purpose ##
* Provide a predefined audio dataset for
  * Testing traceability of audio models
  * e2e tests with audio models
  * Simpler examples (blog)

## Prerequisites ##
* #1030
* #1085

## Changes ##
* Implement `PeoplesSpeech` dataset
  * Because of the more complex nature of audio processors, this dataset needs to hardcode some processing logic specific to models
  * Assumes that most processing is similar to whisper processing, which seems to be the standard
  * Because processing changes depending on the model, this means mapped outputs cannot be cached
* Add `load_from_cache_file` argument to preprocessing mapping (this was overlooked before)
* Integrate dataset with tracing debugger tool

## Testing ##
```bash
llmcompressor.trace \
    --model_id openai/whisper-large-v2 \
    --model_class TraceableWhisperForConditionalGeneration \
    --modality audio
```

Traceable definition of qwen2_audio is not finished yet, but this loads and is accepted as valid input
```bash
llmcompressor.trace \
    --model_id Qwen/Qwen2-Audio-7B \
    --model_class Qwen2AudioForConditionalGeneration \
    --modality audio
```

---------

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs added 2 commits January 20, 2025 20:46

support audio datasets

2f3a416

Signed-off-by: Kyle Sayers <[email protected]>

mask decoder_input_ids

74283e8

Signed-off-by: Kyle Sayers <[email protected]>

This was referenced Jan 20, 2025

[Audio] Qwen Audio Example #1082

Closed

[Audio] People's Speech dataset and tracer tool #1086

Merged

kylesayrs self-assigned this Jan 20, 2025

kylesayrs added the ready label Jan 20, 2025

dsikka reviewed Jan 20, 2025

src/llmcompressor/transformers/finetune/data/base.py (resolved)

add comment

498e598

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs requested a review from dsikka January 22, 2025 14:58

Merge branch 'main' into kylesayrs/audio-datasets

bbf26c6

dsikka previously approved these changes Jan 22, 2025

Merge branch 'main' into kylesayrs/audio-datasets

9aab7b9

horheynm reviewed Jan 22, 2025

src/llmcompressor/modifiers/utils/pytorch_helpers.py (outdated, resolved)

rewrite for clarity

c862c0f

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs dismissed dsikka’s stale review via c862c0f January 22, 2025 21:09

kylesayrs requested review from dsikka and horheynm January 22, 2025 21:10

Merge branch 'main' into kylesayrs/audio-datasets

a476597

dsikka approved these changes Jan 22, 2025

horheynm approved these changes Jan 22, 2025

mgoin approved these changes Jan 22, 2025

mgoin merged commit fb01d66 into main Jan 22, 2025
6 of 7 checks passed

mgoin deleted the kylesayrs/audio-datasets branch January 22, 2025 22:50
