
Failed to build TensorRT-LLM whisper Decoder #707


Open
muhammad-faizan-122 opened this issue Feb 14, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@muhammad-faizan-122

System Info

I was following this whisper-doc to run Whisper on Triton Inference Server with the TensorRT-LLM backend. I get the following error when running the command below to build the TensorRT-LLM engine for the decoder; the encoder builds fine.
System specs:

OS: Ubuntu 24
CPU: x86_64

GPU specs:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.6     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:03:00.0 Off |                  Off |
| 30%   30C    P8               8W / 300W |  23516MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Building the TensorRT-LLM engine for the decoder:

trtllm-build  --checkpoint_dir ${checkpoint_dir}/decoder \
            --output_dir ${output_dir}/decoder \
            --moe_plugin disable \
            --max_beam_width ${MAX_BEAM_WIDTH} \
            --max_batch_size ${MAX_BATCH_SIZE} \
            --max_seq_len 114 \
            --max_input_len 14 \
            --max_encoder_input_len 3000 \
            --gemm_plugin ${INFERENCE_PRECISION} \
            --bert_attention_plugin ${INFERENCE_PRECISION} \
            --gpt_attention_plugin ${INFERENCE_PRECISION}

Expected behavior

The trtllm-build command should produce the TensorRT-LLM decoder engine required during inference.

actual behavior

I got the following error:

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 627, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 425, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 390, in build_and_save
    engine = build_model(build_config,
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 360, in build_model
    model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/modeling_utils.py", line 653, in from_checkpoint
    model.load(weights, from_pruned=is_checkpoint_pruned)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/modeling_utils.py", line 675, in load
    raise RuntimeError(
RuntimeError: Required but not provided tensors:

additional notes

I used the nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 image and this script, convert_checkpoint.py, to convert the checkpoints.
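
To see which tensor names the converted checkpoint actually contains (and compare them against the names trtllm-build reports as missing), the rank file can be opened with safetensors. This is only a minimal debugging sketch; the rank0.safetensors path is an assumption based on the usual single-rank TensorRT-LLM checkpoint layout, so adjust it to your checkpoint_dir:

# Minimal sketch: list the tensor names inside a converted TensorRT-LLM checkpoint
# so they can be compared with the "Required but not provided tensors" list.
# The path below is hypothetical; point it at your own checkpoint_dir.
from safetensors import safe_open

ckpt_path = "tllm_checkpoint/decoder/rank0.safetensors"  # assumed layout: <checkpoint_dir>/decoder/rank0.safetensors

with safe_open(ckpt_path, framework="pt") as f:
    for name in sorted(f.keys()):
        print(name)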

muhammad-faizan-122 added the bug label Feb 14, 2025
@khoshsirat

I'm getting the same error for the encoder. I'm trying to convert the Whisper large-v3-turbo model:

wget --directory-prefix=assets https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt

INFERENCE_PRECISION=float16
WEIGHT_ONLY_PRECISION=int8
MAX_BEAM_WIDTH=4
MAX_BATCH_SIZE=32
checkpoint_dir=whisper_large_v3_turbo_weights_${WEIGHT_ONLY_PRECISION}
output_dir=whisper_large_v3_turbo_${WEIGHT_ONLY_PRECISION}

python convert_checkpoint.py \
                --model_dir assets \
                --model_name large-v3-turbo \
                --use_weight_only \
                --weight_only_precision $WEIGHT_ONLY_PRECISION \
                --output_dir $checkpoint_dir

trtllm-build  --checkpoint_dir ${checkpoint_dir}/encoder \
              --output_dir ${output_dir}/encoder \
              --moe_plugin disable \
              --max_batch_size ${MAX_BATCH_SIZE} \
              --gemm_plugin disable \
              --bert_attention_plugin ${INFERENCE_PRECISION} \
              --max_input_len 3000 --max_seq_len=3000

Error:

RuntimeError: Required but not provided tensors:{'encoder_layers.16.attention.dense.weight', 'encoder_layers.11.attention.dense.bias', 'encoder_layers.11.mlp.proj.bias', 'encoder_layers.29.attention.qkv.weight', 'encoder_layers.28.attention.dense.per_channel_scale', 'encoder_layers.13.mlp.fc.per_channel_scale', 'encoder_layers.26.mlp_layernorm.weight', 'encoder_layers.0.attention.dense.per_channel_scale', ...

@khoshsirat

So, the problem is that the convert_checkpoint.py file renames the weights with a different prefix. I have attached an updated version that has been tested with the Whisper large-v3-turbo model. (GitHub does not allow uploading .py files.)

convert_checkpoint.txt
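
For anyone who cannot use the attachment, the core of the fix is just renaming the keys the converter writes so they match the names the builder expects (e.g. encoder_layers.16.attention.dense.weight). Below is a minimal sketch of that idea, not the attached file itself; OLD_PREFIX is a placeholder, since the exact prefix depends on the convert_checkpoint.py version, so inspect the keys first and adjust:

# Sketch of the prefix fix: rewrite checkpoint keys in place so they match the
# tensor names trtllm-build expects. OLD_PREFIX is a placeholder; print the
# existing keys first and substitute whatever prefix your converter produced.
from safetensors.torch import load_file, save_file

path = "whisper_large_v3_turbo_weights_int8/encoder/rank0.safetensors"  # assumed layout
weights = load_file(path)

OLD_PREFIX = "model.encoder_layers."   # placeholder: the prefix actually present in the file
NEW_PREFIX = "encoder_layers."         # the prefix trtllm-build asks for

renamed = {
    (NEW_PREFIX + k[len(OLD_PREFIX):] if k.startswith(OLD_PREFIX) else k): v
    for k, v in weights.items()
}
save_file(renamed, path)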

@shawl336

> So, the problem is that the convert_checkpoint.py file renames the weights with a different prefix. I have attached an updated version that has been tested with the Whisper large-v3-turbo model. (GitHub does not allow uploading .py files.)
>
> convert_checkpoint.txt

Thanks for sharing the .py file; it solves that problem.
However, trtllm-build then raises a subsequent error:
"[TRT] [E] ITensor::getDimensions: Error Code 4: API Usage Error (WhisperEncoder/conv1/conv1d_L3471/CONVOLUTION_0: IConvolutionLayer input and kernel must be of same type. input type is Float but kernel is of type Half.)"
It seems related to the trtllm-build arguments.

Just wondering: did you pass trtllm-build the default arguments from the README.md?
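
Not a confirmed fix, but since the message says the convolution kernel is Half while its input is Float, one thing worth checking is the dtype the encoder conv weights ended up with in the converted checkpoint. A rough probe, assuming the usual rank0.safetensors layout (tensor names are matched generically because the exact names depend on the convert_checkpoint.py version used):

# Rough probe (not a verified fix): print the dtype of any conv-related weights
# in the converted encoder checkpoint, to see whether they were written as
# float16 while the network feeds the conv layer float32 mel features.
from safetensors import safe_open

path = "whisper_large_v3_turbo_weights_int8/encoder/rank0.safetensors"  # assumed layout

with safe_open(path, framework="pt") as f:
    for name in f.keys():
        if "conv" in name:
            print(name, f.get_tensor(name).dtype)

If those kernels come out as float16, casting just them to float32 before building, or rebuilding with a matching INFERENCE_PRECISION, might be worth trying, but I have not verified either.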
