
Failed to build TensorRT-LLM whisper Decoder #707


Open
muhammad-faizan-122 opened this issue Feb 14, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@muhammad-faizan-122

System Info

I was following this whisper-doc to run Whisper on Triton Inference Server with the TensorRT-LLM backend. I get the following error when running the command below to build the TensorRT-LLM engine for the decoder; the encoder builds fine.
System specs:

OS: Ubuntu 24
CPU: x86_64

GPU specs:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.6     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:03:00.0 Off |                  Off |
| 30%   30C    P8               8W / 300W |  23516MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Building the TensorRT-LLM engine for the decoder:

trtllm-build  --checkpoint_dir ${checkpoint_dir}/decoder \
            --output_dir ${output_dir}/decoder \
            --moe_plugin disable \
            --max_beam_width ${MAX_BEAM_WIDTH} \
            --max_batch_size ${MAX_BATCH_SIZE} \
            --max_seq_len 114 \
            --max_input_len 14 \
            --max_encoder_input_len 3000 \
            --gemm_plugin ${INFERENCE_PRECISION} \
            --bert_attention_plugin ${INFERENCE_PRECISION} \
            --gpt_attention_plugin ${INFERENCE_PRECISION}

Expected behavior

The trtllm-build command should produce the TensorRT-LLM decoder engine required during inference.

actual behavior

I got the following error:

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 627, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 425, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 390, in build_and_save
    engine = build_model(build_config,
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 360, in build_model
    model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/modeling_utils.py", line 653, in from_checkpoint
    model.load(weights, from_pruned=is_checkpoint_pruned)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/modeling_utils.py", line 675, in load
    raise RuntimeError(
RuntimeError: Required but not provided tensors:

additional notes

I used the nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 image and this script, convert_checkpoint.py, to convert the checkpoints.
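
To see which tensor names the converted checkpoint actually contains (and compare them against the names trtllm-build reports as missing), the rank file can be opened with safetensors. This is only a minimal debugging sketch; the rank0.safetensors path is an assumption based on the usual single-rank TensorRT-LLM checkpoint layout, so adjust it to your checkpoint_dir:

# Minimal sketch: list the tensor names inside a converted TensorRT-LLM checkpoint
# so they can be compared with the "Required but not provided tensors" list.
# The path below is hypothetical; point it at your own checkpoint_dir.
from safetensors import safe_open

ckpt_path = "tllm_checkpoint/decoder/rank0.safetensors"  # assumed layout: <checkpoint_dir>/decoder/rank0.safetensors

with safe_open(ckpt_path, framework="pt") as f:
    for name in sorted(f.keys()):
        print(name)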

muhammad-faizan-122 added the bug label Feb 14, 2025
@khoshsirat

I'm getting the same error for the encoder. I'm trying to convert the Whisper large-v3-turbo model:

wget --directory-prefix=assets https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt

INFERENCE_PRECISION=float16
WEIGHT_ONLY_PRECISION=int8
MAX_BEAM_WIDTH=4
MAX_BATCH_SIZE=32
checkpoint_dir=whisper_large_v3_turbo_weights_${WEIGHT_ONLY_PRECISION}
output_dir=whisper_large_v3_turbo_${WEIGHT_ONLY_PRECISION}

python convert_checkpoint.py \
                --model_dir assets \
                --model_name large-v3-turbo \
                --use_weight_only \
                --weight_only_precision $WEIGHT_ONLY_PRECISION \
                --output_dir $checkpoint_dir

trtllm-build  --checkpoint_dir ${checkpoint_dir}/encoder \
              --output_dir ${output_dir}/encoder \
              --moe_plugin disable \
              --max_batch_size ${MAX_BATCH_SIZE} \
              --gemm_plugin disable \
              --bert_attention_plugin ${INFERENCE_PRECISION} \
              --max_input_len 3000 --max_seq_len=3000

Error:

RuntimeError: Required but not provided tensors:{'encoder_layers.16.attention.dense.weight', 'encoder_layers.11.attention.dense.bias', 'encoder_layers.11.mlp.proj.bias', 'encoder_layers.29.attention.qkv.weight', 'encoder_layers.28.attention.dense.per_channel_scale', 'encoder_layers.13.mlp.fc.per_channel_scale', 'encoder_layers.26.mlp_layernorm.weight', 'encoder_layers.0.attention.dense.per_channel_scale', ...

@khoshsirat

So, the problem is that the convert_checkpoint.py file renames the weights with a different prefix. I have attached an updated version that has been tested with the Whisper large-v3-turbo model. (GitHub does not allow uploading .py files.)

convert_checkpoint.txt
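
For anyone who cannot use the attachment, the core of the fix is just renaming the keys the converter writes so they match the names the builder expects (e.g. encoder_layers.16.attention.dense.weight). Below is a minimal sketch of that idea, not the attached file itself; OLD_PREFIX is a placeholder, since the exact prefix depends on the convert_checkpoint.py version, so inspect the keys first and adjust:

# Sketch of the prefix fix: rewrite checkpoint keys in place so they match the
# tensor names trtllm-build expects. OLD_PREFIX is a placeholder; print the
# existing keys first and substitute whatever prefix your converter produced.
from safetensors.torch import load_file, save_file

path = "whisper_large_v3_turbo_weights_int8/encoder/rank0.safetensors"  # assumed layout
weights = load_file(path)

OLD_PREFIX = "model.encoder_layers."   # placeholder: the prefix actually present in the file
NEW_PREFIX = "encoder_layers."         # the prefix trtllm-build asks for

renamed = {
    (NEW_PREFIX + k[len(OLD_PREFIX):] if k.startswith(OLD_PREFIX) else k): v
    for k, v in weights.items()
}
save_file(renamed, path)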

@shawl336

> So, the problem is that the convert_checkpoint.py file renames the weights with a different prefix. I have attached an updated version that has been tested with the Whisper large-v3-turbo model. (GitHub does not allow uploading .py files.)
>
> convert_checkpoint.txt

Thanks for sharing the .py file; it solves that problem.
However, trtllm-build then raises a subsequent error:
"[TRT] [E] ITensor::getDimensions: Error Code 4: API Usage Error (WhisperEncoder/conv1/conv1d_L3471/CONVOLUTION_0: IConvolutionLayer input and kernel must be of same type. input type is Float but kernel is of type Half.)"
It seems related to the trtllm-build arguments.

Just wondering: did you pass trtllm-build the default arguments from the README.md?
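
Not a confirmed fix, but since the message says the convolution kernel is Half while its input is Float, one thing worth checking is the dtype the encoder conv weights ended up with in the converted checkpoint. A rough probe, assuming the usual rank0.safetensors layout (tensor names are matched generically because the exact names depend on the convert_checkpoint.py version used):

# Rough probe (not a verified fix): print the dtype of any conv-related weights
# in the converted encoder checkpoint, to see whether they were written as
# float16 while the network feeds the conv layer float32 mel features.
from safetensors import safe_open

path = "whisper_large_v3_turbo_weights_int8/encoder/rank0.safetensors"  # assumed layout

with safe_open(path, framework="pt") as f:
    for name in f.keys():
        if "conv" in name:
            print(name, f.get_tensor(name).dtype)

If those kernels come out as float16, casting just them to float32 before building, or rebuilding with a matching INFERENCE_PRECISION, might be worth trying, but I have not verified either.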
