
Commit 111b2da

Update LLM Perf benchmarks for 0.21
1 parent eee97da commit 111b2da

1 file changed: benchmark.md (+25 -19 lines)
@@ -8,29 +8,35 @@ performance** that can be delivered by Model Optimizer. All performance numbers

#### 1.1 Performance

-Config: H100, nvidia-modelopt v0.15.0, TensorRT-LLM v0.11, latency measured with full batch inference (no inflight batching).
-Memory saving and inference speedup are compared to the FP16 baseline. Speedup is normalized to the GPU count.
-
-| | | | FP8 | | | | INT4 AWQ | |
-|:----------:|:----------:|:----------:|:----------:|:-------:|:-:|:----------:|:----------:|:-------:|
-| Model | Batch Size | Mem Saving | Tokens/sec | Speedup | | Mem Saving | Tokens/sec | Speedup |
-| Llama3-8B | 1 | 1.63x | 175.42 | 1.26x | | 2.34x | 213.45 | 1.53x |
-| | 32 | 1.62x | 3399.84 | 1.49x | | 1.89x | 2546.12 | 1.11x |
-| | 64 | 1.58x | 3311.03 | 1.34x | | 1.97x | 3438.08 | 1.39x |
-| Llama3-70B | 1 | 1.96x | 32.85 | 1.87x | | 3.47x | 47.49 | 2.70x |
-| | 32 | 1.93x | 462.69 | 1.82x | | 2.62x | 365.06 | 1.44x |
-| | 64 | 1.99x | 449.09 | 1.91x | | 2.90x | 483.51 | 2.05x |
+Config: H200, nvidia-modelopt v0.21.1, TensorRT-LLM v0.15, latency measured with [trtllm-bench](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md#for-non-gh200-systems-1).
+Inference speedup is compared to the BF16 baseline. **Speedup is normalized to the GPU count**.
+
+> Benchmark scenario: 2048 input tokens, 128 output tokens. Actual performance may vary depending on the target use case and the flags used to build the TensorRT-LLM engine.
+
+> Memory saving is not reported here because TensorRT-LLM occupies all the remaining available GPU memory for KV caching.
+
+> If GPU memory is the limiting factor, lower-bit quantization may achieve a better GPU-count-normalized throughput gain with a smaller TP size.
+
+| | | BF16 (8B:TP1, 70B:TP2) | | FP8 (TP1) | | | INT4 AWQ (TP1) | | | W4A8 AWQ (TP1) | |
+|:------------:|:----------:|:----------------------:|:-:|:------------:|:-------:|:-:|:------------:|:-------:|:-:|:------------:|:-------:|
+| Model | Batch Size | Tokens/sec | | Tokens/sec | Speedup | | Tokens/sec | Speedup | | Tokens/sec | Speedup |
+| Llama3.1-8B | 1 | 173.80 | | 245.03 | 1.41x | | 231.75 | 1.33x | | 239.70 | 1.38x |
+| | 8 | 803.11 | | 1,051.17 | 1.31x | | 599.72 | 0.75x | | 801.72 | 1.00x |
+| | 64 | 1,679.74 | | 2,190.93 | 1.30x | | 1,392.78 | 0.83x | | 1,930.86 | 1.15x |
+| Llama3.1-70B | 1 | 45.81 | | 43.46 | 1.90x | | 44.10 | 1.93x | | 46.31 | 2.02x |
+| | 8 | 182.61 | | 182.07 | 1.99x | | 93.98 | 1.03x | | 140.02 | 1.53x |
+| | 64 | 401.50 | | 420.64 | 2.10x | | 176.68 | 0.88x | | 345.43 | 1.72x |
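
As a worked example of the GPU-count normalization noted above, take Llama3.1-70B at batch size 1: the BF16 baseline runs on two GPUs (TP2) while FP8 runs on one (TP1), so throughput is compared per GPU:

$$
\text{Speedup}_{\text{FP8}} = \frac{43.46 / 1\ \text{GPU}}{45.81 / 2\ \text{GPUs}} = \frac{43.46}{22.91} \approx 1.90\times
$$

This is why FP8's 43.46 total tokens/sec is still reported as a 1.90x gain over BF16's 45.81 total tokens/sec.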

### 1.2 Accuracy

-The table below shows the MMLU loss in percentage compared to FP16 baseline.
-Config: H100, nvidia-modelopt v0.11.0, TenorR-LLM v0.9.
-Note that typically FP8 or INT4 AWQ is the go-to choices for H100.
+The table below shows the MMLU loss in percentage compared to the BF16 baseline.
+Config: H100, nvidia-modelopt v0.21.1, TensorRT-LLM v0.15.
+Note that FP8 is typically the go-to choice for H100. 4-bit AWQ methods are recommended when GPU memory is a constraint.
More benchmarks with earlier versions of Model Optimizer can be found in this [TensorRT-LLM README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md#benchmark).
-| Model \\ MMLU loss | FP8 | INT4 AWQ |
-|:----------:|:-------------:|:--------:|
-| Llama3-8B | 0.46% | 4.60% |
-| Llama3-70b | 0.51% | 1.29% |
+| Model | MMLU loss FP8 | MMLU loss INT4 AWQ | MMLU loss W4A8 AWQ |
+|:-----------------------:|:-------------:|:------------------:|:------------------:|
+| Llama3.1-8B (instruct) | 1.50% | 5.66% | 6.00% |
+| Llama3.1-70B (instruct) | 0.38% | 1.07% | 1.20% |

## 2. PTQ for Stable Diffusion

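For context on how the quantized checkpoints behind these numbers are typically produced, the sketch below shows post-training quantization with nvidia-modelopt ahead of the TensorRT-LLM engine build. It is a minimal sketch, assuming the modelopt.torch.quantization API (mtq.quantize with preset configs such as FP8_DEFAULT_CFG, INT4_AWQ_CFG, W4A8_AWQ_BETA_CFG); the Hugging Face model id and calibration prompts are illustrative assumptions, not part of this commit, and should be checked against the installed modelopt v0.21.x.

```python
# Hedged sketch (not from this commit): quantizing a model with nvidia-modelopt
# before exporting it to a TensorRT-LLM checkpoint and benchmarking with trtllm-bench.
# Assumed: modelopt's mtq.quantize API and preset config names, and the HF model id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model id, for illustration only

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# One of the preset quantization formats benchmarked above; verify the exact
# config names against the installed modelopt version.
quant_cfg = mtq.FP8_DEFAULT_CFG  # or mtq.INT4_AWQ_CFG / mtq.W4A8_AWQ_BETA_CFG

def calibrate(m):
    """Tiny calibration loop; real runs use a few hundred representative samples."""
    prompts = ["The capital of France is", "TensorRT-LLM accelerates inference by"]
    for text in prompts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        with torch.no_grad():
            m(**inputs)

# Insert quantizers and run calibration; the resulting model is then exported to a
# TensorRT-LLM checkpoint, built into an engine, and measured against the BF16 baseline.
model = mtq.quantize(model, quant_cfg, forward_loop=calibrate)
```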