Config: H200, nvidia-modelopt v0.21.1, TensorRT-LLM v0.15, latency measured with [trtllm-bench](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md#for-non-gh200-systems-1).
Inference speedups are compared to the BF16 baseline. **Speedup is normalized to the GPU count**.
> Benchmark scenario: input tokens 2048, output tokens 128. Real performance may vary depending on the target use cases and the flags used to build the TensorRT-LLM engine.
> Memory savings are not reported here, as TensorRT-LLM occupies all remaining available GPU memory for KV caching.
> If GPU memory is the limiting factor, lower-bit quantization may deliver a better GPU-count-normalized throughput gain with a smaller tensor-parallel (TP) size.
Note that FP8 is typically the go-to choice for H100; 4-bit AWQ methods are recommended when GPU memory is a constraint.
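
In practice, picking between these formats amounts to selecting the corresponding ModelOpt quantization config before calibration. Below is a minimal sketch using the `modelopt.torch.quantization` PTQ API; the `model` and `calib_loader` objects are assumed user-supplied placeholders, not part of this README.

```python
import modelopt.torch.quantization as mtq

# Pick the quantization format based on the deployment constraint:
# FP8 is the typical choice on H100/H200; INT4 AWQ (or W4A8 AWQ) trades some
# accuracy for a much smaller weight footprint when GPU memory is tight.
quant_cfg = mtq.FP8_DEFAULT_CFG  # or mtq.INT4_AWQ_CFG / mtq.W4A8_AWQ_BETA_CFG

def forward_loop(model):
    # Run a small calibration set through the model so activation ranges
    # (and AWQ scales) can be collected. `calib_loader` is a placeholder here.
    for batch in calib_loader:
        model(batch)

# Calibrate and insert quantizers in place; the quantized model can then be
# exported to a TensorRT-LLM checkpoint for engine build and benchmarking.
model = mtq.quantize(model, quant_cfg, forward_loop)
```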
More benchmarks with earlier versions of Model Optimizer can be found in the [TensorRT-LLM README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md#benchmark).

| Model      | MMLU loss FP8 | MMLU loss INT4 AWQ |
|:----------:|:-------------:|:------------------:|
| Llama3-8B  | 0.46%         | 4.60%              |
| Llama3-70B | 0.51%         | 1.29%              |