
Commit 25d5f16

kunal-vaishnavi authored and Ted Themistokleous committed
Add LLaMA end-to-end benchmarking (microsoft#19985)
### Description

This PR adds a benchmarking script to measure end-to-end performance and saves the results in a CSV file.

### Motivation and Context

With this PR, end-to-end performance can be easily measured for many large-language models such as LLaMA-2. The performance numbers for LLaMA-2 are located [here](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/python/models/llama).
1 parent 166204d commit 25d5f16

11 files changed (+957 −23 lines)

onnxruntime/python/tools/transformers/models/llama/README.md

+132
@@ -1,7 +1,14 @@
# Contents
- [LLaMA-2](#llama-2)
  - [Prerequisites](#prerequisites)
  - [Exporting LLaMA-2](#exporting-llama-2)
    - [Examples of Exporting LLaMA-2](#examples-of-exporting-llama-2)
  - [Parity Checking LLaMA-2](#parity-checking-llama-2)
  - [Benchmarking LLaMA-2](#benchmark-llama-2)
    - [Variants](#variants)
    - [Benchmark All](#benchmark-all)
    - [Benchmark E2E](#benchmark-e2e)
  - [E2E Inference with LLaMA-2](#e2e-inference-with-llama-2)
- [Mistral](#mistral)
  - [Exporting Mistral](#exporting-mistral)
  - [Optimizing and Quantizing Mistral](#optimizing-and-quantizing-mistral)
@@ -229,6 +236,55 @@ $ ./build.sh --config Release --use_cuda --cuda_home /usr/local/cuda-12.2 --cudn
$ CUDA_VISIBLE_DEVICES=0,1,2,3 bash convert_70b_model.sh 4 -m meta-llama/Llama-2-70b-hf --output llama2-70b-distributed --precision fp16 --execution_provider cuda --use_gqa
```

## Parity Checking LLaMA-2

Here are some examples of how you can use the parity checker to verify your LLaMA-2 ONNX model.

1. Merged ONNX model, FP32 CPU
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.llama_parity \
    --model_name meta-llama/Llama-2-7b-hf \
    --onnx_model_path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --merged \
    --execution_provider cpu \
    --precision fp32 \
    --cache_dir ./model_cache
```

2. Merged ONNX model, FP32 CUDA
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.llama_parity \
    --model_name meta-llama/Llama-2-7b-hf \
    --onnx_model_path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --merged \
    --execution_provider cuda \
    --precision fp32 \
    --cache_dir ./model_cache
```

3. Merged ONNX model, FP16 CUDA
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.llama_parity \
    --model_name meta-llama/Llama-2-7b-hf \
    --onnx_model_path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --merged \
    --execution_provider cuda \
    --precision fp16 \
    --cache_dir ./model_cache
```

4. Merged ONNX model, FP16 CUDA with GroupQueryAttention
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.llama_parity \
    --model_name meta-llama/Llama-2-7b-hf \
    --onnx_model_path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --merged \
    --use_gqa \
    --execution_provider cuda \
    --precision fp16 \
    --cache_dir ./model_cache
```
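Conceptually, a parity check runs the original PyTorch model and the exported ONNX model on the same inputs and then compares the logits within a tolerance. The snippet below is only a minimal, hypothetical illustration of that pass/fail criterion; it is not `llama_parity.py`, and the dummy arrays stand in for real model outputs.

```python
# Illustrative parity criterion (not the llama_parity implementation):
# compare logits from two runs and check that they agree within a tolerance.
import numpy as np


def check_parity(pt_logits: np.ndarray, ort_logits: np.ndarray, atol: float = 1e-3) -> bool:
    """Return True when the ONNX Runtime logits match the PyTorch logits."""
    max_diff = np.abs(pt_logits - ort_logits).max()
    same_top_token = (pt_logits.argmax(axis=-1) == ort_logits.argmax(axis=-1)).all()
    print(f"max abs diff: {max_diff:.6f}, top-1 tokens match: {same_top_token}")
    return np.allclose(pt_logits, ort_logits, atol=atol)


# Example with dummy data standing in for real (batch, sequence, vocab) logits
rng = np.random.default_rng(0)
pt = rng.standard_normal((1, 8, 32000)).astype(np.float32)
ort = pt + rng.normal(scale=1e-5, size=pt.shape).astype(np.float32)
print("parity:", check_parity(pt, ort))
```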
## Benchmark LLaMA-2

Here are some examples of how you can benchmark LLaMA-2.

@@ -240,6 +296,7 @@ Here are some examples of how you can benchmark LLaMA-2.
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type hf-pt-eager \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp32 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \

@@ -252,6 +309,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type hf-pt-compile \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \

@@ -265,6 +323,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type hf-ort \
    --hf-ort-dir-path ./Llama-2-7b-hf-onnx/ \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp32 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \

@@ -278,6 +337,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type hf-ort \
    --hf-ort-dir-path ./Llama-2-7b-hf-onnx/ \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \

@@ -291,6 +351,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type ort-msft \
    --ort-model-path ./llama-2-onnx/7B_float32/ONNX/LlamaV2_7B_float32.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp32 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \

@@ -303,6 +364,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type ort-msft \
    --ort-model-path ./llama-2-onnx/7B_float16/ONNX/LlamaV2_7B_float16.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \

@@ -315,6 +377,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m models.llama.benchmark \
    --benchmark-type ort-convert-to-onnx \
    --ort-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp32 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \

@@ -327,6 +390,7 @@ CUDA_VISIBLE_DEVICES=4 python3 -m models.llama.benchmark \
    --benchmark-type ort-convert-to-onnx \
    --ort-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp16.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \

@@ -339,6 +403,7 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 bash benchmark_70b_model.sh 4 \
    --benchmark-type ort-convert-to-onnx \
    --ort-model-path ./llama2-70b-dis/rank_{}_Llama-2-70b-hf_decoder_merged_model_fp16.onnx \
    --model-name meta-llama/Llama-2-70b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --device cuda \
    --warmup-runs 5 \

@@ -357,6 +422,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_all \
    --ort-convert-to-onnx-model-path ./llama2-7b-fp16/Llama-2-7b-hf_decoder_merged_model_fp16.onnx \
    --ort-msft-model-path ./llama-2-onnx/7B_float16/ONNX/LlamaV2_7B_float16.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
@@ -366,6 +432,72 @@ CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_all \
    --timeout 60 # number of minutes before moving to the next benchmark
```

### Benchmark E2E

You can use `benchmark_e2e.py` to benchmark the full end-to-end scenario and automatically store the results in a CSV file. This tool uses `argmax` for sampling to standardize the benchmarking process.

1. PyTorch without `torch.compile`, FP32
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_e2e \
    --benchmark-type pt-eager \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --prompts-file ./models/llama/prompts.json \
    --precision fp32 \
    --batch-sizes "1 2" \
    --prompt-lengths "16 64" \
    --device cpu \
    --auth
```

2. PyTorch with `torch.compile`, FP16
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_e2e \
    --benchmark-type pt-compile \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --prompts-file ./models/llama/prompts.json \
    --precision fp16 \
    --batch-sizes "1 2" \
    --prompt-lengths "16 64" \
    --device cuda \
    --auth
```

3. ONNX Runtime with `convert_to_onnx`, FP32
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_e2e \
    --benchmark-type ort \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --onnx-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --prompts-file ./models/llama/prompts.json \
    --precision fp32 \
    --batch-sizes "1 2" \
    --prompt-lengths "16 64" \
    --device cpu \
    --auth
```

4. ONNX Runtime with `convert_to_onnx`, FP16
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_e2e \
    --benchmark-type ort \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --onnx-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --prompts-file ./models/llama/prompts.json \
    --precision fp16 \
    --batch-sizes "1 2" \
    --prompt-lengths "16 64" \
    --device cuda \
    --use_buffer_share \
    --auth
```
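Here, `argmax` sampling means greedy decoding: at every step the token with the highest logit is selected, so the generated sequence, and therefore the measured work, is deterministic across runs. The following is a minimal, hypothetical PyTorch sketch of such an end-to-end loop with a wall-clock timer. It is only an illustration of the measured scenario, not `benchmark_e2e.py` itself; the model id, prompt, and token count are placeholders, and the loop skips KV-cache reuse for brevity.

```python
# Illustrative greedy (argmax) decoding loop with simple end-to-end timing.
# Not benchmark_e2e.py; assumes access to the gated Llama-2 checkpoint.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir="./model_cache")
model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir="./model_cache").eval()

input_ids = tokenizer("What is ONNX Runtime?", return_tensors="pt").input_ids
start = time.perf_counter()
with torch.no_grad():
    for _ in range(32):  # number of new tokens to generate
        logits = model(input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy step
        input_ids = torch.cat([input_ids, next_token], dim=-1)
end = time.perf_counter()

print(f"end-to-end latency for 32 new tokens: {end - start:.2f} s")
print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```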

## E2E Inference with LLaMA-2

For end-to-end inference, please visit the [ONNX Runtime Inference Examples folder](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/python/models/llama) for a step-by-step walkthrough, code examples, and performance metrics.

# Mistral

## Introduction

onnxruntime/python/tools/transformers/models/llama/benchmark.py

+20 −18
@@ -1,3 +1,8 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License. See License.txt in the project root for
+# license information.
+# --------------------------------------------------------------------------
import argparse
import datetime
import gc
@@ -14,11 +19,12 @@
from benchmark_helper import measure_memory, setup_logger
from dist_settings import get_rank, get_size
from llama_inputs import (
-    add_io_bindings,
+    add_io_bindings_as_ortvalues,
    get_merged_sample_with_past_kv_inputs,
    get_msft_sample_inputs,
    get_sample_inputs,
    get_sample_with_past_kv_inputs,
+    verify_ort_inputs,
)
from optimum.onnxruntime import ORTModelForCausalLM
from torch.profiler import ProfilerActivity, profile, record_function
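As its name suggests, the renamed `add_io_bindings_as_ortvalues` helper attaches inputs and the KV cache to the session as `OrtValue`s that stay on the target device, using ONNX Runtime's I/O binding API, rather than copying through NumPy on every run. Below is a rough sketch of that general pattern with the public ORT Python API; it is an assumption-laden illustration, not the actual implementation in `llama_inputs.py`, and the function name and signature are made up for the example.

```python
# Hedged sketch of device-side I/O binding with OrtValues
# (not llama_inputs.add_io_bindings_as_ortvalues).
import onnxruntime as ort


def bind_inputs_as_ortvalues(session, np_inputs, device="cuda", device_id=0):
    """Bind NumPy inputs as device OrtValues and bind outputs on the same device."""
    io_binding = session.io_binding()
    ortvalues = {}
    for name, array in np_inputs.items():
        value = ort.OrtValue.ortvalue_from_numpy(array, device, device_id)
        ortvalues[name] = value  # kept around so buffers (e.g. KV cache) can be reused
        io_binding.bind_ortvalue_input(name, value)
    for output in session.get_outputs():
        io_binding.bind_output(output.name, device, device_id)
    return io_binding, ortvalues


# Usage: session.run_with_iobinding(io_binding); io_binding.copy_outputs_to_cpu()
```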
@@ -199,6 +205,7 @@ def get_model(args: argparse.Namespace):
            torch_dtype=torch.float16 if args.use_fp16 else torch.float32,
            use_auth_token=args.auth,
            use_cache=True,
+            cache_dir=args.cache_dir,
        ).to(args.target_device)
        end_time = time.time()

@@ -444,24 +451,12 @@ def get_logits(inputs):

def run_ort_inference(args, init_inputs, iter_inputs, model):
    def prepare_ort_inputs(inputs, kv_cache_ortvalues):
-        # Check that all model inputs will be provided
-        model_inputs = set(map(lambda model_input: model_input.name, model.get_inputs()))
-        user_inputs = set(inputs.keys())
-        missing_inputs = model_inputs - user_inputs
-        if len(missing_inputs):
-            logger.error(f"The following model inputs are missing: {missing_inputs}")
-            raise Exception("There are missing inputs to the model. Please add them and try again.")
-
-        # Remove unnecessary inputs from model inputs
-        unnecessary_inputs = user_inputs - model_inputs
-        if len(unnecessary_inputs):
-            for unnecessary_input in unnecessary_inputs:
-                logger.info(f"Removing unnecessary input '{unnecessary_input}' from user provided inputs")
-                del inputs[unnecessary_input]
+        # Verify model inputs
+        inputs = verify_ort_inputs(model, inputs)

        # Add IO bindings for non-CPU execution providers
        if args.device != "cpu":
-            io_binding, kv_cache_ortvalues = add_io_bindings(
+            io_binding, kv_cache_ortvalues = add_io_bindings_as_ortvalues(
                model, inputs, args.device, int(args.rank), args.use_gqa, kv_cache_ortvalues
            )
            setattr(args, "io_binding", io_binding)  # noqa: B010
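The inline checks deleted above are now centralized in the `verify_ort_inputs` helper imported from `llama_inputs`. Judging from the logic it replaces, the helper behaves roughly like the sketch below (a paraphrase of the removed code, not the exact source).

```python
def verify_ort_inputs(model, ort_inputs):
    """Ensure every model input is provided and drop extras the model does not expect."""
    model_inputs = {model_input.name for model_input in model.get_inputs()}
    user_inputs = set(ort_inputs.keys())

    # Fail loudly when a required model input was not supplied
    missing_inputs = model_inputs - user_inputs
    if missing_inputs:
        raise Exception(f"The following model inputs are missing: {missing_inputs}")

    # Remove user-provided inputs the model does not declare
    for unnecessary_input in user_inputs - model_inputs:
        del ort_inputs[unnecessary_input]

    return ort_inputs
```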
@@ -612,6 +607,13 @@ def get_args(rank=0):
    parser.add_argument("--pt-num-rows", type=int, default=1000, help="Number of rows for PyTorch profiler to display")
    parser.add_argument("--verbose", default=False, action="store_true")
    parser.add_argument("--log-folder", type=str, default=os.path.join("."), help="Folder to cache log files")
+    parser.add_argument(
+        "--cache-dir",
+        type=str,
+        required=True,
+        default="./model_cache",
+        help="Cache dir where Hugging Face files are stored",
+    )

    args = parser.parse_args()

@@ -662,8 +664,8 @@ def main():

    args.rank = rank
    args.world_size = world_size
-    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
-    config = AutoConfig.from_pretrained(args.model_name)
+    tokenizer = AutoTokenizer.from_pretrained(args.model_name, cache_dir=args.cache_dir)
+    config = AutoConfig.from_pretrained(args.model_name, cache_dir=args.cache_dir)
    target_device = f"cuda:{args.rank}" if args.device != "cpu" else args.device
    use_fp16 = args.precision == "fp16"

onnxruntime/python/tools/transformers/models/llama/benchmark_all.py

+22
@@ -1,3 +1,8 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License. See License.txt in the project root for
+# license information.
+# --------------------------------------------------------------------------
import argparse
import datetime
import json
@@ -78,6 +83,13 @@ def get_args():
        help="Path to ONNX model from convert_to_onnx",
    )

+    parser.add_argument(
+        "--cache-dir",
+        type=str,
+        default="./model_cache",
+        help="Cache dir where Hugging Face files are stored",
+    )
+
    parser.add_argument(
        "--model-name",
        type=str,
@@ -332,6 +344,8 @@ def main():
        str(args.num_runs),
        "--log-folder",
        args.log_folder,
+        "--cache-dir",
+        args.cache_dir,
        "--auth",
    ]
    logger.info("Benchmark PyTorch without torch.compile")

@@ -362,6 +376,8 @@ def main():
        str(args.num_runs),
        "--log-folder",
        args.log_folder,
+        "--cache-dir",
+        args.cache_dir,
        "--auth",
    ]
    logger.info("Benchmark PyTorch with torch.compile")

@@ -394,6 +410,8 @@ def main():
        str(args.num_runs),
        "--log-folder",
        args.log_folder,
+        "--cache-dir",
+        args.cache_dir,
        "--auth",
    ]
    logger.info("Benchmark Optimum + ONNX Runtime")

@@ -426,6 +444,8 @@ def main():
        str(args.num_runs),
        "--log-folder",
        args.log_folder,
+        "--cache-dir",
+        args.cache_dir,
    ]
    logger.info("Benchmark Microsoft model in ONNX Runtime")
    results = benchmark(args, benchmark_cmd, "ort-msft")

@@ -457,6 +477,8 @@ def main():
        str(args.num_runs),
        "--log-folder",
        args.log_folder,
+        "--cache-dir",
+        args.cache_dir,
    ]
    logger.info("Benchmark convert_to_onnx model in ONNX Runtime")
    results = benchmark(args, benchmark_cmd, "onnxruntime")
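Each of these hunks appends the new `--cache-dir` flag to a command list that the script's `benchmark()` helper then executes, presumably as a subprocess given the command-list form. As a rough illustration of that pattern (not the actual helper, whose logging and result parsing are more involved), a command list like the ones above could be launched and captured as follows; the function name and log path here are hypothetical.

```python
# Hypothetical sketch of launching one benchmark variant as a subprocess
# (illustrative only; benchmark_all.py's real benchmark() helper differs).
import subprocess


def run_benchmark_cmd(benchmark_cmd, log_path):
    """Run one benchmark variant and save its combined stdout/stderr to a log file."""
    with open(log_path, "w") as log_file:
        process = subprocess.run(
            benchmark_cmd, stdout=log_file, stderr=subprocess.STDOUT, check=False
        )
    return process.returncode


# Example: the PyTorch eager variant with the new --cache-dir flag
cmd = [
    "python", "-m", "models.llama.benchmark",
    "--benchmark-type", "hf-pt-eager",
    "--model-name", "meta-llama/Llama-2-7b-hf",
    "--cache-dir", "./model_cache",
    "--precision", "fp16",
]
print(run_benchmark_cmd(cmd, "./hf-pt-eager.log"))
```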
