# Contents
- [LLaMA-2](#llama-2)
  - [Prerequisites](#prerequisites)
  - [Exporting LLaMA-2](#exporting-llama-2)
    - [Examples of Exporting LLaMA-2](#examples-of-exporting-llama-2)
  - [Parity Checking LLaMA-2](#parity-checking-llama-2)
  - [Benchmark LLaMA-2](#benchmark-llama-2)
    - [Variants](#variants)
    - [Benchmark All](#benchmark-all)
    - [Benchmark E2E](#benchmark-e2e)
  - [E2E Inference with LLaMA-2](#e2e-inference-with-llama-2)
- [Mistral](#mistral)
  - [Exporting Mistral](#exporting-mistral)
  - [Optimizing and Quantizing Mistral](#optimizing-and-quantizing-mistral)
After building ONNX Runtime from source with CUDA enabled (`./build.sh --config Release --use_cuda --cuda_home /usr/local/cuda-12.2 ...`), export the LLaMA-2 70B model across 4 GPUs:
```
$ CUDA_VISIBLE_DEVICES=0,1,2,3 bash convert_70b_model.sh 4 -m meta-llama/Llama-2-70b-hf --output llama2-70b-distributed --precision fp16 --execution_provider cuda --use_gqa
```

## Parity Checking LLaMA-2

Here are some examples of how you can use the parity checker to verify your LLaMA-2 ONNX model.

1. Merged ONNX model, FP32 CPU
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.llama_parity \
    --model_name meta-llama/Llama-2-7b-hf \
    --onnx_model_path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --merged \
    --execution_provider cpu \
    --precision fp32 \
    --cache_dir ./model_cache
```

2. Merged ONNX model, FP32 CUDA
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.llama_parity \
    --model_name meta-llama/Llama-2-7b-hf \
    --onnx_model_path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --merged \
    --execution_provider cuda \
    --precision fp32 \
    --cache_dir ./model_cache
```

3. Merged ONNX model, FP16 CUDA
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.llama_parity \
    --model_name meta-llama/Llama-2-7b-hf \
    --onnx_model_path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp16.onnx \
    --merged \
    --execution_provider cuda \
    --precision fp16 \
    --cache_dir ./model_cache
```

4. Merged ONNX model, FP16 CUDA with GroupQueryAttention
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.llama_parity \
    --model_name meta-llama/Llama-2-7b-hf \
    --onnx_model_path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp16.onnx \
    --merged \
    --use_gqa \
    --execution_provider cuda \
    --precision fp16 \
    --cache_dir ./model_cache
```

## Benchmark LLaMA-2

Here are some examples of how you can benchmark LLaMA-2.

### Variants

1. PyTorch without `torch.compile`, FP32 (`hf-pt-eager`)
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type hf-pt-eager \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp32 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
```

2. PyTorch with `torch.compile`, FP16 (`hf-pt-compile`)
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type hf-pt-compile \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
```

3. Optimum + ONNX Runtime, FP32 (`hf-ort`)
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type hf-ort \
    --hf-ort-dir-path ./Llama-2-7b-hf-onnx/ \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp32 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
```

4. Optimum + ONNX Runtime, FP16 (`hf-ort`)
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type hf-ort \
    --hf-ort-dir-path ./Llama-2-7b-hf-onnx/ \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
```

5. ONNX Runtime with the Microsoft custom export, FP32 (`ort-msft`)
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type ort-msft \
    --ort-model-path ./llama-2-onnx/7B_float32/ONNX/LlamaV2_7B_float32.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp32 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
```

6. ONNX Runtime with the Microsoft custom export, FP16 (`ort-msft`)
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type ort-msft \
    --ort-model-path ./llama-2-onnx/7B_float16/ONNX/LlamaV2_7B_float16.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
```

7. ONNX Runtime with the model from `convert_to_onnx`, FP32 (`ort-convert-to-onnx`)
```
CUDA_VISIBLE_DEVICES=1 python3 -m models.llama.benchmark \
    --benchmark-type ort-convert-to-onnx \
    --ort-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp32 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
```

8. ONNX Runtime with the model from `convert_to_onnx`, FP16 (`ort-convert-to-onnx`)
```
CUDA_VISIBLE_DEVICES=4 python3 -m models.llama.benchmark \
    --benchmark-type ort-convert-to-onnx \
    --ort-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp16.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
```

9. ONNX Runtime with the model from `convert_to_onnx`, LLaMA-2 70B sharded across 4 GPUs, FP16
```
CUDA_VISIBLE_DEVICES=4,5,6,7 bash benchmark_70b_model.sh 4 \
    --benchmark-type ort-convert-to-onnx \
    --ort-model-path ./llama2-70b-dis/rank_{}_Llama-2-70b-hf_decoder_merged_model_fp16.onnx \
    --model-name meta-llama/Llama-2-70b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --device cuda \
    --warmup-runs 5 \
```

### Benchmark All

You can use `benchmark_all.py` to run multiple benchmark configurations in sequence, for example:
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_all \
    ...
    --ort-convert-to-onnx-model-path ./llama2-7b-fp16/Llama-2-7b-hf_decoder_merged_model_fp16.onnx \
    --ort-msft-model-path ./llama-2-onnx/7B_float16/ONNX/LlamaV2_7B_float16.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
    ...
    --timeout 60  # number of minutes before moving to the next benchmark
```

### Benchmark E2E

You can use `benchmark_e2e.py` to benchmark the full end-to-end scenario and automatically store the results in a CSV file. This tool uses `argmax` for sampling (i.e. greedy decoding) to standardize the benchmarking process.

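Because `argmax` always selects the single highest-scoring token, generation is deterministic, which keeps timings comparable across runs and configurations. The sketch below is illustrative only (it is not code from `benchmark_e2e.py`) and shows what that selection step looks like:
```
import numpy as np

def argmax_next_token(logits: np.ndarray) -> np.ndarray:
    # logits: (batch_size, vocab_size) scores for the next token position
    return np.argmax(logits, axis=-1)

# Toy example: batch of 2 sequences over a 5-token vocabulary
logits = np.array([[0.1, 2.3, -1.0, 0.7, 0.2],
                   [1.5, 0.0,  0.3, 3.1, -2.0]])
print(argmax_next_token(logits))  # -> [1 3], same result on every run
```
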
1. PyTorch without `torch.compile`, FP32
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_e2e \
    --benchmark-type pt-eager \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --prompts-file ./models/llama/prompts.json \
    --precision fp32 \
    --batch-sizes "1 2" \
    --prompt-lengths "16 64" \
    --device cpu \
    --auth
```

2. PyTorch with `torch.compile`, FP16
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_e2e \
    --benchmark-type pt-compile \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --prompts-file ./models/llama/prompts.json \
    --precision fp16 \
    --batch-sizes "1 2" \
    --prompt-lengths "16 64" \
    --device cuda \
    --auth
```

3. ONNX Runtime with `convert_to_onnx`, FP32
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_e2e \
    --benchmark-type ort \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --onnx-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --prompts-file ./models/llama/prompts.json \
    --precision fp32 \
    --batch-sizes "1 2" \
    --prompt-lengths "16 64" \
    --device cpu \
    --auth
```

4. ONNX Runtime with `convert_to_onnx`, FP16
```
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_e2e \
    --benchmark-type ort \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --onnx-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp16.onnx \
    --prompts-file ./models/llama/prompts.json \
    --precision fp16 \
    --batch-sizes "1 2" \
    --prompt-lengths "16 64" \
    --device cuda \
    --use_buffer_share \
    --auth
```

## E2E Inference with LLaMA-2

For end-to-end inference, please visit the [ONNX Runtime Inference Examples folder](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/python/models/llama) for a step-by-step walkthrough, code examples, and performance metrics.

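Before wiring up a full generation loop, a quick sanity check is to load the exported merged decoder with ONNX Runtime and list the inputs and outputs it expects. This is only a minimal sketch, not the linked walkthrough; the model path is assumed to be the FP32 export produced by the `convert_to_onnx` commands earlier in this README:
```
import onnxruntime as ort

# Path assumed from the convert_to_onnx export examples above
model_path = "./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx"
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

print("Inputs:")
for inp in session.get_inputs():
    print(f"  {inp.name}: shape={inp.shape}, type={inp.type}")

print("Outputs:")
for out in session.get_outputs():
    print(f"  {out.name}: shape={out.shape}, type={out.type}")
```
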
# Mistral

## Introduction