This folder contains a Jupyter notebook that demonstrates how to export, optimize, and run the LLaMA-2 model with ONNX Runtime. For more details, please see the notebook and the ORT README.
LLaMA-2 7B FP16 CUDA (1 A100 80GB)

| Engine | Batch Size | Prompt Length | Prompt Processing Latency (ms) | Prompt Processing Throughput (tps) | Average Latency of First 128 Tokens Generated (ms) | Average Throughput of First 128 Tokens Generated (tps) | Average Latency of First 256 Tokens Generated (ms) | Average Throughput of First 256 Tokens Generated (tps) | Wall-Clock Latency (s) | Wall-Clock Throughput (tps) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| onnxruntime | 1 | 16 | 11.967659 | 1336.936489 | 10.52479073 | 95.01376562 | 10.54278947 | 94.85155731 | 3.08197999 | 88.25495327 |
| onnxruntime | 1 | 64 | 12.41350174 | 5155.676564 | 10.51662862 | 95.08750721 | 10.55776421 | 94.71702343 | 3.122560978 | 104.3349877 |
| onnxruntime | 1 | 256 | 22.4044323 | 11426.3105 | 10.7493531 | 93.02885402 | 10.78576129 | 92.71482774 | 3.139767647 | 163.0693916 |
| onnxruntime | 1 | 1024 | 75.05702972 | 13642.95928 | 11.31167263 | 88.40425575 | 11.34056505 | 88.17902774 | 3.332163334 | 384.1348313 |
| onnxruntime | 1 | 2048 | 135.2889538 | 15137.96909 | 12.08372787 | 82.75591863 | 12.11640146 | 82.53275559 | 3.582954168 | 643.044787 |
| onnxruntime | 1 | 3840 | 251.5854836 | 15263.20178 | 13.44519481 | 74.37601419 | 13.48242071 | 74.17065688 | 4.047522068 | 1011.977188 |
| onnxruntime | 4 | 16 | 12.75753975 | 5016.641238 | 10.92023589 | 366.2924539 | 10.99625602 | 363.7601736 | 3.188626289 | 341.2127673 |
| onnxruntime | 4 | 64 | 22.7124691 | 11271.3417 | 11.15260646 | 358.6605531 | 11.19375136 | 357.3422236 | 3.256895304 | 393.0123264 |
| onnxruntime | 4 | 256 | 73.77910614 | 13879.26818 | 11.26689278 | 355.0224609 | 11.35130133 | 352.3825051 | 3.345386028 | 612.186451 |
| onnxruntime | 4 | 1024 | 250.616312 | 16343.7087 | 12.52830587 | 319.2770068 | 12.6034962 | 317.3722542 | 3.847688437 | 1330.669072 |
| onnxruntime | 4 | 2048 | 506.0505867 | 16188.10494 | 14.06471804 | 284.3995869 | 14.14138451 | 282.8577355 | 4.497682095 | 2049.055448 |
| onnxruntime | 4 | 3840 | 978.5776138 | 15696.25115 | 16.76318049 | 238.6182026 | 16.83990005 | 237.531101 | 5.664571524 | 2892.363514 |
| onnxruntime | 16 | 16 | 21.32916451 | 12002.34542 | 11.54885069 | 1385.419245 | 11.97430678 | 1336.194261 | 3.479871035 | 1250.621059 |
| onnxruntime | 16 | 64 | 73.28677177 | 13972.50793 | 11.71741821 | 1365.488516 | 12.04443816 | 1328.413977 | 3.52155304 | 1453.903986 |
| onnxruntime | 16 | 256 | 248.3313084 | 16494.09423 | 12.81819306 | 1248.225855 | 13.13442457 | 1218.172894 | 3.978744745 | 2058.940828 |
| onnxruntime | 16 | 1024 | 975.6298065 | 16793.25487 | 16.74189232 | 955.6864715 | 17.06122886 | 937.7988026 | 5.703416586 | 3590.830109 |
| onnxruntime | 16 | 2048 | 1993.696928 | 16435.79801 | 22.16357179 | 721.9053026 | 22.49017637 | 711.4217219 | 8.114635229 | 4542.902911 |
| onnxruntime | 16 | 3840 | 3924.712181 | 15654.65113 | 31.63040616 | 505.8423822 | 31.95275087 | 500.7393594 | 12.46947217 | 5255.715648 |
| pytorch-eager | 1 | 16 | 32.97473 | 485.2201 | 31.95276 | 31.2962 | 31.8423 | 31.40477 | 8.28506 | 32.83018 |
| pytorch-eager | 1 | 64 | 32.63447 | 1961.117 | 31.33203 | 31.91622 | 31.36941 | 31.87819 | 8.164876 | 39.19227 |
| pytorch-eager | 1 | 256 | 34.46941 | 7426.875 | 31.69294 | 31.55277 | 31.53167 | 31.71414 | 8.207787 | 62.37979 |
| pytorch-eager | 1 | 1024 | 103.928 | 9852.975 | 31.84283 | 31.40424 | 31.80877 | 31.43787 | 8.408238 | 152.2317 |
| pytorch-eager | 1 | 2048 | 244.3801 | 8380.386 | 32.11394 | 31.13912 | 32.11288 | 31.14015 | 8.720115 | 264.2167 |
| pytorch-eager | 1 | 3840 | 611.0726 | 6284.032 | 32.04668 | 31.20448 | 32.02001 | 31.23048 | 9.293344 | 440.7455 |
| pytorch-eager | 4 | 16 | 32.7481 | 1954.312 | 31.60442 | 126.5646 | 31.45407 | 127.1696 | 8.18083 | 132.9938 |
| pytorch-eager | 4 | 64 | 33.18802 | 7713.626 | 31.20292 | 128.1931 | 31.26663 | 127.9319 | 8.132635 | 157.3906 |
| pytorch-eager | 4 | 256 | 89.22571 | 11476.51 | 31.29607 | 127.8116 | 31.29617 | 127.8111 | 8.248695 | 248.2817 |
| pytorch-eager | 4 | 1024 | 392.79 | 10427.96 | 31.26839 | 127.9247 | 31.22812 | 128.0897 | 8.707226 | 588.0174 |
| pytorch-eager | 4 | 2048 | 955.0025 | 8577.988 | 31.27921 | 127.8805 | 31.28768 | 127.8458 | 9.992102 | 922.3284 |
| pytorch-eager | 4 | 3840 | 2467.054 | 6226.05 | 31.35273 | 127.5806 | 31.33206 | 127.6647 | 15.97773 | 1025.427 |
| pytorch-eager | 16 | 16 | 33.24208 | 7701.083 | 31.49257 | 508.0563 | 32.28204 | 495.6316 | 8.396241 | 518.3272 |
| pytorch-eager | 16 | 64 | 86.10473 | 11892.49 | 31.38509 | 509.7962 | 32.1652 | 497.432 | 8.467185 | 604.6874 |
| pytorch-eager | 16 | 256 | 332.7774 | 12308.53 | 32.50902 | 492.171 | 32.52583 | 491.9167 | 8.955728 | 914.7219 |
| pytorch-eager | 16 | 1024 | 1543.535 | 10614.59 | 33.13551 | 482.8656 | 33.08991 | 483.5311 | 16.21622 | 1262.933 |
| pytorch-eager | 16 | 2048 | 3856.058 | 8497.797 | 33.05266 | 484.0761 | 33.0516 | 484.0915 | 28.2955 | 1302.822 |
| pytorch-eager | 16 | 3840 | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| pytorch-compile | 1 | 16 | 12.97314 | 1233.317 | 15.06754 | 66.36782 | 14.94154 | 66.9275 | 3.949777 | 68.86465 |
| pytorch-compile | 1 | 64 | 13.30206 | 4811.285 | 14.92108 | 67.01927 | 14.79991 | 67.56797 | 3.913011 | 81.77846 |
| pytorch-compile | 1 | 256 | 21.05542 | 12158.39 | 14.74479 | 67.82056 | 14.73028 | 67.88739 | 3.938656 | 129.9936 |
| pytorch-compile | 1 | 1024 | 75.77764 | 13513.22 | 14.73202 | 67.87934 | 14.64023 | 68.30496 | 4.274032 | 299.483 |
| pytorch-compile | 1 | 2048 | 159.0262 | 12878.38 | 14.77194 | 67.69592 | 14.69836 | 68.0348 | 5.412601 | 425.6734 |
| pytorch-compile | 1 | 3840 | 339.8554 | 11298.92 | 14.06384 | 71.10431 | 14.04383 | 71.20565 | 7.099414 | 576.949 |
| pytorch-compile | 4 | 16 | 14.82386 | 4317.365 | 14.85815 | 269.2126 | 14.85617 | 269.2483 | 3.927973 | 276.9877 |
| pytorch-compile | 4 | 64 | 20.7674 | 12327.01 | 14.78437 | 270.556 | 14.80803 | 270.1237 | 3.955819 | 323.574 |
| pytorch-compile | 4 | 256 | 70.34887 | 14556.03 | 14.9404 | 267.7305 | 14.9841 | 266.9496 | 4.313815 | 474.7537 |
| pytorch-compile | 4 | 1024 | 290.767 | 14086.88 | 15.47449 | 258.49 | 15.64255 | 255.7128 | 6.704303 | 763.6886 |
| pytorch-compile | 4 | 2048 | 644.6995 | 12706.69 | 17.22421 | 232.2313 | 17.09852 | 233.9384 | 10.17314 | 905.9152 |
| pytorch-compile | 4 | 3840 | 1488.37 | 10320.01 | 17.54926 | 227.9298 | 16.19208 | 247.0344 | 16.26342 | 1007.414 |
| pytorch-compile | 16 | 16 | 20.62188 | 12414 | 16.14968 | 990.7319 | 17.28572 | 925.6194 | 5.190115 | 838.5171 |
| pytorch-compile | 16 | 64 | 68.86672 | 14869.3 | 15.93814 | 1003.881 | 17.00272 | 941.0257 | 5.524729 | 926.7423 |
| pytorch-compile | 16 | 256 | 262.5498 | 15600.85 | 16.28529 | 982.4817 | 18.87143 | 847.8423 | 7.905223 | 1036.277 |
| pytorch-compile | 16 | 1024 | 1134.517 | 14441.39 | 19.28937 | 829.4722 | 20.54817 | 778.6581 | 16.55617 | 1237.001 |
| pytorch-compile | 16 | 2048 | 3682.501 | 8898.3 | 32.4632 | 492.8657 | 32.31265 | 495.1621 | 28.07167 | 1313.21 |
| pytorch-compile | 16 | 3840 | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
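
The throughput columns in the 7B results above appear to follow directly from the latency columns. The sketch below shows that relationship as inferred from the published numbers, not taken from the benchmark script itself. The function names are hypothetical, and the assumption that wall-clock throughput counts the prompt tokens plus 256 generated tokens is a guess that matches the rows spot-checked here.

```python
def prompt_throughput_tps(batch_size: int, prompt_length: int, latency_ms: float) -> float:
    """Tokens per second while processing the prompt (all prompt tokens in one pass)."""
    return batch_size * prompt_length / (latency_ms / 1000.0)


def generation_throughput_tps(batch_size: int, per_token_latency_ms: float) -> float:
    """Tokens per second during generation (one new token per sequence per step)."""
    return batch_size / (per_token_latency_ms / 1000.0)


def wall_clock_throughput_tps(
    batch_size: int,
    prompt_length: int,
    wall_clock_s: float,
    generated_tokens: int = 256,  # assumption: each run generates 256 new tokens
) -> float:
    """Total tokens (prompt + generated) handled per second of wall-clock time."""
    return batch_size * (prompt_length + generated_tokens) / wall_clock_s


if __name__ == "__main__":
    # Values from the onnxruntime row with batch size 1 and prompt length 16.
    print(prompt_throughput_tps(1, 16, 11.967659))      # table lists 1336.936489
    print(generation_throughput_tps(1, 10.52479073))    # table lists 95.01376562
    print(wall_clock_throughput_tps(1, 16, 3.08197999)) # table lists 88.25495327
```

Reading the table this way explains why generation throughput scales almost linearly with batch size while per-token latency stays nearly flat: each decoding step emits one token per sequence in the batch.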
LLaMA-2 13B FP16 CUDA (1 A100 80GB)
Engine
Batch Size
Prompt Length
Prompt Processing Latency (ms)
Prompt Processing Throughput (tps)
Average Latency of First 128 Tokens Generated (ms)
Average Throughput of First 128 Tokens Generated (tps)
Average Latency of First 256 Tokens Generated (ms)
Average Throughput of First 256 Tokens Generated (tps)