[Benchmark] Add sampling parameters to benchmark_serving. #16022
Conversation
Allow specifying sampling params (top-k, top-p etc) in the online benchmark. This is done by adding the params to the "extra_body" field in the client request. Signed-off-by: Hyesoo Yang <[email protected]>
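For illustration, a minimal sketch of the idea, assuming an OpenAI-compatible server on localhost (model name, URL, and prompt are placeholders, not the PR's actual code): sampling parameters that are not first-class client arguments can ride along in the request body via extra_body.

from openai import OpenAI

# Hypothetical example: forward top_k / min_p to the server through extra_body,
# while top_p and temperature use the client's regular arguments.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="meta-llama/Llama-3.1-8B",
    prompt="The quick brown fox",
    max_tokens=64,
    temperature=0.0,
    top_p=0.8,
    extra_body={"top_k": 10, "min_p": 0.0},
)
print(resp.choices[0].text)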
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Can you explain why this is needed? Sampling algorithms shouldn't change the perf result much.
@simon-mo Sampling (specifically top-k and top-p) can be slow on certain backends such as XLA/TPU. We recently optimized top-p sampling for TPU to make it orders of magnitude faster (#15736). It would be nice to see the impact on end-to-end serving performance (and based on my local testing the top-p optimization sped up serving throughput by 23X on TPU).
@JenZhao Can you give this PR a first pass? Thanks!
Thank you! It looks good to me. Could you provide some examples that demonstrate the impact of sampling on performance results, and also update the README to include this change? |
@hyeygit I left two comments - PTAL
benchmarks/benchmark_serving.py
sampling_group.add_argument("--top-p", | ||
type=float, | ||
default=None, | ||
help="Top-p sampling parameter.") | ||
sampling_group.add_argument("--top-k", | ||
type=int, | ||
default=None, | ||
help="Top-k sampling parameter.") | ||
sampling_group.add_argument("--min-p", | ||
type=float, | ||
default=None, | ||
help="Min-p sampling parameter.") | ||
sampling_group.add_argument("--temperature", | ||
type=float, | ||
default=None, | ||
help="Temperature sampling parameter.") |
These are only passed to async_request_openai_completions and async_request_openai_chat_completions. Please update the help messages here to reflect this, and IMO we should add an assertion to make sure users are only using openai completions or chat completions whenever any of these is specified.
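A rough sketch of the kind of guard being suggested (backend names and the args attributes are assumptions, not the PR's actual code):

# Hypothetical sketch: fail fast if sampling flags are combined with a
# backend that would silently ignore them.
SAMPLING_FLAGS = ("top_p", "top_k", "min_p", "temperature")
OPENAI_BACKENDS = ("openai", "openai-chat")

if any(getattr(args, flag) is not None for flag in SAMPLING_FLAGS):
    assert args.backend in OPENAI_BACKENDS, (
        "Sampling parameters are only forwarded for the OpenAI completions "
        "and chat completions backends.")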
Thanks for the review! Good catch! Done.
benchmarks/benchmark_serving.py
sampling_group.add_argument("--temperature", | ||
type=float, | ||
default=None, | ||
help="Temperature sampling parameter.") |
I think it's probably better if we set a default value of 0.0 for temperature here, or otherwise indicate in the help message that greedy decoding is used if temperature is not specified. WDYT?
SG. Set the default to 0.0 and also explained in help message.
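For illustration, the revised argument might look roughly like this (the help wording is an assumption, not the exact text merged in the PR):

# Hypothetical sketch: default to 0.0 so the benchmark uses greedy decoding
# unless the user explicitly asks for sampling.
sampling_group.add_argument(
    "--temperature",
    type=float,
    default=0.0,
    help="Temperature sampling parameter. Defaults to 0.0, i.e. greedy "
         "decoding, if not specified.")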
Nice one @hyeygit , I think this is generally useful to test the e2e impact of different sampling implementations!
benchmarks/benchmark_serving.py
'top_p': args.top_p,
'top_k': args.top_k,
'min_p': args.min_p,
'temperature': args.temperature
Shouldn't we remove the default temperature value that appears in async_request_openai_completions and async_request_openai_chat_completions? Since it gets replaced anyway every time, I feel it hurts readability a bit.
Thanks for the review @NickLucche ! I think keeping the default temperature in backend_request_func.py might be better because this file is likely used by other modules as well, not just by benchmark_serving.py, so we probably shouldn't modify its default behavior.
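A minimal sketch of the resulting pattern, assuming the request payload is built as a plain dict (function and field names here are illustrative, not the module's exact code): the request helper keeps its own default temperature, and values collected from the new CLI flags simply override it when present.

# Hypothetical sketch: backend_request_func.py keeps a default temperature;
# sampling params forwarded from benchmark_serving.py take precedence.
def build_payload(model: str, prompt: str, max_tokens: int,
                  extra_body: dict | None = None) -> dict:
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # default kept in the request helper
    }
    if extra_body:
        # --top-p/--top-k/--min-p/--temperature land here and override defaults.
        payload.update(extra_body)
    return payload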
Signed-off-by: Hyesoo Yang <[email protected]>
Signed-off-by: Hyesoo Yang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Hyesoo Yang <[email protected]>
Thank you all for the review! Comments addressed. PTAL. @JenZhao I updated the PR description with example benchmark results. Also updated the README.
LGTM! Thanks for the contribution
…ct#16022) Signed-off-by: Hyesoo Yang <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>
…ct#16022) Signed-off-by: Hyesoo Yang <[email protected]>
…ct#16022) Signed-off-by: Hyesoo Yang <[email protected]>
Allow specifying sampling params (top-k, top-p, etc.) in the online benchmark. This is done by adding the sampling params to the "extra_body" field in the client request.

New command line flags (--top-p, --top-k, --min-p, and --temperature) are added to benchmark_serving.py.
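For example, an invocation with the new flags might look like the following (the backend, model, and dataset flags shown are placeholders/assumptions, not the exact command used for the results below):

python benchmarks/benchmark_serving.py \
    --backend openai \
    --model meta-llama/Llama-3.1-8B \
    --dataset-name random \
    --top-k 10 \
    --top-p 0.8 \
    --temperature 1.0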
Example benchmark results
After locally enabling top-k and top-p sampling for TPU, I ran the serving benchmark on v6e-1 with Llama3.1-8B, using --top-k=10 and --top-p=0.8. Here are the results before/after the TPU optimization. The no-sampling baseline is also included for comparison.