
[Benchmark] Add sampling parameters to benchmark_serving. #16022

Merged · 4 commits · Apr 6, 2025

Conversation

@hyeygit (Contributor) commented Apr 3, 2025

Allow specifying sampling parameters (top-k, top-p, etc.) in the online benchmark. This is done by adding the sampling parameters to the "extra_body" field of the client request.

New command-line flags (--top-p, --top-k, --min-p, and --temperature) are added to benchmark_serving.py.
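
For illustration, here is a minimal sketch of how sampling parameters can be forwarded through "extra_body" with the OpenAI Python client against a vLLM OpenAI-compatible server. This is not the PR's exact code; the base URL, API key, and model name are placeholder assumptions.

    # Sketch only: forward sampling params via "extra_body" to a local
    # OpenAI-compatible vLLM server. base_url, api_key and model are assumed.
    from openai import OpenAI

    sampling_params = {
        "top_p": 0.8,
        "top_k": 10,
        "min_p": None,        # unset flags stay None ...
        "temperature": 0.0,   # 0.0 -> greedy decoding
    }
    # ... and are dropped so the server keeps its own defaults.
    extra_body = {k: v for k, v in sampling_params.items() if v is not None}

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    completion = client.completions.create(
        model="meta-llama/Llama-3.1-8B",
        prompt="San Francisco is a",
        max_tokens=32,
        extra_body=extra_body,
    )
    print(completion.choices[0].text)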

Example benchmark results

After locally enabling top-k and top-p sampling for TPU, I ran the serving benchmark on v6e-1, Llama3.1-8B, with --top-k=10 and --top-p=0.8. Here are the results before/after the TPU optimization. The no-sampling baseline is also included for comparison.

| Description | Throughput (req/s) |
| --- | --- |
| Baseline (NO sampling) | 5.98 |
| Baseline w/ non-optimized sampling (forward_native) | 0.24 |
| TPU-optimized sampling | 5.57 |

github-actions bot commented Apr 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@hyeygit hyeygit changed the title Add sampling parameters to benchmark_serving. [Benchmark] Add sampling parameters to benchmark_serving. Apr 3, 2025
@hyeygit hyeygit closed this Apr 3, 2025
@hyeygit hyeygit reopened this Apr 3, 2025
@simon-mo (Collaborator) commented Apr 3, 2025

Can you explain why this is needed? Sampling algorithms shouldn't change the perf results much.

@hyeygit (Contributor, Author) commented Apr 3, 2025

> Can you explain why this is needed? Sampling algorithms shouldn't change the perf results much.

@simon-mo Sampling (specifically top-k and top-p) can be slow on certain backends such as XLA/TPU. We recently optimized top-p sampling for TPU to make it orders of magnitude faster (#15736). It's useful to see the impact on end-to-end serving performance (in my local testing, the top-p optimization sped up serving throughput by 23x on TPU).

@simon-mo simon-mo requested a review from ywang96 April 3, 2025 17:28
@ywang96 (Member) commented Apr 3, 2025

@JenZhao Can you give this PR a first pass? Thanks!

@JenZhao (Contributor) commented Apr 3, 2025

Thank you! It looks good to me. Could you provide some examples that demonstrate the impact of sampling on performance results, and also update the README to include this change?

@ywang96 (Member) left a comment

@hyeygit I left two comments - PTAL

Comment on lines 1022 to 1037
sampling_group.add_argument("--top-p",
type=float,
default=None,
help="Top-p sampling parameter.")
sampling_group.add_argument("--top-k",
type=int,
default=None,
help="Top-k sampling parameter.")
sampling_group.add_argument("--min-p",
type=float,
default=None,
help="Min-p sampling parameter.")
sampling_group.add_argument("--temperature",
type=float,
default=None,
help="Temperature sampling parameter.")
@ywang96 (Member):
These are only passed to async_request_openai_completions and async_request_openai_chat_completions. Please update the help messages here to reflect this and IMO we should add an assertion to make sure users are only using openai completions or chat completions whenever any of these is specified.
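
A rough sketch of the kind of guard being suggested here; the backend names and argument attributes below are illustrative assumptions, not the PR's final code.

    # Illustrative sketch of the suggested check; names are assumptions.
    SAMPLING_BACKENDS = ("openai", "openai-chat")

    def check_sampling_args(args) -> None:
        specified = [p for p in (args.top_p, args.top_k, args.min_p)
                     if p is not None]
        if specified:
            assert args.backend in SAMPLING_BACKENDS, (
                "Sampling parameters are only forwarded by the OpenAI "
                "completions / chat completions request functions.")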

@hyeygit (Contributor, Author):
Thanks for the review! Good catch! Done.

Comment on lines 1034 to 1037
sampling_group.add_argument("--temperature",
type=float,
default=None,
help="Temperature sampling parameter.")
@ywang96 (Member):
I think it's probably better if we set a default value of 0.0 here for temperature, or otherwise indicate in the help message that greedy decoding is used if temperature is not specified. WDYT?

@hyeygit (Contributor, Author):
SG. Set the default to 0.0 and also explained in help message.
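
A short sketch of the resulting behavior, assuming the fallback is applied where the request payload is built; the merged code and exact help wording may differ.

    # Sketch of the behavior described above; the merged PR may differ.
    import argparse

    parser = argparse.ArgumentParser()
    sampling_group = parser.add_argument_group("sampling parameters")
    sampling_group.add_argument(
        "--temperature",
        type=float,
        default=None,
        help="Temperature sampling parameter. If not specified, defaults to "
             "greedy decoding (temperature of 0.0).")

    args = parser.parse_args([])
    # Fall back to greedy decoding when the flag is omitted.
    temperature = args.temperature if args.temperature is not None else 0.0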

@NickLucche (Contributor) left a comment

Nice one @hyeygit, I think this is generally useful to test the e2e impact of different sampling implementations!

'top_p': args.top_p,
'top_k': args.top_k,
'min_p': args.min_p,
'temperature': args.temperature
@NickLucche (Contributor):
Shouldn't we remove the default temperature value that appears in async_request_openai_completions and async_request_openai_chat_completions?
Since it gets replaced anyway every time, I feel it hurts readability a bit.

@hyeygit (Contributor, Author):
Thanks for the review, @NickLucche! I think keeping the default temperature in backend_request_func.py is better because that file is likely used by other modules as well, not just benchmark_serving.py, so we probably shouldn't modify its default behavior.
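
To illustrate the design choice being discussed (hypothetical names, not the actual backend_request_func.py API): the shared request helper keeps its own default temperature, and the benchmark only overrides it when the user supplies sampling parameters.

    # Hypothetical illustration of the design choice above; not the actual
    # backend_request_func.py code.
    def build_payload(prompt: str, extra_body: dict | None = None) -> dict:
        payload = {
            "prompt": prompt,
            "temperature": 0.0,  # shared default stays in the helper
        }
        if extra_body:
            payload.update(extra_body)  # benchmark-supplied params win
        return payload

    print(build_payload("hi"))                                     # default kept
    print(build_payload("hi", {"temperature": 0.8, "top_k": 10}))  # overridden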

hyeygit added 2 commits April 4, 2025 19:59
Signed-off-by: Hyesoo Yang <[email protected]>

mergify bot commented Apr 4, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @hyeygit.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 4, 2025
@mergify mergify bot removed the needs-rebase label Apr 4, 2025
@hyeygit (Contributor, Author) commented Apr 4, 2025

Thank you all for the review! Comments addressed. PTAL.

@JenZhao I updated the PR description with example benchmark results. Also updated the README.

@ywang96 (Member) left a comment

LGTM! Thanks for the contribution

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 5, 2025
@DarkLight1337 DarkLight1337 merged commit ba10801 into vllm-project:main Apr 6, 2025
42 checks passed
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
lengrongfu pushed a commit to lengrongfu/vllm that referenced this pull request Apr 7, 2025
@NickLucche NickLucche mentioned this pull request Apr 8, 2025
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025