[Benchmark] Add sampling parameters to benchmark_serving. #16022
Conversation
Allow specifying sampling params (top-k, top-p etc) in the online benchmark. This is done by adding the params to the "extra_body" field in the client request. Signed-off-by: Hyesoo Yang <[email protected]>
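For illustration, a minimal sketch of the idea, assuming an OpenAI-compatible server on localhost (model name, URL, and prompt are placeholders, not the PR's actual code): sampling parameters that are not first-class client arguments can ride along in the request body via extra_body.

from openai import OpenAI

# Hypothetical example: forward top_k / min_p to the server through extra_body,
# while top_p and temperature use the client's regular arguments.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="meta-llama/Llama-3.1-8B",
    prompt="The quick brown fox",
    max_tokens=64,
    temperature=0.0,
    top_p=0.8,
    extra_body={"top_k": 10, "min_p": 0.0},
)
print(resp.choices[0].text)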
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Can you explain why this is needed? Sampling algorithms shouldn't change the perf result much.
@simon-mo Sampling (specifically top-k and top-p) can be slow on certain backends such as XLA/TPU. We recently optimized top-p sampling for TPU to make it orders of magnitude faster (#15736). It would be nice to see the impact on end-to-end serving performance (and based on my local testing the top-p optimization sped up serving throughput by 23X on TPU).
@JenZhao Can you give this PR a first pass? Thanks!
Thank you! It looks good to me. Could you provide some examples that demonstrate the impact of sampling on performance results, and also update the README to include this change? |
@hyeygit I left two comments - PTAL
benchmarks/benchmark_serving.py
sampling_group.add_argument("--top-p", | ||
type=float, | ||
default=None, | ||
help="Top-p sampling parameter.") | ||
sampling_group.add_argument("--top-k", | ||
type=int, | ||
default=None, | ||
help="Top-k sampling parameter.") | ||
sampling_group.add_argument("--min-p", | ||
type=float, | ||
default=None, | ||
help="Min-p sampling parameter.") | ||
sampling_group.add_argument("--temperature", | ||
type=float, | ||
default=None, | ||
help="Temperature sampling parameter.") |
These are only passed to async_request_openai_completions and async_request_openai_chat_completions. Please update the help messages here to reflect this, and IMO we should add an assertion to make sure users are only using openai completions or chat completions whenever any of these is specified.
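A rough sketch of the kind of guard being suggested (backend names and the args attributes are assumptions, not the PR's actual code):

# Hypothetical sketch: fail fast if sampling flags are combined with a
# backend that would silently ignore them.
SAMPLING_FLAGS = ("top_p", "top_k", "min_p", "temperature")
OPENAI_BACKENDS = ("openai", "openai-chat")

if any(getattr(args, flag) is not None for flag in SAMPLING_FLAGS):
    assert args.backend in OPENAI_BACKENDS, (
        "Sampling parameters are only forwarded for the OpenAI completions "
        "and chat completions backends.")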
Thanks for the review! Good catch! Done.
benchmarks/benchmark_serving.py
sampling_group.add_argument("--temperature", | ||
type=float, | ||
default=None, | ||
help="Temperature sampling parameter.") |
I think it's probably better if we set a default value of 0.0 for temperature here, or otherwise indicate in the help message that greedy decoding is used if temperature is not specified. WDYT?
SG. Set the default to 0.0 and also explained in help message.
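For illustration, the revised argument might look roughly like this (the help wording is an assumption, not the exact text merged in the PR):

# Hypothetical sketch: default to 0.0 so the benchmark uses greedy decoding
# unless the user explicitly asks for sampling.
sampling_group.add_argument(
    "--temperature",
    type=float,
    default=0.0,
    help="Temperature sampling parameter. Defaults to 0.0, i.e. greedy "
         "decoding, if not specified.")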
Nice one @hyeygit , I think this is generally useful to test the e2e impact of different sampling implementations!
benchmarks/benchmark_serving.py
'top_p': args.top_p,
'top_k': args.top_k,
'min_p': args.min_p,
'temperature': args.temperature
Shouldn't we remove the default temperature value that appears in async_request_openai_completions and async_request_openai_chat_completions? Since it gets replaced anyway every time, I feel it hurts readability a bit.
Thanks for the review @NickLucche ! I think keeping the default temperature in backend_request_func.py might be better because this file is likely used by other modules as well, not just by benchmark_serving.py, so we probably shouldn't modify its default behavior.
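A minimal sketch of the resulting pattern, assuming the request payload is built as a plain dict (function and field names here are illustrative, not the module's exact code): the request helper keeps its own default temperature, and values collected from the new CLI flags simply override it when present.

# Hypothetical sketch: backend_request_func.py keeps a default temperature;
# sampling params forwarded from benchmark_serving.py take precedence.
def build_payload(model: str, prompt: str, max_tokens: int,
                  extra_body: dict | None = None) -> dict:
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # default kept in the request helper
    }
    if extra_body:
        # --top-p/--top-k/--min-p/--temperature land here and override defaults.
        payload.update(extra_body)
    return payload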
Signed-off-by: Hyesoo Yang <[email protected]>
Signed-off-by: Hyesoo Yang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Hyesoo Yang <[email protected]>
Thank you all for the review! Comments addressed. PTAL. @JenZhao I updated the PR description with example benchmark results. Also updated the README.
LGTM! Thanks for the contribution
…ct#16022) Signed-off-by: Hyesoo Yang <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>
…ct#16022) Signed-off-by: Hyesoo Yang <[email protected]>
…ct#16022) Signed-off-by: Hyesoo Yang <[email protected]>
Allow specifying sampling params (top-k, top-p, etc.) in the online benchmark. This is done by adding the sampling params to the "extra_body" field in the client request.

New command line flags (--top-p, --top-k, --min-p, and --temperature) are added to benchmark_serving.py.
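For example, an invocation with the new flags might look like the following (the backend, model, and dataset flags shown are placeholders/assumptions, not the exact command used for the results below):

python benchmarks/benchmark_serving.py \
    --backend openai \
    --model meta-llama/Llama-3.1-8B \
    --dataset-name random \
    --top-k 10 \
    --top-p 0.8 \
    --temperature 1.0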
Example benchmark results
After locally enabling top-k and top-p sampling for TPU, I ran the serving benchmark on v6e-1 with Llama3.1-8B, using --top-k=10 and --top-p=0.8. Here are the results before/after the TPU optimization. The no-sampling baseline is also included for comparison.