[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path #8378
@@ -983,9 +983,16 @@ def __init__(self,
                  policy: str = "fcfs") -> None:
         if max_num_batched_tokens is None:
             if enable_chunked_prefill:
-                # It is the values that have the best balance between ITL
-                # and TTFT on A100. Note it is not optimized for throughput.
-                max_num_batched_tokens = 512
+                if num_scheduler_steps > 1:
+                    # Multi-step Chunked-Prefill doesn't allow prompt-chunking
+                    # for now. Have max_num_batched_tokens set to max_model_len
+                    # so we don't reject sequences on account of a short
+                    # max_num_batched_tokens.
+                    max_num_batched_tokens = max(max_model_len, 2048)
comaniac marked this conversation as resolved.

Review comment: Why not simply set max_num_batched_tokens = max_model_len? What's the reason for 2048 here?

Reply: There is an

Reply: I replicated the same. This argument is the token budget in the Scheduler. I believe it is so we can schedule more prefills and not be limited by the small

Reply: I see, ok
+                else:
+                    # It is the values that have the best balance between ITL
+                    # and TTFT on A100. Note it is not optimized for throughput.
+                    max_num_batched_tokens = 512
             else:
                 # If max_model_len is too short, use 2048 as the default value
                 # for higher throughput.
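The defaulting logic in this hunk can be sketched as a standalone function. This is an illustrative reconstruction only, not vLLM's actual SchedulerConfig.__init__; the helper name default_max_num_batched_tokens is made up here, while the parameter names and the chosen values (512, 2048, max_model_len) follow the diff.

```python
from typing import Optional


def default_max_num_batched_tokens(
        max_num_batched_tokens: Optional[int],
        enable_chunked_prefill: bool,
        num_scheduler_steps: int,
        max_model_len: int) -> int:
    """Sketch of the token-budget defaulting in this PR's hunk."""
    if max_num_batched_tokens is None:
        if enable_chunked_prefill:
            if num_scheduler_steps > 1:
                # Multi-step chunked-prefill doesn't allow prompt-chunking,
                # so the budget must admit a whole prompt: use at least
                # max_model_len, floored at 2048.
                max_num_batched_tokens = max(max_model_len, 2048)
            else:
                # Value tuned for ITL/TTFT balance on A100 (per the code
                # comment in the diff); not optimized for throughput.
                max_num_batched_tokens = 512
        else:
            # If max_model_len is too short, use 2048 as the default
            # for higher throughput.
            max_num_batched_tokens = max(max_model_len, 2048)
    return max_num_batched_tokens
```

An explicitly passed value is always kept; the defaults only apply when the argument is None.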
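The concern raised in the review thread can be illustrated directly: when prompt chunking is disabled, a prefill can only be scheduled if the entire prompt fits in one batch's token budget, so a budget smaller than max_model_len would reject otherwise-valid sequences. The helper below is hypothetical (not vLLM's actual scheduler code) and exists only to show that relationship.

```python
def can_schedule_prefill(prompt_len: int, token_budget: int,
                         allow_chunking: bool) -> bool:
    """Hypothetical admission check illustrating the token-budget concern.

    With chunking, a long prompt can be split across scheduler steps;
    without it, the whole prompt must fit into the budget at once.
    """
    if allow_chunking:
        return True
    return prompt_len <= token_budget


# A 4096-token prompt against the single-step chunked-prefill default of 512:
assert can_schedule_prefill(4096, 512, allow_chunking=True)
# Without chunking (the multi-step case), 512 would reject it, which is why
# the diff raises the budget to max(max_model_len, 2048):
assert not can_schedule_prefill(4096, 512, allow_chunking=False)
assert can_schedule_prefill(4096, max(4096, 2048), allow_chunking=False)
```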