[Performance] Enable chunked prefill and prefix caching together #7753
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
🚀 |
result seems very good!! |
Hi @comaniac @rkooo567, I would like to draw your attention to my last commit on #6144 (a043643). Without it, this PR is still incorrect, and the error can be reproduced with even a single request:
You will find that with this PR, in the first round tokens[0:64] are prefilled, in the second round tokens[96:119] are prefilled, and the tokens between 64 and 96 are skipped. This is because num_computed_blocks is incorrectly updated to cover the whole block table of the prompt tokens, rather than only the tokens that were prefilled in the first round. |
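A hedged reproduction sketch of the single-request setup described above (the model name and prompt are placeholders, not the reporter's original script; the key settings are a roughly 120-token prompt, max_num_batched_tokens=64, block_size=16, and both features enabled):

# Hypothetical repro sketch; model and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",    # placeholder model
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    max_num_batched_tokens=64,    # chunk size from the report
    block_size=16,
)

prompt = " ".join(str(i) for i in range(120))  # roughly 120 prompt tokens
out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=16))
print(out[0].outputs[0].text)  # garbled output would indicate skipped prompt tokens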
IIUC, this PR already guarantees every sequence will have at least one block to compute even if it fully hits the cache, so it shouldn't trigger the issue you mentioned? If I missed anything, can you modify the unit test added in this PR so that the problem can be exposed and tested? |
It is not about the fully matched case. In the case commented above, there is only one request, the prefill is split into [0:64] and [64:120], and the second part is treated as prefix-matched because computed_block_nums is updated to [0,1,2,3,4,5,6,7] after the first chunked prefill. |
The test case in this PR didn't fail just because the |
The size 14 is used to test an invalid size. The actual size being tested in this case is 16. Meanwhile, I tried 16, 32 and 64, but none of them failed. |
With max_num_batched_tokens=64, you need a sequence length of at least 64 + 2 * block_size (96 with block_size=16) to reproduce the problem; 41 is not enough. max_num_batched_tokens=16 or 32 cannot reproduce the issue either, as the second block is guaranteed to be recomputed in this PR. |
OK, I could reproduce the issue you pointed out. It actually only happens in block manager v1, as block manager v2 doesn't use this mechanism to mark computed blocks. This may also explain the suspiciously good speedup I got. I'll apply your fix in this PR and try to make the test cover this case. |
@sighingnow I applied your commit with some modifications. The test is also changed so that it will fail without fixing the issue in block manager v1. PTAL. |
Thanks! LGTM. |
Looks good. One question: should we just make the scheduler handle prefix caching + chunked prefill correctly and simplify the logic in model_runner?
vllm/core/scheduler.py
Outdated
raise ValueError("When enabling chunked prefill and " | ||
"prefix caching, max_num_batched_tokens " | ||
"(chunk size) must be dividable by " | ||
"block size, but got " |
Can you also print chunk size and block size along with budget.token_budget % block_size?
It now looks like
ValueError: When enabling chunked prefill and prefix caching, max_num_batched_tokens (chunk size) must be dividable by block size, but got chunk_size (30) % block_size (16) = 14
Will the fix for v2 block manager be addressed by this PR as well? The behavior of v2-block-manager looks quite strange and I'm wondering if #7619 is related. |
I have a fix locally, but it would be a separate PR |
Is it for flash-attn backend only or for all backends? |
I've tested flash-attn and FlashInfer so at least these 2 backends work. Need to test xformers later. |
@comaniac https://github.com/vllm-project/vllm/blob/main/vllm/attention/backends/flashinfer.py#L360 Really supported here? |
Yeah I noticed that too so not fully sure what's going on. Will find some time tomorrow for it. |
Updates: |
May I ask why you chose to recompute the whole block if it is fully matched? Recomputing only the last token is enough, requires no changes in the scheduler, and would be a bit more efficient. |
You're right, it would be a bit more efficient to compute only the last token. Meanwhile, I found that it might not be that hard to deal with prefix matching in the scheduler so that this case would never happen in the model runner. I'll give it a try.
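A minimal sketch of the "recompute only the last token" idea discussed above; the function and argument names are illustrative, not vLLM's actual API:

# Sketch only: how many prompt tokens still need to be prefilled when some
# blocks are already cached.
def num_prompt_tokens_to_compute(prompt_len: int, num_computed_blocks: int,
                                 block_size: int) -> int:
    num_cached = num_computed_blocks * block_size
    if num_cached >= prompt_len:
        # Fully matched prompt: leave the last token uncached so the model
        # still runs a (one-token) prefill and produces logits for sampling.
        num_cached = prompt_len - 1
    return prompt_len - num_cached

# Example: a 128-token prompt whose 8 blocks (block_size=16) are all cached
# still computes exactly one token.
assert num_prompt_tokens_to_compute(128, 8, 16) == 1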
Co-authored-by: Tao He <[email protected]> Co-authored-by: Juelianqvq <[email protected]>
force-pushed from b305e0d to 324fcec
@sighingnow changed to re-compute only the last token. PTAL. @rkooo567 I've tried to move prefix caching to the scheduler and it's actually easy for the default scheduler. For chunked prefill, we have to refactor the scheduler (e.g., |
Generally looks good. I'd like to actually also add a warning if the block size is big and prefix caching + CP is enabled (because it can waste a lot of tokens). Maybe if block_size >32, we can print a warning?
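A sketch of the follow-up warning suggested here; the threshold of 32, the logger name, and the wording are assumptions taken from this comment, not vLLM code:

import logging

logger = logging.getLogger("vllm.config")  # hypothetical logger name

def maybe_warn_large_block_size(block_size: int, enable_prefix_caching: bool,
                                enable_chunked_prefill: bool) -> None:
    # Large blocks waste more of the token budget because chunk boundaries
    # are aligned to block boundaries when both features are enabled.
    if enable_prefix_caching and enable_chunked_prefill and block_size > 32:
        logger.warning(
            "Prefix caching with chunked prefill aligns chunks to block "
            "boundaries; block_size=%d may waste a large share of the "
            "token budget per step.", block_size)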
@@ -595,3 +595,43 @@ def test_sliding_window_multi_seq():

    # assert all blocks are free now
    assert block_manager.get_num_free_gpu_blocks() == num_gpu_blocks


def test_mark_blocks_as_computed_with_prefix_cache_and_chunked_prefill():
Do we have a corresponding test in v2?
We don't need to test v2 because v2 automatically marks touched blocks as computed.
# to avoid partial block matching.
block_size = self.cache_config.block_size
reminder = budget.token_budget % block_size
if reminder != 0:
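For reference, a self-contained sketch of the check quoted above; variable names and spelling ("reminder", "dividable") follow the quoted diff, and the surrounding scheduler plumbing is omitted:

def _check_chunk_size(token_budget: int, block_size: int) -> None:
    # With prefix caching + chunked prefill, the chunk size must be a
    # multiple of the block size so a chunk never ends mid-block.
    reminder = token_budget % block_size
    if reminder != 0:
        raise ValueError("When enabling chunked prefill and "
                         "prefix caching, max_num_batched_tokens "
                         "(chunk size) must be dividable by "
                         "block size, but got "
                         f"chunk_size ({token_budget}) % "
                         f"block_size ({block_size}) = {reminder}")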
Btw, should we raise this exception at engine start time instead and just add an assert here?
I feel we could just raise here for now, because this constraint should be removable once we refactor the scheduler to consider prefix caching.
Sure I'll add the warning in a follow-up PR. |
Since this PR has been merged, both #6144 and #6819 can be closed. Would you be willing to add me and @sighingnow as co-authors? @comaniac |
Ah I intended to do that. Actually I put you two as co-authors in one commit of this PR and I thought it should work when the PR is merged but somehow it didn't...let me try to figure out how to fix that. Also cc @simon-mo |
To whom it may concern: after this PR there are still occasional crashes when prefix caching and chunked prefill are enabled at the same time on NVIDIA GPUs (inside the flash_attn_varlen_func function in the prefix-enabled attention branch). I investigated the kernel input, found nothing wrong, and cannot reproduce the crash when running the kernel standalone with the pickled saved inputs. I think there are still overflow bugs inside vllm-flash-attention; setting block_size to 256 fixed the issue and the crash disappeared under high pressure. |
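A hedged example of the reported workaround; whether block_size=256 is accepted depends on the vLLM version and attention backend, and the model name is a placeholder:

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    block_size=256,  # larger blocks avoided the crash for the reporter
)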
This looks like a serious bug that needs to be fixed before it can go to production. Thanks for sharing the workaround solution as well. |
If you are using a model with […], chunked prefill may be silently disabled when prefix caching is enabled (see lines 866 to 891 in da1a844). This means that a user relying on the automatic enabling of chunked prefill might not notice it becoming disabled when they enable prefix caching. |
cc @comaniac |
Good point. I'll file another PR to fix it. |
…m-project#7753) Signed-off-by: Alvant <[email protected]>
…m-project#7753) Signed-off-by: LeiWang1999 <[email protected]>
Reference PRs: #6144, #6819
Makes @sighingnow and @Juelianqvq co-authors of this PR.
This PR allows prefix caching and chunked prefill to be enabled together. Unlike the reference PRs, it simplifies the logic for dealing with partial blocks (thanks to @rkooo567 for the suggestion). Here is the execution flow: when chunking a prefill, the scheduled tokens are aligned to the block size to avoid partial block matching, e.g., with a budget of 133 and block size 16, only (133//16)*16=112 tokens are scheduled. Although this approach wastes some token budget, it makes the following process straightforward. A test case for functional correctness is also added.
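A minimal illustration of the block-alignment arithmetic in the example above:

block_size = 16
token_budget = 133
aligned = (token_budget // block_size) * block_size  # round down to a block boundary
assert aligned == 112  # 112 tokens are scheduled; 21 of the 133 budgeted tokens go unused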
Throughput benchmarking results:
cc @rkooo567