Support SHA256 as hash function in prefix caching #15297
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM. Only nits.
CI fails in V1 Test for
This seems unrelated. Earlier there is also a test failure in v1/entrypoints/llm/test_struct_output_generate.py::test_guided_json_completion[Qwen/Qwen2.5-1.5B-Instruct-guidance]:
The input contains a series of '\n' and there is no detailed error message for the failure. I am rebasing on main while addressing the review comments to see if that helps.
LGTM
Context
Prefix caching may access blocks from unrelated requests due to a hash collision. With a large number of blocks in the cache (e.g., 20k), a hash collision is rare but can happen. Also, since the built-in `hash()` is used, this could potentially be exploited.

To fix this, I want to replace `hash()` with `hashlib.sha256()`, increasing the hash size to 32 bytes. Replacing the hash function was already discussed in #12621.
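To illustrate the idea (this is a minimal sketch, not the actual vLLM implementation; the function name and key layout are assumptions), a SHA-256 based block hash could chain the parent block's digest with the block's token IDs:

```python
# Sketch only: hash a block of token IDs with SHA-256 instead of hash().
import hashlib
import pickle
from typing import Optional


def sha256_block_hash(parent_hash: Optional[bytes],
                      token_ids: tuple[int, ...]) -> bytes:
    """Return a 32-byte digest chaining the parent block hash and token IDs."""
    # pickle gives a deterministic byte representation of the key within a
    # single Python run; the digest is 32 bytes instead of a 64-bit int.
    key = (parent_hash, token_ids)
    return hashlib.sha256(pickle.dumps(key)).digest()


# Example: hash two consecutive blocks with a block size of 4 tokens.
first = sha256_block_hash(None, (1, 2, 3, 4))
second = sha256_block_hash(first, (5, 6, 7, 8))
print(first.hex(), second.hex())
```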
Runtime micro benchmark
Benchmarking the provided implementation using sha256, the added overhead for a 50k-token context length is around 3 ms on an Apple M2. Relative to a TTFT of 1.6 s with prefix caching of 50k tokens on an H100, the added overhead is rather small.
Setup for measuring mean runtime for hashing large context:
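The benchmark script itself is not included above; a comparable setup (assumed for illustration, not the author's exact script) could time `hash()` vs. `hashlib.sha256()` over a 50k-token context split into fixed-size blocks:

```python
# Hypothetical micro-benchmark sketch: compare built-in hash() vs.
# hashlib.sha256() for chained block hashing of a 50k-token context.
import hashlib
import pickle
import random
import timeit

BLOCK_SIZE = 16
NUM_TOKENS = 50_000
tokens = [random.randrange(0, 32_000) for _ in range(NUM_TOKENS)]
blocks = [tuple(tokens[i:i + BLOCK_SIZE])
          for i in range(0, NUM_TOKENS, BLOCK_SIZE)]


def hash_builtin() -> None:
    prev = None
    for block in blocks:
        prev = hash((prev, block))


def hash_sha256() -> None:
    prev = None
    for block in blocks:
        prev = hashlib.sha256(pickle.dumps((prev, block))).digest()


for fn in (hash_builtin, hash_sha256):
    runs = 20
    total = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {total / runs * 1e3:.2f} ms per 50k-token pass")
```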
Apple M2 (mean error < 40us):
Intel Xeon @ 2.20GHz (mean error < 90us):
`sha256` vs. `hash`: for reference, comparing sha256+pickle vs. hash+pickle to see the sha256 vs. hash overhead. Full code to be added.
Please take a look @comaniac