[V1][Spec Decode] Ngram Spec Decode #12193
Merged
Commits (92, all by LiuXiaoxuanPKU unless noted otherwise)

- 6071a4b skeleton
- be798e7 runnable but incorrect
- f0976dd fix
- 6039933 pass for simple non spec case
- 03cd3dd pass args and minor variable name bug fix
- 26ba690 minor
- ca5e0dd minimal example
- 704634c Merge branch 'main' of github.com:LiuXiaoxuanPKU/vllm into ngram
- 62012d1 minor
- 7bd3f27 format
- b0c5d25 basic test
- d5ee081 minor
- 4e11585 minor
- f915eda stop checking
- bd8ac07 test for stop checking
- 008a41e style and disable scheduling chunked requests
- 784b24a signed-off-by
- f3f6ebc ngram proposer
- 5e7306e style and minor output token fix
- a26df8d partial cleanup & update the kmp
- eeab204 minor
- 6772e07 minor
- a5932a7 fix comments
- a6245e8 minor
- 7890287 merge
- 5d3a31a change sampled_token_ids to tensor
- f7f4c24 minor
- a1eecd3 remove double free
- c843121 fix bug in input batch token id update
- cdcace5 constant list for spec tokens
- 97ea9f2 Merge branch 'main' into ngram
- 2cab6e6 header
- 65875e8 Merge branch 'main' into ngram
- e30bc0c bug fix for invalid token id check
- 18fba42 type
- 7f08f9c Merge commit 'e30bc0cd' into ngram
- ba1d0fd prefix caching + sd
- fc69953 pass in max_spec_num
- 970a91a fix block calcaulation
- 7ecb668 minor
- 10b3fe6 merge
- acda923 fix comments
- 2006c75 fix test
- 2ad4f39 fix test
- faafcb6 fix test
- 038e203 merge test
- 4dc0f87 stop checking
- 54508d5 kv cache manager
- c25b9eb bug fix
- ace9518 merge
- f4ee865 fix scheduler
- d02844a fix scheduler and tests
- 50ab162 Simplify request
- f3b08f4 rejection sampling tests update
- 036a23f Merge branch 'ngram' of github.com:LiuXiaoxuanPKU/vllm into ngram
- 840413b optimize rejection sampler
- 03f6bee static
- 5fb9ac1 format
- 7c0497e Update vllm/v1/core/scheduler.py
- e0bd8cc Update vllm/v1/worker/gpu_model_runner.py
- 95d34f0 fix comments
- 3ff5ead Merge branch 'ngram' of github.com:LiuXiaoxuanPKU/vllm into ngram
- 4cc5f8d minor
- e1654b9 merge
- 4086a77 input prepare
- 1e218af fix input prepare
- 633567a simplify scheduleroutput
- 888f183 change test case to make output more deterministic
- 353c372 update cpu gpu sync
- 9416792 vectorize rejection sampler
- ab22c2d merge
- 54c5fa5 fix comments
- 00b9d69 merge
- 4ea2fda minor
- 0d6d713 minor
- d064a1a Merge branch 'main' into ngram
- af7322e fix input prepare bug
- 4dec71d fix
- 8758b96 fix test (LucasWilkinson)
- 8929ad1 fix comments
- 65bb67f minor fix
- 992aab8 make test more deterministic
- 6608e31 Merge branch 'main' into ngram
- e298bb3 merge conflict
- 4329970 Merge branch 'ngram' of github.com:LiuXiaoxuanPKU/vllm into ngram
- 4e015ae fix
- 5fc5264 fix rejection sampler tests
- b56a8e4 fix num_token
- a669c1c merge
- 2cbf57e fix scheduler test
- 29d3054 fix scheduler test, minor
- 2dc7909 fix gpu model runner
New test file added by this PR (diff hunk `@@ -0,0 +1,49 @@`):

```python
# SPDX-License-Identifier: Apache-2.0
import pytest

from vllm import LLM, SamplingParams


@pytest.fixture
def test_prompts():
    return [
        "Can you repeat the sentence ten times, this is a sentence?",
        "This is a basic spec decode test",
    ]


@pytest.fixture
def sampling_config():
    # Only support greedy for now
    return SamplingParams(temperature=0, max_tokens=100, ignore_eos=False)


@pytest.fixture
def model_name():
    return "meta-llama/Meta-Llama-3-8B-Instruct"


def test_ngram_correctness(monkeypatch, test_prompts, sampling_config,
                           model_name):
    '''
    Check that the outputs of the original LLM and the speculative LLM
    are the same when using ngram speculative decoding.
    '''
    with monkeypatch.context() as m:
        m.setenv("VLLM_USE_V1", "1")

        ref_llm = LLM(model=model_name)
        ref_outputs = ref_llm.generate(test_prompts, sampling_config)
        del ref_llm

        spec_llm = LLM(model=model_name,
                       speculative_model='[ngram]',
                       ngram_prompt_lookup_max=5,
                       ngram_prompt_lookup_min=3,
                       num_speculative_tokens=3)
        spec_outputs = spec_llm.generate(test_prompts, sampling_config)
        for ref_output, spec_output in zip(ref_outputs, spec_outputs):
            assert ref_output.outputs[0].text == spec_output.outputs[0].text, \
                (f"ref_output: {ref_output.outputs[0].text},"
                 f"spec_output: {spec_output.outputs[0].text}")
        del spec_llm
```
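For context, the "ngram" speculative method exercised by this test (prompt lookup decoding) proposes draft tokens by matching the last n generated tokens against earlier context and replaying whatever followed the match. The sketch below is illustrative only; the function name and structure are hypothetical and do not reflect vLLM's actual `NgramProposer`, which among other things uses a KMP-style matcher rather than this naive scan.

```python
from typing import Optional


def propose_ngram_drafts(token_ids: list[int],
                         ngram_min: int = 3,
                         ngram_max: int = 5,
                         num_speculative_tokens: int = 3) -> Optional[list[int]]:
    """Return up to num_speculative_tokens draft tokens, or None if no match.

    Illustrative prompt-lookup sketch: find the most recent earlier
    occurrence of the current suffix n-gram and propose the tokens
    that followed it.
    """
    # Try the longest suffix first: longer matches tend to be more reliable.
    for n in range(ngram_max, ngram_min - 1, -1):
        if len(token_ids) < n + 1:
            continue
        suffix = token_ids[-n:]
        # Scan earlier positions right-to-left for the same n-gram.
        # (A KMP-style matcher makes this linear; a naive scan shows the idea.)
        for start in range(len(token_ids) - n - 1, -1, -1):
            if token_ids[start:start + n] == suffix:
                follow = token_ids[start + n:start + n + num_speculative_tokens]
                if follow:
                    return follow
    return None
```

With greedy sampling (temperature=0, as in the test above), accepted drafts cannot change the output, which is why the test can assert exact text equality between the reference and speculative runs.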