Add documentation for expectations and standards related to model formats/configs #24
Conversation
Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.
📦 Build Artifacts Available
"depth": 5 | ||
} | ||
], | ||
"default_proposal_method": "tree", |
Just using this EAGLE example to talk this through ...
Currently, vLLM only supports greedy sampling with EAGLE. See https://github.com/vllm-project/vllm/blob/58738772410c5e0d60b61db39538a9b313d2d7ad/vllm/v1/spec_decode/eagle.py#L182
The discussion in vllm-project/vllm#16899 suggests that if vLLM does add support for random sampling, it should follow the sampling params of the target model in order to have the best chance of matching its distribution.
Tree decoding is proposed in vllm-project/vllm#17560 - tree depth and num_spec_expand are configurable via --speculative-config.
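For concreteness, this is roughly how those knobs surface in vLLM's offline API today. The `method`/`model`/`num_speculative_tokens` keys exist in vLLM's speculative config; the commented-out tree keys are my assumption based on #17560, not settled API, and the draft model path is just an example:

```python
# Sketch: passing speculative-decoding options to vLLM's offline API.
# "method", "model", and "num_speculative_tokens" are real config keys;
# the tree-specific keys below are assumptions based on
# vllm-project/vllm#17560 and may not match the final API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # example draft model
        "num_speculative_tokens": 5,
        # Hypothetical tree-decoding knobs from the PR discussion:
        # "depth": 5,
        # "num_spec_expand": ...,
    },
)

# Greedy sampling, since that's all EAGLE supports in vLLM right now.
outputs = llm.generate(["The capital of France is"], SamplingParams(temperature=0.0))
print(outputs[0].outputs[0].text)
```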
So ...
What does it mean for the creator of a speculator model to list proposal methods like this?
Is this a recommendation that tree decoding with those parameters is optimal?
Might there be multiple tree decoding "profiles" with different values? How would a user choose between them?
What if the inference engine does not support that proposal method? Can it fall back to greedy?
What does it mean to specify `draft_tokens = 5`? Is this also a recommendation, i.e. it should be used unless the user specifies their own value?
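To make the fallback question concrete, here's a hypothetical resolution an engine might implement - nothing vLLM does today, and all names below are made up:

```python
# Hypothetical sketch of how an engine might resolve a proposal method
# from a speculator config. None of these names come from vLLM or the
# proposed spec; they only illustrate the fallback question above.
SUPPORTED_METHODS = {"greedy"}  # pretend this engine lacks tree decoding

def resolve_proposal_method(config: dict, user_override: str | None = None) -> str:
    # An explicit user choice wins over anything in the model config.
    if user_override is not None:
        return user_override
    # Otherwise treat default_proposal_method as a recommendation ...
    recommended = config.get("default_proposal_method", "greedy")
    if recommended in SUPPORTED_METHODS:
        return recommended
    # ... and fall back to greedy when the engine can't honor it.
    return "greedy"

print(resolve_proposal_method({"default_proposal_method": "tree"}))  # -> greedy
```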
HTH
"default_proposal_method": "tree", | ||
"verifier": { | ||
"name_or_path": "meta-llama/Llama-3.1-8B-Instruct", | ||
"architectures": ["LlamaForCausalLM"], |
Thinking through the implications of this ...
Is this supposed to override the verifier model's config (assuming they don't match)?
What would the use case for an override be?
I'd imagine it would be quite tricky to make this override work in vLLM - by the time we load this config, we've probably already done some setup of the verifier based on its own config.
If it's not an override, does vLLM just ignore it? That would be a bit weird ...
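A third option, somewhere between override and silent ignore, would be a consistency check. Hypothetical sketch, not existing vLLM behavior; the field names come from the example above:

```python
# Hypothetical sketch: validate that the verifier section of a speculator
# config matches the target model the engine has already loaded. Not
# existing vLLM behavior; it only illustrates a middle ground between
# overriding the verifier config and silently ignoring the mismatch.
def check_verifier_matches(speculator_config: dict, target_name: str,
                           target_architectures: list[str]) -> None:
    verifier = speculator_config.get("verifier", {})
    expected_name = verifier.get("name_or_path")
    if expected_name and expected_name != target_name:
        raise ValueError(
            f"Speculator was trained against {expected_name!r}, "
            f"but the loaded target model is {target_name!r}"
        )
    expected_archs = verifier.get("architectures")
    if expected_archs and expected_archs != target_architectures:
        raise ValueError(
            f"Verifier architectures {expected_archs} do not match "
            f"the target model's {target_architectures}"
        )

check_verifier_matches(
    {"verifier": {"name_or_path": "meta-llama/Llama-3.1-8B-Instruct",
                  "architectures": ["LlamaForCausalLM"]}},
    target_name="meta-llama/Llama-3.1-8B-Instruct",
    target_architectures=["LlamaForCausalLM"],
)  # passes silently
```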
- `"eagle"`, `"eagle_2"`, `"eagle_3"` - Eagle speculator variants based on Transformer architecture for the draft model | ||
- `"hass"` - Similar to Eagle based on the Transformer architecture for the draft model | ||
- `"mlp_speculator"` - Based on a multi-layer perceptron (MLP) architecture for the draft model | ||
- `"specdec"` - An independent speculator model |
It would be super convenient if we could match the existing names used in vLLM
```diff
+ALGORITHMS = [
+    "eagle",
+    "eagle3",
+    "medusa",
+    "mlp_speculator",
+    "deepseek_mtp",
```
Summary
Documentation and examples added to standardize the model serialization formats, their intended compatibility, and the corresponding deserialization formats.
Testing
Self review, PR reviews