
Add documentation for expectations and standards related to model formats/configs #24


Open

wants to merge 5 commits into main

Conversation

markurtz (Member)
Summary

Adds documentation and examples to standardize the model serialization formats, their intended compatibility, and the supported deserialization formats.

Testing

Self review, PR reviews

@markurtz markurtz requested a review from Copilot May 20, 2025 22:01

@Copilot Copilot AI left a comment


Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

📦 Build Artifacts Available
The build artifacts (.whl and .tar.gz) have been successfully generated and are available for download: https://github.com/neuralmagic/speculators/actions/runs/15148722953/artifacts/3163954010.
They will be retained for up to 30 days.


```json
            "depth": 5
        }
    ],
    "default_proposal_method": "tree",
```

Just using this EAGLE example to talk this through ...

Currently, vLLM only supports greedy sampling with EAGLE. See https://github.com/vllm-project/vllm/blob/58738772410c5e0d60b61db39538a9b313d2d7ad/vllm/v1/spec_decode/eagle.py#L182

The discussion in vllm-project/vllm#16899 suggests that if vLLM does support random sampling, it should follow the sampling params of the target model in order to have the best chance of matching its distribution.

Tree decoding is proposed in vllm-project/vllm#17560 - tree depth and num_spec_expand are configurable via --speculative-config

So ...

What does it mean for the creator of a speculator model to list proposal methods like this?

Is this a recommendation that tree decoding with those parameters is optimal?

Might there be multiple tree decoding "profiles" with different values? How would a user choose between them?

What if the inference engine does not support that proposal method? Can they fall back to greedy?

What does it mean to specify draft_tokens = 5? Is this also a recommendation that should be used unless the user specifies their own value?

HTH

```json
    "default_proposal_method": "tree",
    "verifier": {
        "name_or_path": "meta-llama/Llama-3.1-8B-Instruct",
        "architectures": ["LlamaForCausalLM"],
```

Thinking through the implications of this ...

Is this supposed to override the verifier model config (assuming they don't match)?

What would the use case be to override?

I'd imagine it would be quite tricky to make this override work in vLLM - by the time we load this config, we've probably done some setup of the verifier based on its config

If it's not an override, does vLLM just ignore it? That would be a bit weird ...
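The non-override interpretation raised above could be made non-weird by treating the `verifier` section as a compatibility assertion rather than configuration: the engine validates it against the already-loaded target model and warns on mismatch. This is a hedged sketch of that idea, not an actual vLLM or speculators API; the function name and validation policy are assumptions, while the field names follow the example config.

```python
# Hypothetical sketch: validate the speculator's declared "verifier"
# against the target model config the engine has already loaded, instead
# of trying to reconfigure the verifier after the fact.
import warnings


def check_verifier_compat(speculator_cfg: dict, target_cfg: dict) -> bool:
    """Return True if the declared verifier matches the loaded target model."""
    declared = speculator_cfg.get("verifier", {})
    ok = True
    for key in ("name_or_path", "architectures"):
        if key in declared and declared[key] != target_cfg.get(key):
            warnings.warn(
                f"speculator expects verifier {key}={declared[key]!r}, "
                f"but loaded model has {target_cfg.get(key)!r}"
            )
            ok = False
    return ok


speculator = {"verifier": {"name_or_path": "meta-llama/Llama-3.1-8B-Instruct",
                           "architectures": ["LlamaForCausalLM"]}}
target = {"name_or_path": "meta-llama/Llama-3.1-8B-Instruct",
          "architectures": ["LlamaForCausalLM"]}
assert check_verifier_compat(speculator, target)
```

This sidesteps the load-ordering problem: no verifier state is mutated, and a mismatch surfaces as a diagnostic the user can act on.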

- `"eagle"`, `"eagle_2"`, `"eagle_3"` - Eagle speculator variants based on a Transformer architecture for the draft model
- `"hass"` - Similar to Eagle, based on a Transformer architecture for the draft model
- `"mlp_speculator"` - Based on a multi-layer perceptron (MLP) architecture for the draft model
- `"specdec"` - An independent speculator model

It would be super convenient if we could match the existing names used in vLLM

```diff
+ALGORITHMS = [
+    "eagle",
+    "eagle3",
+    "medusa",
+    "mlp_speculator",
+    "deepseek_mtp",
```
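If the names do end up diverging, a translation table would be the fallback. The mapping below is purely illustrative: which speculators variants collapse onto which vLLM algorithm name (e.g. whether `"hass"` or `"eagle_2"` would reuse the `"eagle"` path) is a guess, not a decided convention, and `"specdec"` is deliberately left unmapped.

```python
# Hypothetical mapping from the names proposed in this document to the
# ALGORITHMS list quoted from vLLM above. The specific pairings are
# assumptions for illustration only.
SPECULATORS_TO_VLLM = {
    "eagle": "eagle",
    "eagle_2": "eagle",        # assumed: EAGLE-2 shares vLLM's "eagle" path
    "eagle_3": "eagle3",
    "hass": "eagle",           # assumed: HASS reuses the Eagle implementation
    "mlp_speculator": "mlp_speculator",
}


def to_vllm_name(speculators_name: str) -> str:
    """Translate a speculators config name to its vLLM algorithm name."""
    try:
        return SPECULATORS_TO_VLLM[speculators_name]
    except KeyError:
        raise ValueError(f"no known vLLM equivalent for {speculators_name!r}") from None
```

Matching vLLM's names directly, as suggested above, would make this table unnecessary.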
