[RFC]: AWS Neuron 2.23 NxD Inference with vLLM V0 #15970
- Speculative decoding for Neuron currently doesn't use EAGLEConfig. Change the speculative decoding logic to use EAGLEConfig instead of carrying a hardware-level exception in vllm/config.py.
- Currently, Neuron models fail when current_sequence_length + num_lookahead_slots > max_model_len. For example, with max_model_len of 1024, a speculation length of 7, and a sequence that is currently 1020 tokens long, vLLM does not stop the sequence; the next iteration fails because 1020 + 7 > 1024 and vLLM tries to allocate a new block. We need to ensure this condition does not trigger an assertion error (see the sketch below).
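A minimal sketch of the guard this implies, assuming hypothetical helper names (`seq_len`, `num_lookahead_slots`, `max_model_len` stand in for the corresponding vLLM values); the real fix would live in the Neuron scheduling/worker path:

```python
def capped_lookahead_slots(seq_len: int,
                           num_lookahead_slots: int,
                           max_model_len: int) -> int:
    """Clamp speculation so seq_len + lookahead never exceeds max_model_len.

    Example: max_model_len=1024, speculation length 7, current length 1020
    -> at most 4 lookahead slots (1024 - 1020) may be used; returning 0
    means the sequence should fall back to non-speculative decoding or stop.
    """
    remaining = max_model_len - seq_len
    if remaining <= 0:
        return 0
    return min(num_lookahead_slots, remaining)
```

With a clamp like this, the 1020-token example above would speculate at most 4 tokens instead of hitting the block-allocation assertion.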
Motivation.
AWS Neuron has released the NeuronX Distributed (NxD) Inference library, a PyTorch-based solution that has performance optimizations relevant to AWS Trainium and Inferentia instances. NxD Inference is the path forward for optimized inference on Neuron. The Transformers NeuronX (TNx) library will soon reach the end of support.
This RFC integrates NxD Inference into vLLM and adds minor features to TNx. The integration currently targets vLLM's V0 architecture, with plans to migrate to the V1 architecture.
These changes streamline Neuron serving with vLLM while maintaining compatibility and performance for inference workloads on AWS Trainium and Inferentia.
AWS Neuron is committed to supporting vLLM and is planning an engineering roadmap with deeper integration. We will share the next RFC with the vLLM community for feedback once it’s ready.
We are proposing the following features in this RFC:
Note: The changes will be isolated to Neuron-specific logic and will not impact other platforms.
Proposed Change.
- NeuronX Distributed (NxD) Inference Support
- Speculative Decoding
  - Changes in `neuron_worker.py` to set up speculative decoding if speculation is enabled.
- Dynamic On-device Sampling
  - Pass per-request sampling parameters (`top_k`, `top_p`, `temperature`) and add them to `execute_model()` (see the sketch after this list).
- Multi-modal model support
- Quantized model support (limited to TNx)
- Multi-LoRA Serving
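For the dynamic on-device sampling item above, a rough sketch of collecting per-request sampling parameters and forwarding them alongside the model input; `NeuronSamplingParams`, `collect_sampling_params`, and the `execute_model()` keyword are illustrative assumptions, not vLLM's actual interfaces:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NeuronSamplingParams:
    # Per-request on-device sampling knobs (hypothetical container).
    top_k: int = -1          # -1 means "disabled" in this sketch
    top_p: float = 1.0
    temperature: float = 1.0


def collect_sampling_params(requests) -> List[NeuronSamplingParams]:
    """Gather top_k/top_p/temperature for each request in the batch."""
    return [
        NeuronSamplingParams(
            top_k=r.sampling_params.top_k,
            top_p=r.sampling_params.top_p,
            temperature=r.sampling_params.temperature,
        )
        for r in requests
    ]


# In the worker, the collected parameters would then be forwarded to the
# model runner together with the usual inputs, e.g.:
#   output = model_runner.execute_model(
#       model_input, sampling_params=collect_sampling_params(requests))
```

The idea is simply that sampling parameters become part of the per-step model input so the Neuron device can apply top-k/top-p/temperature on-device instead of using fixed values compiled into the model.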
Feedback Period.
1 week (due on April 9, 2025).
CC List.
Any Other Things.