
[RFC]: AWS Neuron 2.23 NxD Inference with vLLM V0 #15970


Open

mrinalks opened this issue Apr 2, 2025 · 2 comments
mrinalks commented Apr 2, 2025

Motivation.

AWS Neuron has released the NeuronX Distributed (NxD) Inference library, a PyTorch-based library with performance optimizations for AWS Trainium and Inferentia instances. NxD Inference is the path forward for optimized inference on Neuron; the Transformers NeuronX (TNx) library will soon reach end of support.

This RFC integrates NxD Inference into vLLM and adds minor features to TNx. The integration currently targets vLLM’s V0 architecture, with plans to migrate to the V1 architecture.

These changes streamline Neuron serving with vLLM while maintaining compatibility and performance for inference workloads on AWS Trainium and Inferentia.

AWS Neuron is committed to supporting vLLM and is planning an engineering roadmap with deeper integration. We will share the next RFC with the vLLM community for feedback once it’s ready.

This RFC adds the following features:

  1. NeuronX Distributed (NxD) Inference Support
  2. Speculative Decoding
  3. Dynamic On-device Sampling
  4. Quantized Model Support (limited to TNx)
  5. Multi-Modal Support (Llama-3.2)
  6. Multi-LoRA Serving

Note: The changes will be isolated to Neuron-specific logic and will not impact other platforms.

Proposed Change.

  1. NeuronX Distributed (NxD) Inference Support

    1. Allow customers to select a framework based on preference or availability. Default to neuronx-distributed-inference (NxD); if unavailable, fall back to transformers-neuronx (TNx).
    2. Support inference using NxD by adding worker/neuronx_distributed_model_runner.py.
    3. Add a framework detection utility that returns the framework currently in use (see the sketch after this list).
  2. Speculative Decoding

    1. To enable speculative decoding with NxD, we added worker/multi_step_neuronx_distributed_model_runner.py.
    2. To enable speculative decoding with TNx, we added worker/multi_step_neuron_model_runner.py. This model runner is selected in neuron_worker.py when speculation is enabled.
  3. Dynamic On-device Sampling

    1. Extract the sampling params (top_k, top_p, temperature) and add them to execute_model().
  4. Multi-modal model support

    1. Add support for Mllama (Llama-3.2) multi-modal models.
  5. Quantized model support (limited to TNx)

    1. Support INT8 and FP8 quantizations
  6. Multi-LoRA Serving

    1. Allow loading and using LoRA adapters with NxD.
    2. Only loading of LoRA adapters at server startup is supported; dynamic LoRA loading will be added along with V1 support.
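
For illustration only, here is a minimal sketch of the framework detection and fallback described in item 1. The function, enum, and module names are assumptions for this example, not the actual vLLM utility:

```python
# Hedged sketch: prefer NxD Inference, fall back to TNx.
# Module names (neuronx_distributed_inference, transformers_neuronx) are the
# assumed import names of the respective pip packages.
from enum import Enum
from importlib.util import find_spec


class NeuronFramework(Enum):
    NXD_INFERENCE = "neuronx-distributed-inference"
    TRANSFORMERS_NEURONX = "transformers-neuronx"


def detect_neuron_framework() -> NeuronFramework:
    """Return the Neuron framework to use, preferring NxD Inference."""
    if find_spec("neuronx_distributed_inference") is not None:
        return NeuronFramework.NXD_INFERENCE
    if find_spec("transformers_neuronx") is not None:
        return NeuronFramework.TRANSFORMERS_NEURONX
    raise RuntimeError(
        "Neither neuronx-distributed-inference nor transformers-neuronx is "
        "installed; install one of them to run vLLM on Neuron.")
```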

Feedback Period.

1 week (due on April 9, 2025).

CC List.

Any Other Things.

  • The RFC focuses on the V0 architecture and does not implement V1 support for Neuron. V1 architecture support is being actively planned and will be shared in a separate RFC.
  • The RFC introduces significant code changes to Neuron-related paths, which are organized into feature-specific PRs to streamline the review process.

@aws-satyajith
Contributor

Speculative decoding for Neuron currently doesn't use EAGLEConfig. Change the speculative decoding logic to use EAGLEConfig instead of carrying a hardware-level exception in vllm/config.py.

@aws-satyajith
Contributor

Currently, Neuron models fail when current_sequence_length + num_lookahead_slots > max_model_len.

For example, say max_model_len is 1024, the speculation length is 7, and the current sequence is 1020 tokens long. vLLM does not stop the sequence, but the next iteration fails: because 1020 + 7 > 1024, vLLM tries to allocate a new block. That won't work because, without block-based KV cache management, there are only max-seq-num blocks, i.e. each sequence has exactly one block.

We need to mitigate this by ensuring that we don't fail with an assertion error when this condition is hit.
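
One possible shape of the mitigation, sketched under the assumption that the scheduler can simply cap the lookahead slots per sequence (the function name here is hypothetical, not existing vLLM code):

```python
# Hedged sketch: clamp speculative (lookahead) tokens so a sequence never
# exceeds max_model_len, instead of failing block allocation with an assertion.
def usable_lookahead_slots(current_sequence_length: int,
                           num_lookahead_slots: int,
                           max_model_len: int) -> int:
    """Return how many lookahead slots fit before hitting max_model_len."""
    remaining = max_model_len - current_sequence_length
    return max(0, min(num_lookahead_slots, remaining))


# Example from the comment above: max_model_len=1024, speculation length 7,
# sequence already at 1020 tokens -> only 4 lookahead slots remain usable.
assert usable_lookahead_slots(1020, 7, 1024) == 4
```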
