:::{figure} ./assets/logos/vllm-logo-text-light.png
:align: center
:alt: vLLM
:class: no-scaled-link
:width: 60%
:::

:::{raw} html
<p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone</strong>
</p>

<p style="text-align:center">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>
:::

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill
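
Several of the performance features above are exposed as engine arguments in the Python API. The snippet below is a minimal sketch, assuming an AWQ-quantized checkpoint; the model name and argument values are illustrative, not tuned recommendations:

```python
from vllm import LLM

# Illustrative engine arguments (values are assumptions, not tuned defaults).
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",   # assumed AWQ-quantized Hugging Face checkpoint
    quantization="awq",                # load AWQ-quantized weights
    enable_chunked_prefill=True,       # split long prefills into smaller chunks
    gpu_memory_utilization=0.90,       # fraction of GPU memory for weights and KV cache
)
```
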
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPU, and AWS Trainium and Inferentia accelerators
- Prefix caching support
- Multi-LoRA support
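
As a taste of the Python API, here is a minimal offline-inference sketch mirroring the quickstart; the model name and sampling values are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load a Hugging Face model; batching and KV-cache management (PagedAttention)
# are handled by the engine.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```

The same model can also be exposed over HTTP through the OpenAI-compatible server (`vllm serve <model>`), covered in the Inference and Serving section below.
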
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
- vLLM Meetups
% How to start using vLLM?
:::{toctree}
:caption: Getting Started
:maxdepth: 1

getting_started/installation/index
getting_started/quickstart
getting_started/examples/examples_index
getting_started/troubleshooting
getting_started/faq
:::
% What does vLLM support?
:::{toctree}
:caption: Models
:maxdepth: 1

models/generative_models
models/pooling_models
models/supported_models
models/extensions/index
:::
% Additional capabilities
:::{toctree}
:caption: Features
:maxdepth: 1

features/quantization/index
features/lora
features/tool_calling
features/reasoning_outputs
features/structured_outputs
features/automatic_prefix_caching
features/disagg_prefill
features/spec_decode
features/compatibility_matrix
:::
% Details about running vLLM
:::{toctree}
:caption: Inference and Serving
:maxdepth: 1

serving/offline_inference
serving/openai_compatible_server
serving/multimodal_inputs
serving/distributed_serving
serving/metrics
serving/engine_args
serving/env_vars
serving/usage_stats
serving/integrations/index
:::
% Scaling up vLLM for production
:::{toctree}
:caption: Deployment
:maxdepth: 1

deployment/docker
deployment/k8s
deployment/nginx
deployment/frameworks/index
deployment/integrations/index
:::
% Making the most out of vLLM
:::{toctree}
:caption: Performance
:maxdepth: 1

performance/optimization
performance/benchmarks
:::
% Explanation of vLLM internals
:::{toctree}
:caption: Design Documents
:maxdepth: 2

design/arch_overview
design/huggingface_integration
design/plugin_system
design/kernel/paged_attention
design/mm_processing
design/automatic_prefix_caching
design/multiprocessing
:::
:::{toctree}
:caption: V1 Design Documents
:maxdepth: 2

design/v1/prefix_caching
:::
% How to contribute to the vLLM project
:::{toctree}
:caption: Developer Guide
:maxdepth: 2

contributing/overview
contributing/profiling/profiling_index
contributing/dockerfile/dockerfile
contributing/model/index
contributing/vulnerability_management
:::
% Technical API specifications
:::{toctree}
:caption: API Reference
:maxdepth: 2

api/offline_inference/index
api/engine/index
api/inference_params
api/multimodal/index
api/model/index
:::
% Latest news and acknowledgements
:::{toctree}
:caption: Community
:maxdepth: 1

community/blog
community/meetups
community/sponsors
:::
- {ref}`genindex`
- {ref}`modindex`