:::{figure} ./assets/logos/vllm-logo-text-light.png
:align: center
:alt: vLLM
:class: no-scaled-link
:width: 60%
:::

:::{raw} html
<p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone</strong>
</p>

<p style="text-align:center">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>
:::

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill
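
Several of the performance features above are exposed as engine arguments in the Python API. The snippet below is a minimal sketch, assuming an AWQ-quantized checkpoint; the model name and argument values are illustrative, not tuned recommendations:

```python
from vllm import LLM

# Illustrative engine arguments (values are assumptions, not tuned defaults).
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",   # assumed AWQ-quantized Hugging Face checkpoint
    quantization="awq",                # load AWQ-quantized weights
    enable_chunked_prefill=True,       # split long prefills into smaller chunks
    gpu_memory_utilization=0.90,       # fraction of GPU memory for weights and KV cache
)
```
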
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPU, and AWS Trainium and Inferentia accelerators
- Prefix caching support
- Multi-LoRA support
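
As a taste of the Python API, here is a minimal offline-inference sketch mirroring the quickstart; the model name and sampling values are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load a Hugging Face model; batching and KV-cache management (PagedAttention)
# are handled by the engine.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```

The same model can also be exposed over HTTP through the OpenAI-compatible server (`vllm serve <model>`), covered in the Inference and Serving section below.
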
For more information, check out the following:
- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
- vLLM Meetups
% How to start using vLLM?
:::{toctree}
:caption: Getting Started
:maxdepth: 1

getting_started/installation/index
getting_started/quickstart
getting_started/examples/examples_index
getting_started/troubleshooting
getting_started/faq
:::
% What does vLLM support?
:::{toctree}
:caption: Models
:maxdepth: 1

models/generative_models
models/pooling_models
models/supported_models
models/extensions/index
:::
% Additional capabilities
:::{toctree}
:caption: Features
:maxdepth: 1

features/quantization/index
features/lora
features/tool_calling
features/reasoning_outputs
features/structured_outputs
features/automatic_prefix_caching
features/disagg_prefill
features/spec_decode
features/compatibility_matrix
:::
% Details about running vLLM
:::{toctree}
:caption: Inference and Serving
:maxdepth: 1

serving/offline_inference
serving/openai_compatible_server
serving/multimodal_inputs
serving/distributed_serving
serving/metrics
serving/engine_args
serving/env_vars
serving/usage_stats
serving/integrations/index
:::
% Scaling up vLLM for production
:::{toctree}
:caption: Deployment
:maxdepth: 1

deployment/docker
deployment/k8s
deployment/nginx
deployment/frameworks/index
deployment/integrations/index
:::
% Making the most out of vLLM
:::{toctree}
:caption: Performance
:maxdepth: 1

performance/optimization
performance/benchmarks
:::
% Explanation of vLLM internals
:::{toctree}
:caption: Design Documents
:maxdepth: 2

design/arch_overview
design/huggingface_integration
design/plugin_system
design/kernel/paged_attention
design/mm_processing
design/automatic_prefix_caching
design/multiprocessing
:::
:::{toctree}
:caption: V1 Design Documents
:maxdepth: 2

design/v1/prefix_caching
:::
% How to contribute to the vLLM project
:::{toctree}
:caption: Developer Guide
:maxdepth: 2

contributing/overview
contributing/profiling/profiling_index
contributing/dockerfile/dockerfile
contributing/model/index
contributing/vulnerability_management
:::
% Technical API specifications
:::{toctree}
:caption: API Reference
:maxdepth: 2

api/offline_inference/index
api/engine/index
api/inference_params
api/multimodal/index
api/model/index
:::
% Latest news and acknowledgements
:::{toctree}
:caption: Community
:maxdepth: 1

community/blog
community/meetups
community/sponsors
:::
- {ref}`genindex`
- {ref}`modindex`