# Welcome to vLLM

:::{figure} ./assets/logos/vllm-logo-text-light.png
:align: center
:alt: vLLM
:class: no-scaled-link
:width: 60%
:::

:::{raw} html
<p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone</strong>
</p>
<p style="text-align:center">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<!-- GitHub Star / Watch / Fork buttons -->
</p>
:::

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8 (a configuration sketch follows this list)
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill
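
Most of these performance features are toggled through engine arguments rather than code changes. Below is a minimal sketch, assuming the `vllm` Python package is installed; the model name is only a placeholder for any AWQ-quantized checkpoint you actually use:

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # placeholder HuggingFace model ID
    quantization="awq",               # load the AWQ-quantized weights
    enable_chunked_prefill=True,      # split long prompt prefills into chunks
                                      # that can batch with decode requests
    gpu_memory_utilization=0.90,      # fraction of GPU memory vLLM may use
                                      # (weights, activations, KV-cache blocks)
)
```

Speculative decoding and the other features listed above are configured through engine arguments in the same way; the Features section linked below documents the exact options.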

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models (see the offline-inference sketch after this list)
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators
  • Prefix caching support
  • Multi-LoRA support
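
As a quick illustration of how these pieces fit together, here is a minimal offline-inference sketch; the prompts and the `facebook/opt-125m` checkpoint are only examples, and any supported HuggingFace model can be substituted:

```python
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "In one sentence, explain what PagedAttention does:",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Downloads the weights from the HuggingFace Hub on first use.
llm = LLM(model="facebook/opt-125m")

# Continuous batching schedules all prompts together.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The same model can also be exposed through the OpenAI-compatible server (for example with `vllm serve facebook/opt-125m`) and queried by any OpenAI client pointed at `http://localhost:8000/v1`; see the Inference and Serving section below for details.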

For more information, check out the following:

## Documentation

% How to start using vLLM?

:::{toctree}
:caption: Getting Started
:maxdepth: 1

getting_started/installation/index
getting_started/quickstart
getting_started/examples/examples_index
getting_started/troubleshooting
getting_started/faq
:::

% What does vLLM support?

:::{toctree}
:caption: Models
:maxdepth: 1

models/generative_models
models/pooling_models
models/supported_models
models/extensions/index
:::

% Additional capabilities

:::{toctree}
:caption: Features
:maxdepth: 1

features/quantization/index
features/lora
features/tool_calling
features/reasoning_outputs
features/structured_outputs
features/automatic_prefix_caching
features/disagg_prefill
features/spec_decode
features/compatibility_matrix
:::

% Details about running vLLM

:::{toctree}
:caption: Inference and Serving
:maxdepth: 1

serving/offline_inference
serving/openai_compatible_server
serving/multimodal_inputs
serving/distributed_serving
serving/metrics
serving/engine_args
serving/env_vars
serving/usage_stats
serving/integrations/index
:::

% Scaling up vLLM for production

:::{toctree}
:caption: Deployment
:maxdepth: 1

deployment/docker
deployment/k8s
deployment/nginx
deployment/frameworks/index
deployment/integrations/index
:::

% Making the most out of vLLM

:::{toctree}
:caption: Performance
:maxdepth: 1

performance/optimization
performance/benchmarks
:::

% Explanation of vLLM internals

:::{toctree}
:caption: Design Documents
:maxdepth: 2

design/arch_overview
design/huggingface_integration
design/plugin_system
design/kernel/paged_attention
design/mm_processing
design/automatic_prefix_caching
design/multiprocessing
:::

:::{toctree}
:caption: V1 Design Documents
:maxdepth: 2

design/v1/prefix_caching
:::

% How to contribute to the vLLM project

:::{toctree}
:caption: Developer Guide
:maxdepth: 2

contributing/overview
contributing/profiling/profiling_index
contributing/dockerfile/dockerfile
contributing/model/index
contributing/vulnerability_management
:::

% Technical API specifications

:::{toctree}
:caption: API Reference
:maxdepth: 2

api/offline_inference/index
api/engine/index
api/inference_params
api/multimodal/index
api/model/index
:::

% Latest news and acknowledgements

:::{toctree}
:caption: Community
:maxdepth: 1

community/blog
community/meetups
community/sponsors
:::

## Indices and tables

  • {ref}`genindex`
  • {ref}`modindex`