
Commit 479b843

Authored by tjtanaa, ywang96, rafvasq, Isotr0py, and DarkLight1337
[MFM-2025-02-03] Merge Main to llama fp8; With Faster ROCm Paged Attention (#399)
* [V1] Avoid sending text prompt to core engine (vllm-project#11963) Signed-off-by: Roger Wang <[email protected]> * [CI/Build] Add markdown linter (vllm-project#11857) Signed-off-by: Rafael Vasquez <[email protected]> * [Model] Initialize support for Deepseek-VL2 models (vllm-project#11578) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Hardware][CPU] Multi-LoRA implementation for the CPU backend (vllm-project#11100) Signed-off-by: Akshat Tripathi <[email protected]> Signed-off-by: Oleg Mosalov <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Oleg Mosalov <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [Hardware][TPU] workaround fix for MoE on TPU (vllm-project#11764) * [V1][Core][1/n] Logging and Metrics (vllm-project#11962) Signed-off-by: [email protected] <[email protected]> * [Model] Support GGUF models newly added in `transformers` 4.46.0 (vllm-project#9685) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (vllm-project#11973) Signed-off-by: [email protected] <[email protected]> * [MISC] fix typo in kv transfer send recv test (vllm-project#11983) * [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (vllm-project#11979) * [CI][Spec Decode] fix: broken test for EAGLE model (vllm-project#11972) Signed-off-by: Sungjae Lee <[email protected]> * [Misc] Fix Deepseek V2 fp8 kv-scale remapping (vllm-project#11947) Signed-off-by: Yida Wu <[email protected]> * [Misc]Minor Changes about Worker (vllm-project#11555) Signed-off-by: Chenguang Li <[email protected]> * [platform] add ray_device_key (vllm-project#11948) Signed-off-by: youkaichao <[email protected]> * Fix Max Token ID for Qwen-VL-Chat (vllm-project#11980) Signed-off-by: Alex-Brooks <[email protected]> * [Kernel] unified_attention for Attention.forward (vllm-project#11967) Signed-off-by: Chen Zhang <[email protected]> * [Doc][V1] Update model implementation guide for V1 support (vllm-project#11998) Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Doc] Organise installation documentation into categories and tabs (vllm-project#11935) Signed-off-by: Harry Mellor <[email protected]> * [platform] add device_control env var (vllm-project#12009) Signed-off-by: youkaichao <[email protected]> * [Platform] Move get_punica_wrapper() function to Platform (vllm-project#11516) Signed-off-by: Shanshan Shen <[email protected]> * bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (vllm-project#11982) Signed-off-by: elijah <[email protected]> * [Doc] Fix build from source and installation link in README.md (vllm-project#12013) Signed-off-by: Yikun <[email protected]> * Using list * [Bugfix] Fix deepseekv3 gate bias error (vllm-project#12002) Signed-off-by: mgoin <[email protected]> Co-authored-by: mgoin <[email protected]> * Revert "[misc] improve memory profiling (vllm-project#11809)" This reverts commit 889e662. * Multi-lingual P3L (#356) * Commiting the *multilingual* P3L test. * Created a *multi-lingual* P3L test. * Making ruff happy. * . * Added a reference to the language-scripture Confluence table. * Typo fixing. * Harmonizing naming. * Fixing comments in the header. --------- Co-authored-by: Alexei V. 
Ivanov <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> * Trying to make scales work with compileable attention * [Docs] Add Sky Computing Lab to project intro (vllm-project#12019) Signed-off-by: Woosuk Kwon <[email protected]> * [HPU][Bugfix] set_forward_context and CI test execution (vllm-project#12014) Signed-off-by: Konrad Zawora <[email protected]> * [Doc] Update Quantization Hardware Support Documentation (vllm-project#12025) Signed-off-by: tjtanaa <[email protected]> Co-authored-by: tjtanaa <[email protected]> * [HPU][misc] add comments for explanation (vllm-project#12034) Signed-off-by: youkaichao <[email protected]> * [Bugfix] Fix various bugs in multi-modal processor (vllm-project#12031) Signed-off-by: DarkLight1337 <[email protected]> * [Kernel] Revert the API change of Attention.forward (vllm-project#12038) Signed-off-by: Chen Zhang <[email protected]> * [Platform] Add output for Attention Backend (vllm-project#11981) Signed-off-by: wangxiyuan <[email protected]> * [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (vllm-project#12040) Signed-off-by: Chen Zhang <[email protected]> * Explain where the engine args go when using Docker (vllm-project#12041) Signed-off-by: Harry Mellor <[email protected]> * Docs lint * [Doc]: Update the Json Example of the `Engine Arguments` document (vllm-project#12045) * [Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping (vllm-project#11924) Signed-off-by: Jee Jee Li <[email protected]> * [Kernel] Support MulAndSilu (vllm-project#11624) Signed-off-by: Jee Jee Li <[email protected]> * [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py (vllm-project#12046) Signed-off-by: Konrad Zawora <[email protected]> * [Platform] move current_memory_usage() into platform (vllm-project#11369) Signed-off-by: Shanshan Shen <[email protected]> * [V1][BugFix] Fix edge case in VLM scheduling (vllm-project#12065) Signed-off-by: Woosuk Kwon <[email protected]> * [Misc] Add multipstep chunked-prefill support for FlashInfer (vllm-project#10467) * [core] Turn off GPU communication overlap for Ray executor (vllm-project#12051) Signed-off-by: Rui Qiao <[email protected]> * [core] platform agnostic executor via collective_rpc (vllm-project#11256) Signed-off-by: youkaichao <[email protected]> * [Doc] Update examples to remove SparseAutoModelForCausalLM (vllm-project#12062) Signed-off-by: Kyle Sayers <[email protected]> * [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (vllm-project#12003) * Fix: cases with empty sparsity config (vllm-project#12057) Signed-off-by: Rahul Tuli <[email protected]> * Type-fix: make execute_model output type optional (vllm-project#12020) * [Platform] Do not raise error if _Backend is not found (vllm-project#12023) Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: Mengqing Cao <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> * [Model]: Support internlm3 (vllm-project#12037) * Misc: allow to use proxy in `HTTPConnection` (vllm-project#12042) Signed-off-by: Yuan Zhou <[email protected]> * [Misc][Quark] Upstream Quark format to VLLM (vllm-project#10765) Signed-off-by: kewang-xlnx <[email protected]> Signed-off-by: kewang2 <[email protected]> Co-authored-by: kewang2 <[email protected]> Co-authored-by: Michael Goin <[email protected]> * [Doc]: Update `OpenAI-Compatible Server` documents (vllm-project#12082) * [Bugfix] use right truncation for non-generative tasks (vllm-project#12050) Signed-off-by: Joe Runde 
<[email protected]> * [V1][Core] Autotune encoder cache budget (vllm-project#11895) Signed-off-by: Roger Wang <[email protected]> * [Bugfix] Fix _get_lora_device for HQQ marlin (vllm-project#12090) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * Allow hip sources to be directly included when compiling for rocm. (vllm-project#12087) * [Core] Default to using per_token quantization for fp8 when cutlass is supported. (vllm-project#8651) Signed-off-by: mgoin <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: mgoin <[email protected]> * [Doc] Add documentation for specifying model architecture (vllm-project#12105) * Various cosmetic/comment fixes (vllm-project#12089) Signed-off-by: mgoin <[email protected]> * [Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (vllm-project#12067) Signed-off-by: Isotr0py <[email protected]> * Support torchrun and SPMD-style offline inference (vllm-project#12071) Signed-off-by: youkaichao <[email protected]> * [core] LLM.collective_rpc interface and RLHF example (vllm-project#12084) Signed-off-by: youkaichao <[email protected]> * [Bugfix] Fix max image feature size for Llava-one-vision (vllm-project#12104) Signed-off-by: Roger Wang <[email protected]> * Enable user marker for vllm profiling (#357) * Enable user marker for vllm profiling --------- Co-authored-by: Gregory Shtrasberg <[email protected]> * [misc] Add LoRA kernel micro benchmarks (vllm-project#11579) * [Model] Add support for deepseek-vl2-tiny model (vllm-project#12068) Signed-off-by: Isotr0py <[email protected]> * Deepseek V3 support (#364) * Changing the hard coded datatype to see if it's enough for the model to work * Picking the upstrteam moe kernel version * make upstream fix for v3 also works for rocm v2 * Conditional fnuz dtype * Requantizing from fn to fnuz * Requantizing moe as well * Actually requantizing moe weights * Conditional requantization and assert on padding in block quant * Format --------- Co-authored-by: charlifu <[email protected]> * [Bugfix] Set enforce_eager automatically for mllama (vllm-project#12127) Signed-off-by: Chen Zhang <[email protected]> * [Bugfix] Fix a path bug in disaggregated prefill example script. 
(vllm-project#12121) Signed-off-by: Kuntai Du <[email protected]> * [CI]add genai-perf benchmark in nightly benchmark (vllm-project#10704) Signed-off-by: Kunshang Ji <[email protected]> * [Doc] Add instructions on using Podman when SELinux is active (vllm-project#12136) Signed-off-by: Yuan Tang <[email protected]> * [Bugfix] Fix issues in CPU build Dockerfile (vllm-project#12135) Signed-off-by: Yuan Tang <[email protected]> * [BugFix] add more `is not None` check in VllmConfig.__post_init__ (vllm-project#12138) Signed-off-by: Chen Zhang <[email protected]> * [Misc] Add deepseek_vl2 chat template (vllm-project#12143) Signed-off-by: Isotr0py <[email protected]> * [ROCm][MoE] moe tuning support for rocm (vllm-project#12049) Signed-off-by: Divakar Verma <[email protected]> * [V1] Move more control of kv cache initialization from model_executor to EngineCore (vllm-project#11960) Signed-off-by: Chen Zhang <[email protected]> Co-authored-by: Cody Yu <[email protected]> * [Misc][LoRA] Improve the readability of LoRA error messages (vllm-project#12102) Signed-off-by: Jee Jee Li <[email protected]> * [CI/Build][CPU][Bugfix] Fix CPU CI (vllm-project#12150) Signed-off-by: jiang1.li <[email protected]> * [core] allow callable in collective_rpc (vllm-project#12151) Signed-off-by: youkaichao <[email protected]> * [Bugfix] Fix score api for missing max_model_len validation (vllm-project#12119) Signed-off-by: Wallas Santos <[email protected]> * [Bugfix] Mistral tokenizer encode accept list of str (vllm-project#12149) Signed-off-by: Kunshang Ji <[email protected]> * [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (vllm-project#12134) Signed-off-by: Gregory Shtrasberg <[email protected]> * [torch.compile] disable logging when cache is disabled (vllm-project#12043) Signed-off-by: youkaichao <[email protected]> * [misc] fix cross-node TP (vllm-project#12166) Signed-off-by: youkaichao <[email protected]> * [AMD][CI/Build][Bugfix] use pytorch stale wheel (vllm-project#12172) Signed-off-by: hongxyan <[email protected]> * [core] further polish memory profiling (vllm-project#12126) Signed-off-by: youkaichao <[email protected]> * [Docs] Fix broken link in SECURITY.md (vllm-project#12175) Signed-off-by: Russell Bryant <[email protected]> * [Model] Port deepseek-vl2 processor, remove dependency (vllm-project#12169) Signed-off-by: Isotr0py <[email protected]> * [core] clean up executor class hierarchy between v1 and v0 (vllm-project#12171) Signed-off-by: youkaichao <[email protected]> * [Misc] Support register quantization method out-of-tree (vllm-project#11969) * [V1] Collect env var for usage stats (vllm-project#12115) * [BUGFIX] Move scores to float32 in case of running xgrammar on cpu (vllm-project#12152) Signed-off-by: Michal Adamczyk <[email protected]> * [Bugfix] Fix multi-modal processors for transformers 4.48 (vllm-project#12187) * [torch.compile] store inductor compiled Python file (vllm-project#12182) Signed-off-by: youkaichao <[email protected]> * benchmark_serving support --served-model-name param (vllm-project#12109) Signed-off-by: zibai <[email protected]> Co-authored-by: Roger Wang <[email protected]> * [Misc] Add BNB support to GLM4-V model (vllm-project#12184) Signed-off-by: Isotr0py <[email protected]> * [V1] Add V1 support of Qwen2-VL (vllm-project#12128) Signed-off-by: Roger Wang <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: imkero <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> * [Model] Support for fairseq2 Llama 
(vllm-project#11442) Signed-off-by: Martin Gleize <[email protected]> Co-authored-by: mgleize user <[email protected]> * [Bugfix] Fix num_heads value for simple connector when tp enabled (vllm-project#12074) Signed-off-by: Shangming Cai <[email protected]> * [torch.compile] fix sym_tensor_indices (vllm-project#12191) Signed-off-by: youkaichao <[email protected]> * Move linting to `pre-commit` (vllm-project#11975) Signed-off-by: Harry Mellor <[email protected]> * [DOC] Fix typo in docstring and assert message (vllm-project#12194) Signed-off-by: Yuan Tang <[email protected]> * [DOC] Add missing docstring in LLMEngine.add_request() (vllm-project#12195) Signed-off-by: Yuan Tang <[email protected]> * [Bugfix] Fix incorrect types in LayerwiseProfileResults (vllm-project#12196) Signed-off-by: Yuan Tang <[email protected]> * [Model] Add Qwen2 PRM model support (vllm-project#12202) Signed-off-by: Isotr0py <[email protected]> * [Core] Interface for accessing model from `VllmRunner` (vllm-project#10353) Signed-off-by: DarkLight1337 <[email protected]> * [misc] add placeholder format.sh (vllm-project#12206) Signed-off-by: youkaichao <[email protected]> * [CI/Build] Remove dummy CI steps (vllm-project#12208) Signed-off-by: DarkLight1337 <[email protected]> * [CI/Build] Make pre-commit faster (vllm-project#12212) Signed-off-by: DarkLight1337 <[email protected]> * [Model] Upgrade Aria to transformers 4.48 (vllm-project#12203) Signed-off-by: DarkLight1337 <[email protected]> * [misc] print a message to suggest how to bypass commit hooks (vllm-project#12217) Signed-off-by: youkaichao <[email protected]> * [core][bugfix] configure env var during import vllm (vllm-project#12209) Signed-off-by: youkaichao <[email protected]> * [V1] Remove `_get_cache_block_size` (vllm-project#12214) Signed-off-by: Chen Zhang <[email protected]> * [Misc] Pass `attention` to impl backend (vllm-project#12218) Signed-off-by: wangxiyuan <[email protected]> * [Bugfix] Fix `HfExampleModels.find_hf_info` (vllm-project#12223) Signed-off-by: DarkLight1337 <[email protected]> * [CI] Pass local python version explicitly to pre-commit mypy.sh (vllm-project#12224) Signed-off-by: Chen Zhang <[email protected]> * Using ROCm6.3.1 base docker and building hipblas-common (#366) * [Misc] Update CODEOWNERS (vllm-project#12229) * fix: update platform detection for M-series arm based MacBook processors (vllm-project#12227) Signed-off-by: isikhi <[email protected]> * [misc] add cuda runtime version to usage data (vllm-project#12190) Signed-off-by: youkaichao <[email protected]> Co-authored-by: Roger Wang <[email protected]> * [bugfix] catch xgrammar unsupported array constraints (vllm-project#12210) Signed-off-by: Jason Cheng <[email protected]> * [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. 
DeepSeek-V3) (vllm-project#12222) Signed-off-by: Jinzhen Lin <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> * Add quantization and guided decoding CODEOWNERS (vllm-project#12228) Signed-off-by: mgoin <[email protected]> * [AMD][Build] Porting dockerfiles from the ROCm/vllm fork (vllm-project#11777) Signed-off-by: Gregory Shtrasberg <[email protected]> * [BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (vllm-project#12230) Signed-off-by: NickLucche <[email protected]> * [ci/build] disable failed and flaky tests (vllm-project#12240) Signed-off-by: youkaichao <[email protected]> * [Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (vllm-project#12244) Signed-off-by: DarkLight1337 <[email protected]> * [Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (vllm-project#12237) Signed-off-by: Jee Jee Li <[email protected]> * [Misc] Remove redundant TypeVar from base model (vllm-project#12248) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Fix mm_limits access for merged multi-modal processor (vllm-project#12252) Signed-off-by: DarkLight1337 <[email protected]> * [torch.compile] transparent compilation with more logging (vllm-project#12246) Signed-off-by: youkaichao <[email protected]> * [V1][Bugfix] Fix data item ordering in mixed-modality inference (vllm-project#12259) Signed-off-by: Roger Wang <[email protected]> * Remove pytorch comments for outlines + compressed-tensors (vllm-project#12260) Signed-off-by: Thomas Parnell <[email protected]> * [Platform] improve platforms getattr (vllm-project#12264) Signed-off-by: Mengqing Cao <[email protected]> * [ci/build] update nightly torch for gh200 test (vllm-project#12270) Signed-off-by: youkaichao <[email protected]> * [Bugfix] fix race condition that leads to wrong order of token returned (vllm-project#10802) Signed-off-by: Jannis Schönleber <[email protected]> * [Kernel] fix moe_align_block_size error condition (vllm-project#12239) Signed-off-by: Jinzhen Lin <[email protected]> * [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (vllm-project#10907) Signed-off-by: rickyx <[email protected]> * [Bugfix] Multi-sequence broken (vllm-project#11898) Signed-off-by: Andy Lo <[email protected]> * [Misc] Remove experimental dep from tracing.py (vllm-project#12007) Signed-off-by: Adrian Cole <[email protected]> * [Misc] Set default backend to SDPA for get_vit_attn_backend (vllm-project#12235) Signed-off-by: wangxiyuan <[email protected]> * [Core] Free CPU pinned memory on environment cleanup (vllm-project#10477) * Update pre-commit.yml (#374) * Update pre-commit.yml * Reapplying missing format * New codespell exclude location --------- Co-authored-by: Kevin H. Luu <[email protected]> * [bugfix] moe tuning. 
rm is_navi() (vllm-project#12273) Signed-off-by: Divakar Verma <[email protected]> * [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes (vllm-project#12277) Signed-off-by: maleksan85 <[email protected]> Co-authored-by: maleksan85 <[email protected]> * [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose (vllm-project#12281) Signed-off-by: Hongxia Yang <[email protected]> * [VLM] Simplify post-processing of replacement info (vllm-project#12269) Signed-off-by: DarkLight1337 <[email protected]> * [ci/lint] Add back default arg for pre-commit (vllm-project#12279) Signed-off-by: kevin <[email protected]> * [CI] add docker volume prune to neuron CI (vllm-project#12291) Signed-off-by: Liangfu Chen <[email protected]> * [Ci/Build] Fix mypy errors on main (vllm-project#12296) Signed-off-by: DarkLight1337 <[email protected]> * [Benchmark] More accurate TPOT calc in `benchmark_serving.py` (vllm-project#12288) Signed-off-by: Nick Hill <[email protected]> * [core] separate builder init and builder prepare for each batch (vllm-project#12253) Signed-off-by: youkaichao <[email protected]> * [Build] update requirements of no-device (vllm-project#12299) Signed-off-by: Mengqing Cao <[email protected]> * [Core] Support fully transparent sleep mode (vllm-project#11743) Signed-off-by: youkaichao <[email protected]> * [VLM] Avoid unnecessary tokenization (vllm-project#12310) Signed-off-by: DarkLight1337 <[email protected]> * [Model][Bugfix]: correct Aria model output (vllm-project#12309) Signed-off-by: xffxff <[email protected]> * [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 (vllm-project#12313) Signed-off-by: Roger Wang <[email protected]> * [Doc] Add docs for prompt replacement (vllm-project#12318) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Fix the error in the tip for the --lora-modules parameter (vllm-project#12319) Signed-off-by: wangerxiao <[email protected]> * [Misc] Improve the readability of BNB error messages (vllm-project#12320) Signed-off-by: Jee Jee Li <[email protected]> * Skip tokenize/detokenize when it is disabled by arg --skip-tokenizer-init (#367) * switching detokenize flag to be False * detokenize = False for benchmarks * restoring default in main vllm code for detokenize * removing extra spaces * moving detokenize to flag * adding support for token ids --------- Co-authored-by: maleksan85 <[email protected]> * [Bugfix] Fix HPU multiprocessing executor (vllm-project#12167) Signed-off-by: Konrad Zawora <[email protected]> * [Core] Support `reset_prefix_cache` (vllm-project#12284) * [Frontend][V1] Online serving performance improvements (vllm-project#12287) * [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (vllm-project#12282) Signed-off-by: Randall Smith <[email protected]> * FP8 FA fixes (#381) * FP8 FA fixes Summary: Add missing clamp and fix reciprocal scale computation. * linter * Returning the use of the proper stream in allreduce (#382) * [Bugfix] Fixing AMD LoRA CI test. (vllm-project#12329) Signed-off-by: Alexei V. 
Ivanov <[email protected]> * [Docs] Update FP8 KV Cache documentation (vllm-project#12238) Signed-off-by: mgoin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Docs] Document vulnerability disclosure process (vllm-project#12326) Signed-off-by: Russell Bryant <[email protected]> * [V1] Add `uncache_blocks` (vllm-project#12333) * [doc] explain common errors around torch.compile (vllm-project#12340) Signed-off-by: youkaichao <[email protected]> * [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update (vllm-project#12338) Signed-off-by: zhenwei <[email protected]> * [Bugfix] Fix k_proj's bias for whisper self attention (vllm-project#12342) Signed-off-by: Isotr0py <[email protected]> * [Kernel] Flash Attention 3 Support (vllm-project#12093) Signed-off-by: Lucas Wilkinson <[email protected]> * [Doc] Troubleshooting errors during model inspection (vllm-project#12351) Signed-off-by: DarkLight1337 <[email protected]> * [V1] Simplify M-RoPE (vllm-project#12352) Signed-off-by: Roger Wang <[email protected]> Co-authored-by: imkero <[email protected]> * [Bugfix] Fix broken internvl2 inference with v1 (vllm-project#12360) Signed-off-by: Isotr0py <[email protected]> * [core] add wake_up doc and some sanity check (vllm-project#12361) Signed-off-by: youkaichao <[email protected]> * [torch.compile] decouple compile sizes and cudagraph sizes (vllm-project#12243) Signed-off-by: youkaichao <[email protected]> * [FP8][Kernel] Dynamic kv cache scaling factors computation (vllm-project#11906) Signed-off-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Micah Williamson <[email protected]> * [TPU] Update TPU CI to use torchxla nightly on 20250122 (vllm-project#12334) Signed-off-by: Siyuan Liu <[email protected]> * [Docs] Document Phi-4 support (vllm-project#12362) Signed-off-by: Isotr0py <[email protected]> * [BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order (vllm-project#11528) Signed-off-by: ElizaWszola <[email protected]> Co-authored-by: ElizaWszola <[email protected]> Co-authored-by: Michael Goin <[email protected]> * [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (vllm-project#12357) Signed-off-by: Junichi Sato <[email protected]> * [Docs] Add meetup slides (vllm-project#12345) Signed-off-by: Woosuk Kwon <[email protected]> * Using pytorch commit past the point when rowwise PR (pytorch/pytorch#144432) was merged (#384) * [Docs] Update spec decode + structured output in compat matrix (vllm-project#12373) Signed-off-by: Russell Bryant <[email protected]> * [V1][Frontend] Coalesce bunched `RequestOutput`s (vllm-project#12298) Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Robert Shaw <[email protected]> * Set weights_only=True when using torch.load() (vllm-project#12366) Signed-off-by: Russell Bryant <[email protected]> * [Bugfix] Path join when building local path for S3 clone (vllm-project#12353) Signed-off-by: Omer Dayan (SW-GPU) <[email protected]> * Update compressed-tensors version (vllm-project#12367) * [V1] Increase default batch size for H100/H200 (vllm-project#12369) Signed-off-by: Woosuk Kwon <[email protected]> * [perf] fix perf regression from vllm-project#12253 (vllm-project#12380) Signed-off-by: youkaichao <[email protected]> * [Misc] Use VisionArena Dataset for VLM Benchmarking (vllm-project#12389) Signed-off-by: Roger Wang <[email protected]> * [ci/build] fix wheel size check (vllm-project#12396) Signed-off-by: youkaichao <[email protected]> * [Hardware][Gaudi][Doc] Add 
missing step in setup instructions (vllm-project#12382) * [ci/build] sync default value for wheel size (vllm-project#12398) Signed-off-by: youkaichao <[email protected]> * [Misc] Enable proxy support in benchmark script (vllm-project#12356) Signed-off-by: Junichi Sato <[email protected]> * [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (vllm-project#12375) Signed-off-by: Lucas Wilkinson <[email protected]> * Applying scales rename to fp8 config (#387) * [Misc] Remove deprecated code (vllm-project#12383) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (vllm-project#12405) Signed-off-by: Lucas Wilkinson <[email protected]> * Dev-docker Documentation Updates (#378) * Dev-docker Documentation Updates Minor updates to several sections, with links to other documents where appropriate. * Fix formatting of GEMM filename * README cleanup - Reorder some sections of the README to make them easier to follow - Improve formatting of bash commands - Prefer use of huggingface model names instead of hard-coded directories - Clean up wording * Expanded sample commands for Latency and Throughput * Fix markdown links * Fix pre-commit errors * Updates from review Initial updates to incorporate feedback from a review session held with @t-parry * Update script args to match current recommendations * Remove recommended max-num-seqs values for now --------- Co-authored-by: Gregory Shtrasberg <[email protected]> * [Bugfix][Kernel] Fix moe align block issue for mixtral (vllm-project#12413) * [Bugfix] Fix BLIP-2 processing (vllm-project#12412) Signed-off-by: DarkLight1337 <[email protected]> * [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (vllm-project#12408) Signed-off-by: Divakar Verma <[email protected]> * [Misc] Add FA2 support to ViT MHA layer (vllm-project#12355) Signed-off-by: Isotr0py <[email protected]> * [TPU][CI] Update torchxla version in requirement-tpu.txt (vllm-project#12422) Signed-off-by: Siyuan Liu <[email protected]> * [Misc][Bugfix] FA3 support to ViT MHA layer (vllm-project#12435) Signed-off-by: Roger Wang <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (vllm-project#12094) Signed-off-by: Keyun Tong <[email protected]> * [V1][Bugfix] Fix assertion when mm hashing is turned off (vllm-project#12439) Signed-off-by: Roger Wang <[email protected]> * [Misc] Revert FA on ViT vllm-project#12355 and vllm-project#12435 (vllm-project#12445) * [Frontend] generation_config.json for maximum tokens(vllm-project#12242) Signed-off-by: Matthew Hendrey <[email protected]> Signed-off-by: Shangming Cai <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Yuan Tang <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: wangxiyuan <[email protected]> Co-authored-by: shangmingc <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Yuan Tang <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: wangxiyuan <[email protected]> * [Bugfix] Disable w16a16 2of4 
sparse CompressedTensors24 (vllm-project#12417) Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: mgoin <[email protected]> * [Bugfix/CI] Fix broken kernels/test_mha.py (vllm-project#12450) * [Bugfix][Kernel] Fix perf regression caused by PR vllm-project#12405 (vllm-project#12434) Signed-off-by: Lucas Wilkinson <[email protected]> * [Build/CI] Fix libcuda.so linkage (vllm-project#12424) Signed-off-by: Tyler Michael Smith <[email protected]> * [Frontend] Rerank API (Jina- and Cohere-compatible API) (vllm-project#12376) Signed-off-by: Kyle Mistele <[email protected]> * [DOC] Add link to vLLM blog (vllm-project#12460) Signed-off-by: Yuan Tang <[email protected]> * [V1] Avoid list creation in input preparation (vllm-project#12457) Signed-off-by: Woosuk Kwon <[email protected]> * [Frontend] Support scores endpoint in run_batch (vllm-project#12430) Signed-off-by: Pooya Davoodi <[email protected]> * [Bugfix] Fix Granite 3.0 MoE model loading (vllm-project#12446) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (vllm-project#12464) Signed-off-by: Isotr0py <[email protected]> * [V1][Minor] Minor optimizations for update_from_output (vllm-project#12454) Signed-off-by: Woosuk Kwon <[email protected]> * [Bugfix] Fix gpt2 GGUF inference (vllm-project#12467) Signed-off-by: Isotr0py <[email protected]> * [Build] Only build 9.0a for scaled_mm and sparse kernels (vllm-project#12339) Signed-off-by: Lucas Wilkinson <[email protected]> * [V1][Metrics] Add initial Prometheus logger (vllm-project#12416) Signed-off-by: Mark McLoughlin <[email protected]> * [V1][CI/Test] Do basic test for top-p & top-k sampling (vllm-project#12469) Signed-off-by: Woosuk Kwon <[email protected]> * [FlashInfer] Upgrade to 0.2.0 (vllm-project#11194) Signed-off-by: Bowen Wang <[email protected]> Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]> * Support FP8 FA from Quark format (#388) * Support FP8 FA from Quark format * Support FP8 FA from Quark format * nit: update comment * Direct call on ROCm * 20250127 docs update (#392) * updating code blocks * typo * updated manifest * Including feedback * whitespace * Deepseek instructions * hyperlink fix * hyperlink fix * updating what is new * cpx update * typo * whitespace * whitespace * Faster Custom Paged Attention kernels (#372) * integrate new cpa kernel, update tests and benchmark * added comments to mfma4 kernel * further comments for mfma16 kernel * clang-format * Lint * add flag for logits rtz conversion and disable by default * lint * [Bugfix]: Fix paged attention unit tests of #372 (#389) * [Bugfix]: fix paged attention tests based on the updated kernels in `csrc/attention/paged_attention_v1.cu`,`csrc/attention/paged_attention_v2.cu` and `csrc/rocm/attention.cu`. * improve code documentation. 
* lint --------- Co-authored-by: vllmellm <[email protected]> --------- Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Joe Shajrawi <[email protected]> Co-authored-by: TJian <[email protected]> Co-authored-by: vllmellm <[email protected]> * Using a more precise profiling on ROCm to properly account for weights padding (#394) * Update Dockerfile.rocm * [Bugfix]: inclucde the env variables required for running FastSyncLLM Signed-off-by: vllmellm <[email protected]> * fix pre-commit lint Signed-off-by: vllmellm <[email protected]> --------- Signed-off-by: Roger Wang <[email protected]> Signed-off-by: Rafael Vasquez <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Akshat Tripathi <[email protected]> Signed-off-by: Oleg Mosalov <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: Sungjae Lee <[email protected]> Signed-off-by: Yida Wu <[email protected]> Signed-off-by: Chenguang Li <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Alex-Brooks <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Shanshan Shen <[email protected]> Signed-off-by: elijah <[email protected]> Signed-off-by: Yikun <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Konrad Zawora <[email protected]> Signed-off-by: tjtanaa <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: yisheng <[email protected]> Signed-off-by: Abatom <[email protected]> Signed-off-by: Liangfu Chen <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Yuan Zhou <[email protected]> Signed-off-by: Sourashis Roy <[email protected]> Signed-off-by: Nishidha Panpaliya <[email protected]> Signed-off-by: Ilya Lavrenov <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: Wallas Santos <[email protected]> Signed-off-by: jiang1.li <[email protected]> Signed-off-by: yan ma <[email protected]> Signed-off-by: Randall Smith <[email protected]> Signed-off-by: Max de Bayser <[email protected]> Signed-off-by: Maxime Fournioux <[email protected]> Signed-off-by: Ye Qi <[email protected]> Signed-off-by: Mengqing Cao <[email protected]> Signed-off-by: Joe Runde <[email protected]> Signed-off-by: Kunshang Ji <[email protected]> Signed-off-by: Kuntai Du <[email protected]> Signed-off-by: Ren MinMin <[email protected]> Signed-off-by: Travis Johnson <[email protected]> Signed-off-by: Fred Reiss <[email protected]> Signed-off-by: shaochangxu.scx <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: Rui Qiao <[email protected]> Signed-off-by: Kyle Sayers <[email protected]> Signed-off-by: Rahul Tuli <[email protected]> Signed-off-by: kewang-xlnx <[email protected]> Signed-off-by: kewang2 <[email protected]> Signed-off-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Yuan Tang <[email protected]> Signed-off-by: Divakar Verma <[email protected]> Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: hongxyan <[email protected]> Signed-off-by: Michal Adamczyk <[email protected]> Signed-off-by: zibai <[email protected]> Signed-off-by: Martin Gleize <[email protected]> Signed-off-by: Shangming Cai <[email protected]> Signed-off-by: 
isikhi <[email protected]> Signed-off-by: Jason Cheng <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]> Signed-off-by: Thomas Parnell <[email protected]> Signed-off-by: Jannis Schönleber <[email protected]> Signed-off-by: rickyx <[email protected]> Signed-off-by: Andy Lo <[email protected]> Signed-off-by: Adrian Cole <[email protected]> Signed-off-by: maleksan85 <[email protected]> Signed-off-by: Hongxia Yang <[email protected]> Signed-off-by: kevin <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: xffxff <[email protected]> Signed-off-by: wangerxiao <[email protected]> Signed-off-by: Alexei V. Ivanov <[email protected]> Signed-off-by: zhenwei <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: Siyuan Liu <[email protected]> Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Junichi Sato <[email protected]> Signed-off-by: Omer Dayan (SW-GPU) <[email protected]> Signed-off-by: Keyun Tong <[email protected]> Signed-off-by: Matthew Hendrey <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Kyle Mistele <[email protected]> Signed-off-by: Pooya Davoodi <[email protected]> Signed-off-by: Mark McLoughlin <[email protected]> Signed-off-by: Bowen Wang <[email protected]> Signed-off-by: vllmellm <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Rafael Vasquez <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Akshat Tripathi <[email protected]> Co-authored-by: Oleg Mosalov <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Avshalom Manevich <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Yangcheng Li <[email protected]> Co-authored-by: Siyuan Li <[email protected]> Co-authored-by: Sungjae Lee <[email protected]> Co-authored-by: Concurrensee <[email protected]> Co-authored-by: Chenguang Li <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Alex Brooks <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Shanshan Shen <[email protected]> Co-authored-by: elijah <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Steve Luo <[email protected]> Co-authored-by: mgoin <[email protected]> Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Alexei V. 
Ivanov <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: Konrad Zawora <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Co-authored-by: maang-h <[email protected]> Co-authored-by: YiSheng5 <[email protected]> Co-authored-by: Zhonghua Deng <[email protected]> Co-authored-by: Liangfu Chen <[email protected]> Co-authored-by: XiaobingZhang <[email protected]> Co-authored-by: Russell Bryant <[email protected]> Co-authored-by: Yuan <[email protected]> Co-authored-by: jiangjiadi <[email protected]> Co-authored-by: jiadi.jjd <[email protected]> Co-authored-by: sroy745 <[email protected]> Co-authored-by: Jie Fu (傅杰) <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: WangErXiao <[email protected]> Co-authored-by: Nishidha <[email protected]> Co-authored-by: Ilya Lavrenov <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Wallas Henrique <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Yan Ma <[email protected]> Co-authored-by: rasmith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Maximilien de Bayser <[email protected]> Co-authored-by: Maxime Fournioux <[email protected]> Co-authored-by: Guspan Tanadi <[email protected]> Co-authored-by: Ye (Charlotte) Qi <[email protected]> Co-authored-by: yeq <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> Co-authored-by: Charles Frye <[email protected]> Co-authored-by: Joe Runde <[email protected]> Co-authored-by: Kunshang Ji <[email protected]> Co-authored-by: cennn <[email protected]> Co-authored-by: Kuntai Du <[email protected]> Co-authored-by: minmin <[email protected]> Co-authored-by: Ren MinMin <[email protected]> Co-authored-by: Travis Johnson <[email protected]> Co-authored-by: Fred Reiss <[email protected]> Co-authored-by: shaochangxu <[email protected]> Co-authored-by: shaochangxu.scx <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: sixgod <[email protected]> Co-authored-by: Elfie Guo <[email protected]> Co-authored-by: Rui Qiao <[email protected]> Co-authored-by: Kyle Sayers <[email protected]> Co-authored-by: Rahul Tuli <[email protected]> Co-authored-by: Keyun Tong <[email protected]> Co-authored-by: RunningLeon <[email protected]> Co-authored-by: kewang-xlnx <[email protected]> Co-authored-by: kewang2 <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: tvirolai-amd <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Zhaoyi Li <[email protected]> Co-authored-by: charlifu <[email protected]> Co-authored-by: Yuan Tang <[email protected]> Co-authored-by: Cody Yu <[email protected]> Co-authored-by: Hongxia Yang <[email protected]> Co-authored-by: yancong <[email protected]> Co-authored-by: Michal Adamczyk <[email protected]> Co-authored-by: gujing <[email protected]> Co-authored-by: imkero <[email protected]> Co-authored-by: Martin Gleize <[email protected]> Co-authored-by: mgleize user <[email protected]> Co-authored-by: shangmingc <[email protected]> Co-authored-by: Işık <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cheng Kuan Yong Jason <[email protected]> Co-authored-by: Jinzhen Lin <[email protected]> Co-authored-by: Thomas Parnell <[email protected]> Co-authored-by: 
Jannis Schönleber <[email protected]> Co-authored-by: Ricky Xu <[email protected]> Co-authored-by: Andy Lo <[email protected]> Co-authored-by: Adrian Cole <[email protected]> Co-authored-by: Jani Monoses <[email protected]> Co-authored-by: Kevin H. Luu <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]> Co-authored-by: maleksan85 <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: zhou fan <[email protected]> Co-authored-by: ilia-cher <[email protected]> Co-authored-by: liuzhenwei <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]> Co-authored-by: Micah Williamson <[email protected]> Co-authored-by: Siyuan Liu <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: ElizaWszola <[email protected]> Co-authored-by: Junichi Sato <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: omer-dayan <[email protected]> Co-authored-by: Mohit Deopujari <[email protected]> Co-authored-by: Jeremy Arnold <[email protected]> Co-authored-by: Matthew Hendrey <[email protected]> Co-authored-by: Kyle Mistele <[email protected]> Co-authored-by: Pooya Davoodi <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Bowen Wang <[email protected]> Co-authored-by: Bowen Bao <[email protected]> Co-authored-by: arakowsk-amd <[email protected]> Co-authored-by: sanyalington <[email protected]> Co-authored-by: Joe Shajrawi <[email protected]> Co-authored-by: vllmellm <[email protected]>
1 parent 080a4bf commit 479b843


434 files changed: 17085 additions, 9053 deletions

.buildkite/check-wheel-size.py

Lines changed: 5 additions & 2 deletions
@@ -2,8 +2,11 @@
 import sys
 import zipfile
 
-# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
-VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
+# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 300 MiB
+# Note that we have 400 MiB quota, please use it wisely.
+# See https://github.com/pypi/support/issues/3792 .
+# Please also sync the value with the one in Dockerfile.
+VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))
 
 
 def print_top_10_largest_files(zip_file):
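For context, the kind of check this script performs can be sketched in Python as follows. Only the VLLM_MAX_SIZE_MB default comes from the diff above; the helper name check_wheel_size, the CLI handling, and the reporting format are illustrative assumptions, not the repository's actual code.

import os
import sys

# Default mirrors the diff above (300 MiB); everything else is an illustrative sketch.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))

def check_wheel_size(wheel_path: str) -> int:
    # Compare the built wheel's on-disk size against the configured limit.
    size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
    if size_mb > VLLM_MAX_SIZE_MB:
        print(f"{wheel_path} is {size_mb:.1f} MiB, exceeding the {VLLM_MAX_SIZE_MB} MiB limit.")
        return 1
    print(f"{wheel_path} is {size_mb:.1f} MiB, within the {VLLM_MAX_SIZE_MB} MiB limit.")
    return 0

if __name__ == "__main__":
    # Hypothetical usage: pass the wheel path as the first argument.
    sys.exit(check_wheel_size(sys.argv[1]))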

.buildkite/nightly-benchmarks/scripts/nightly-annotate.sh

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ main() {
 
 
 
-    # The figures should be genereated by a separate process outside the CI/CD pipeline
+    # The figures should be generated by a separate process outside the CI/CD pipeline
 
     # # generate figures
     # python3 -m pip install tabulate pandas matplotlib

.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh

Lines changed: 107 additions & 0 deletions
@@ -301,6 +301,104 @@ run_serving_tests() {
   kill_gpu_processes
 }
 
+run_genai_perf_tests() {
+  # run genai-perf tests
+
+  # $1: a json file specifying genai-perf test cases
+  local genai_perf_test_file
+  genai_perf_test_file=$1
+
+  # Iterate over genai-perf tests
+  jq -c '.[]' "$genai_perf_test_file" | while read -r params; do
+    # get the test name, and append the GPU type back to it.
+    test_name=$(echo "$params" | jq -r '.test_name')
+
+    # if TEST_SELECTOR is set, only run the test cases that match the selector
+    if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
+      echo "Skip test case $test_name."
+      continue
+    fi
+
+    # prepend the current serving engine to the test name
+    test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}
+
+    # get common parameters
+    common_params=$(echo "$params" | jq -r '.common_parameters')
+    model=$(echo "$common_params" | jq -r '.model')
+    tp=$(echo "$common_params" | jq -r '.tp')
+    dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
+    dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
+    port=$(echo "$common_params" | jq -r '.port')
+    num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
+    reuse_server=$(echo "$common_params" | jq -r '.reuse_server')
+
+    # get client and server arguments
+    server_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_server_parameters")
+    qps_list=$(echo "$params" | jq -r '.qps_list')
+    qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
+    echo "Running over qps list $qps_list"
+
+    # check if there is enough GPU to run the test
+    if [[ $gpu_count -lt $tp ]]; then
+      echo "Required num-shard $tp but only $gpu_count GPU found. Skip testcase $test_name."
+      continue
+    fi
+
+    if [[ $reuse_server == "true" ]]; then
+      echo "Reuse previous server for test case $test_name"
+    else
+      kill_gpu_processes
+      bash "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh" \
+        "$server_params" "$common_params"
+    fi
+
+    if wait_for_server; then
+      echo ""
+      echo "$CURRENT_LLM_SERVING_ENGINE server is up and running."
+    else
+      echo ""
+      echo "$CURRENT_LLM_SERVING_ENGINE failed to start within the timeout period."
+      break
+    fi
+
+    # iterate over different QPS
+    for qps in $qps_list; do
+      # remove the surrounding single quote from qps
+      if [[ "$qps" == *"inf"* ]]; then
+        echo "qps was $qps"
+        qps=$num_prompts
+        echo "now qps is $qps"
+      fi
+
+      new_test_name=$test_name"_qps_"$qps
+      backend=$CURRENT_LLM_SERVING_ENGINE
+
+      if [[ "$backend" == *"vllm"* ]]; then
+        backend="vllm"
+      fi
+      #TODO: add output dir.
+      client_command="genai-perf profile \
+        -m $model \
+        --service-kind openai \
+        --backend vllm \
+        --endpoint-type chat \
+        --streaming \
+        --url localhost:$port \
+        --request-rate $qps \
+        --num-prompts $num_prompts \
+        "
+
+      echo "Client command: $client_command"
+
+      eval "$client_command"
+
+      #TODO: process/record outputs
+    done
+  done
+
+  kill_gpu_processes
+
+}
 
 prepare_dataset() {
 
@@ -328,12 +426,17 @@ main() {
 
   pip install -U transformers
 
+  pip install -r requirements-dev.txt
+  which genai-perf
+
   # check storage
   df -h
 
   ensure_installed wget
   ensure_installed curl
   ensure_installed jq
+  # genai-perf dependency
+  ensure_installed libb64-0d
 
   prepare_dataset
 
@@ -345,6 +448,10 @@ main() {
   # run the test
   run_serving_tests "$BENCHMARK_ROOT/tests/nightly-tests.json"
 
+  # run genai-perf tests
+  run_genai_perf_tests "$BENCHMARK_ROOT/tests/genai-perf-tests.json"
+  mv artifacts/ $RESULTS_FOLDER/
+
   # upload benchmark results to buildkite
   python3 -m pip install tabulate pandas
   python3 "$BENCHMARK_ROOT/scripts/summary-nightly-results.py"
.buildkite/nightly-benchmarks/tests/genai-perf-tests.json

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+[
+    {
+        "test_name": "llama8B_tp1_genai_perf",
+        "qps_list": [4,8,16,32],
+        "common_parameters": {
+            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+            "tp": 1,
+            "port": 8000,
+            "num_prompts": 500,
+            "reuse_server": false
+        },
+        "vllm_server_parameters": {
+            "disable_log_stats": "",
+            "disable_log_requests": "",
+            "gpu_memory_utilization": 0.9,
+            "num_scheduler_steps": 10,
+            "max_num_seqs": 512,
+            "dtype": "bfloat16"
+        },
+        "genai_perf_input_parameters": {
+        }
+    }
+]
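To make the relationship between this spec and run_genai_perf_tests above concrete, here is a minimal Python sketch of how one test case expands into genai-perf invocations, one per QPS value. The function name and the standalone-script framing are assumptions for illustration; the actual CI does this in bash via jq, as shown in the diff above.

import json
import shlex

def build_genai_perf_commands(spec_path: str) -> list[str]:
    # Expand each test case into one genai-perf command per QPS value,
    # mirroring the jq-based loop in run_genai_perf_tests above.
    commands = []
    with open(spec_path) as f:
        cases = json.load(f)
    for case in cases:
        common = case["common_parameters"]
        for qps in case["qps_list"]:
            # The shell script replaces an "inf" QPS with the prompt count.
            rate = common["num_prompts"] if qps == "inf" else qps
            commands.append(
                "genai-perf profile "
                f"-m {shlex.quote(common['model'])} "
                "--service-kind openai --backend vllm --endpoint-type chat --streaming "
                f"--url localhost:{common['port']} "
                f"--request-rate {rate} --num-prompts {common['num_prompts']}"
            )
    return commands

if __name__ == "__main__":
    # Hypothetical usage: print the commands derived from the spec shown above.
    for cmd in build_genai_perf_commands("genai-perf-tests.json"):
        print(cmd)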

.buildkite/run-cpu-test.sh

Lines changed: 2 additions & 2 deletions
@@ -83,6 +83,6 @@ function cpu_tests() {
   tests/lora/test_qwen2vl.py"
 }
 
-# All of CPU tests are expected to be finished less than 25 mins.
+# All of CPU tests are expected to be finished less than 40 mins.
 export -f cpu_tests
-timeout 30m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
+timeout 40m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"

.buildkite/run-hpu-test.sh

Lines changed: 10 additions & 2 deletions
@@ -8,9 +8,17 @@ set -ex
 docker build -t hpu-test-env -f Dockerfile.hpu .
 
 # Setup cleanup
+# certain versions of HPU software stack have a bug that can
+# override the exit code of the script, so we need to use
+# separate remove_docker_container and remove_docker_container_and_exit
+# functions, while other platforms only need one remove_docker_container
+# function.
+EXITCODE=1
 remove_docker_container() { docker rm -f hpu-test || true; }
-trap remove_docker_container EXIT
+remove_docker_container_and_exit() { remove_docker_container; exit $EXITCODE; }
+trap remove_docker_container_and_exit EXIT
 remove_docker_container
 
 # Run the image and launch offline inference
-docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic.py
+docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic.py
+EXITCODE=$?

.buildkite/run-neuron-test.sh

Lines changed: 4 additions & 1 deletion
@@ -25,8 +25,11 @@ if [ -f /tmp/neuron-docker-build-timestamp ]; then
     last_build=$(cat /tmp/neuron-docker-build-timestamp)
     current_time=$(date +%s)
     if [ $((current_time - last_build)) -gt 86400 ]; then
+        # Remove dangling images (those that are not tagged and not used by any container)
         docker image prune -f
-        docker system prune -f
+        # Remove unused volumes / force the system prune for old images as well.
+        docker volume prune -f && docker system prune -f
+        # Remove huggingface model artifacts and compiler cache
         rm -rf "${HF_MOUNT:?}/*"
         rm -rf "${NEURON_COMPILE_CACHE_MOUNT:?}/*"
         echo "$current_time" > /tmp/neuron-docker-build-timestamp

.buildkite/test-pipeline.yaml

Lines changed: 26 additions & 5 deletions
@@ -52,7 +52,6 @@ steps:
   - tests/worker
   - tests/standalone_tests/lazy_torch_compile.py
   commands:
-  - pip install git+https://github.com/Isotr0py/DeepSeek-VL2.git # Used by multimoda processing test
  - python3 standalone_tests/lazy_torch_compile.py
  - pytest -v -s mq_llm_engine # MQLLMEngine
  - pytest -v -s async_engine # AsyncLLMEngine
@@ -77,7 +76,9 @@ steps:
   - tests/basic_correctness/test_basic_correctness
   - tests/basic_correctness/test_cpu_offload
   - tests/basic_correctness/test_preemption
+  - tests/basic_correctness/test_cumem.py
   commands:
+  - pytest -v -s basic_correctness/test_cumem.py
   - pytest -v -s basic_correctness/test_basic_correctness.py
   - pytest -v -s basic_correctness/test_cpu_offload.py
   - VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py
@@ -107,7 +108,7 @@ steps:
   source_file_dependencies:
   - vllm/
   commands:
-  - pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py
+  - pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py --ignore=entrypoints/llm/test_collective_rpc.py
   - pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
   - pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
   - pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process
@@ -126,11 +127,15 @@ steps:
   - tests/distributed
   - tests/spec_decode/e2e/test_integration_dist_tp4
   - tests/compile
+  - examples/offline_inference/rlhf.py
   commands:
   - pytest -v -s distributed/test_utils.py
   - pytest -v -s compile/test_basic_correctness.py
   - pytest -v -s distributed/test_pynccl.py
   - pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py
+  # TODO: create a dedicated test section for multi-GPU example tests
+  # when we have multiple distributed example tests
+  - python3 ../examples/offline_inference/rlhf.py
 
 - label: Metrics, Tracing Test # 10min
   num_gpus: 2
@@ -178,7 +183,16 @@ steps:
   - vllm/
   - tests/v1
   commands:
-  - VLLM_USE_V1=1 pytest -v -s v1
+  # split the test to avoid interference
+  - VLLM_USE_V1=1 pytest -v -s v1/core
+  - VLLM_USE_V1=1 pytest -v -s v1/engine
+  - VLLM_USE_V1=1 pytest -v -s v1/sample
+  - VLLM_USE_V1=1 pytest -v -s v1/worker
+  - VLLM_USE_V1=1 pytest -v -s v1/test_stats.py
+  - VLLM_USE_V1=1 pytest -v -s v1/test_utils.py
+  # TODO: accuracy does not match, whether setting
+  # VLLM_USE_FLASHINFER_SAMPLER or not on H100.
+  - VLLM_USE_V1=1 pytest -v -s v1/e2e
 
 - label: Examples Test # 25min
   working_dir: "/vllm-workspace/examples"
@@ -462,7 +476,10 @@ steps:
   - vllm/worker/worker_base.py
   - vllm/worker/worker.py
   - vllm/worker/model_runner.py
+  - entrypoints/llm/test_collective_rpc.py
   commands:
+  - pytest -v -s entrypoints/llm/test_collective_rpc.py
+  - torchrun --nproc-per-node=2 distributed/test_torchrun_example.py
   - pytest -v -s ./compile/test_basic_correctness.py
   - pytest -v -s ./compile/test_wrapper.py
   - VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
@@ -471,7 +488,9 @@ steps:
   - pytest models/encoder_decoder/language/test_bart.py -v -s -m 'distributed(num_gpus=2)'
   - pytest models/encoder_decoder/vision_language/test_broadcast.py -v -s -m 'distributed(num_gpus=2)'
   - pytest models/decoder_only/vision_language/test_models.py -v -s -m 'distributed(num_gpus=2)'
-  - pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
+  # this test fails consistently.
+  # TODO: investigate and fix
+  # - pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
   - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
   - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/disagg_test.py
 
@@ -509,7 +528,9 @@ steps:
   - vllm/engine
   - tests/multi_step
   commands:
-  - pytest -v -s multi_step/test_correctness_async_llm.py
+  # this test is quite flaky
+  # TODO: investigate and fix.
+  # - pytest -v -s multi_step/test_correctness_async_llm.py
   - pytest -v -s multi_step/test_correctness_llm.py
 
 - label: Pipeline Parallelism Test # 45min

.buildkite/test-template.j2

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ steps:
   depends_on:
   - "amd-build"
   agents:
-    queue: amd_rocm_gpu
+    queue: amd_gpu
   commands:
   - bash .buildkite/run-amd-test.sh "cd {{ (step.working_dir or default_working_dir) | safe }} ; {{ step.command or (step.commands | join(" && ")) | safe }}"
   env:

.github/CODEOWNERS

Lines changed: 15 additions & 12 deletions
@@ -2,32 +2,35 @@
 # for more info about CODEOWNERS file
 
 # This lists cover the "core" components of vLLM that require careful review
-/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
-/vllm/core @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
-/vllm/engine/llm_engine.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
-/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
-/vllm/worker/worker_base.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
-/vllm/worker/worker.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
-/vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
+/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
+/vllm/core @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
+/vllm/engine/llm_engine.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
+/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
+/vllm/worker/worker_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
+/vllm/worker/worker.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
+/vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
+/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth
+/vllm/model_executor/guided_decoding @mgoin
+/vllm/multimodal @DarkLight1337 @ywang96
 CMakeLists.txt @tlrmchlsmth
 
 # vLLM V1
-/vllm/v1 @WoosukKwon @robertgshaw2-neuralmagic @njhill @ywang96 @comaniac @alexm-neuralmagic
+/vllm/v1 @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat
 
 # Test ownership
-/tests/async_engine @njhill @robertgshaw2-neuralmagic @simon-mo
+/tests/async_engine @njhill @robertgshaw2-redhat @simon-mo
 /tests/test_inputs.py @DarkLight1337 @ywang96
-/tests/entrypoints @DarkLight1337 @robertgshaw2-neuralmagic @simon-mo
+/tests/entrypoints @DarkLight1337 @robertgshaw2-redhat @simon-mo
 /tests/models @DarkLight1337 @ywang96
 /tests/multimodal @DarkLight1337 @ywang96
 /tests/prefix_caching @comaniac @KuntaiDu
 /tests/spec_decode @njhill @LiuXiaoxuanPKU
 /tests/kernels @tlrmchlsmth @WoosukKwon
-/tests/quantization @mgoin @robertgshaw2-neuralmagic
+/tests/quantization @mgoin @robertgshaw2-redhat
 /.buildkite/lm-eval-harness @mgoin @simon-mo
 /tests/distributed/test_multi_node_assignment.py @youkaichao
 /tests/distributed/test_pipeline_parallel.py @youkaichao
 /tests/distributed/test_same_node.py @youkaichao
-/tests/multi_step @alexm-neuralmagic @comaniac
+/tests/multi_step @alexm-redhat @comaniac
 /tests/weight_loading @mgoin @youkaichao
 /tests/basic_correctness/test_chunked_prefill @rkooo567 @comaniac
