
Commit a699a11

Alexei-V-Ivanov-AMD, hmellor, markmc, njhill, and mgoin authored
Merging in the latest merge from vllm-project to ROCm (#472)
* Fix `head_dim` not existing in all model configs (Transformers backend) (vllm-project#14141) Signed-off-by: Harry Mellor <[email protected]>
* [V0][Metrics] Remove unimplemented `vllm:tokens_total` (vllm-project#14134) Signed-off-by: Mark McLoughlin <[email protected]>
* [V0][Metrics] Deprecate some KV/prefix cache metrics (vllm-project#14136) Signed-off-by: Mark McLoughlin <[email protected]>
* [V1] Simplify stats logging (vllm-project#14082) Signed-off-by: Nick Hill <[email protected]>
* [WIP][[V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics (vllm-project#14055) Signed-off-by: Mark McLoughlin <[email protected]>
* [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 (vllm-project#14100) Signed-off-by: mgoin <[email protected]>
* [Kernel] Optimize moe intermediate_cache usage (vllm-project#13625) Signed-off-by: mgoin <[email protected]>
* [Docs] Add GPTQModel (vllm-project#14056) Signed-off-by: mgoin <[email protected]> Co-authored-by: mgoin <[email protected]>
* [v1] Add comments to the new ragged paged attention Pallas kernel (vllm-project#14155) Signed-off-by: Xiongfei Wei <[email protected]> Co-authored-by: Michael Goin <[email protected]>
* [Model] Add support for GraniteMoeShared models (vllm-project#13313) Signed-off-by: Travis Johnson <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* [core] moe fp8 block quant tuning support (vllm-project#14068) Signed-off-by: Divakar Verma <[email protected]>
* [Misc] Remove lru_cache in NvmlCudaPlatform (vllm-project#14156) Signed-off-by: Cody Yu <[email protected]>
* [core] Pass all driver env vars to ray workers unless excluded (vllm-project#14099) Signed-off-by: Rui Qiao <[email protected]>
* Use math.prod instead of np.prod for trivial ops (vllm-project#14142)
* Fix benchmark_moe.py tuning for CUDA devices (vllm-project#14164)
* [platform] add debug logging during inferring the device type (vllm-project#14195) Signed-off-by: youkaichao <[email protected]>
* [sleep mode] error out with expandable_segments (vllm-project#14189) Signed-off-by: youkaichao <[email protected]>
* [doc] add "Failed to infer device type" to faq (vllm-project#14200) Signed-off-by: youkaichao <[email protected]>
* [Bugfix] Restrict MacOS CPU detection (vllm-project#14210) Signed-off-by: mgoin <[email protected]>
* [V1][BugFix] Fix remaining sync engine client shutdown errors/hangs (vllm-project#13869) Signed-off-by: Nick Hill <[email protected]>
* [V0][Metrics] Deprecate some questionable request time metrics (vllm-project#14135) Signed-off-by: Mark McLoughlin <[email protected]>
* [V1][Molmo] Fix get_multimodal_embeddings() in molmo.py (vllm-project#14161)
* add cutlass support for blackwell fp8 gemm (vllm-project#13798)
* [TPU][Profiler] Support start_profile/stop_profile in TPU worker (vllm-project#13988) Signed-off-by: Siyuan Liu <[email protected]> Co-authored-by: mgoin <[email protected]>
* Fix performance when `--generation-config` is not `None` (vllm-project#14223) Signed-off-by: Harry Mellor <[email protected]>
* [Frontend] Do `prompt_logprobs` clamping for chat as well as completions (vllm-project#14225) Signed-off-by: Harry Mellor <[email protected]>
* [Docs] Update Dockerfile dependency image (vllm-project#14215) Signed-off-by: mgoin <[email protected]>
* [v1][Metrics] Add design doc (vllm-project#12745) Signed-off-by: Mark McLoughlin <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Cody Yu <[email protected]>
* [Security] Serialize using safetensors instead of pickle in Mooncake Pipe (vllm-project#14228) Signed-off-by: KuntaiDu <[email protected]>
* Clean up unused padding_idx variables across many model definitions (vllm-project#13240) Signed-off-by: Tyler Michael Smith <[email protected]>
* [ROCm] Disable a few more kernel tests that are broken on ROCm (vllm-project#14145) Signed-off-by: Sage Moore <[email protected]>
* [V1][TPU] TPU multimodal model support for ragged attention (vllm-project#14158) Signed-off-by: Michael Goin <[email protected]>
* [misc] announce china meetup (vllm-project#14248) Signed-off-by: youkaichao <[email protected]>
* Moved numba from common requirements to cuda/rocm specific requirements (vllm-project#14199) Signed-off-by: Nishidha Panpaliya <[email protected]>
* Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 (vllm-project#14157) Signed-off-by: mgoin <[email protected]>
* [Bugfix] Fix gptq_marlin for deepseek-v3 (vllm-project#13750) Signed-off-by: dangshunya <[email protected]> Co-authored-by: dangshunya <[email protected]>
* [V1][Bugfix] Do not reset prefix caching metrics (vllm-project#14235)
* [Model] New model support for Phi-4-multimodal-instruct (vllm-project#14119)
* [V1] EP/TP MoE + DP Attention (vllm-project#13931)
* [platforms] improve rocm debugging info (vllm-project#14257)
* Temporarily disable test_awq_gemm_opcheck (vllm-project#14251) Signed-off-by: mgoin <[email protected]>
* [Frontend] Allow return_tokens_as_token_ids to be passed as a request param (vllm-project#14066) Signed-off-by: Benjamin Chislett <[email protected]>
* [Misc][V1] Avoid using `envs.VLLM_USE_V1` in mm processing (vllm-project#14256) Signed-off-by: Roger Wang <[email protected]>
* [Bugfix][V1] Fix allowed_token_ids for v1 Sampler (vllm-project#14169) Signed-off-by: Lu Fang <[email protected]>
* [Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID (vllm-project#14217) Signed-off-by: Iacopo Poli <[email protected]>
* [Doc] [3/N] Refer code examples for common cases in dev multimodal processor (vllm-project#14278) Signed-off-by: DarkLight1337 <[email protected]>
* Small update for external_launcher backend docs (vllm-project#14288)
* [V1][Frontend] Add Testing For V1 Runtime Parameters (vllm-project#14159) Signed-off-by: [email protected] <[email protected]>
* [LoRA] Remove linear hack outside transformers backend (vllm-project#14177) Signed-off-by: Isotr0py <[email protected]>
* [Misc] Add Qwen2MoeForCausalLM moe tuning support (vllm-project#14276) Signed-off-by: Jee Jee Li <[email protected]>
* prefix_caching.md: Fixed typo (vllm-project#14293) Signed-off-by: Daivid Savernin-Frenk <[email protected]>
* [Bugfix] Fix broken vision language example (vllm-project#14292) Signed-off-by: Isotr0py <[email protected]>
* [Docs] Add Meta Slides (vllm-project#14297) Signed-off-by: simon-mo <[email protected]>
* [V1][Minor] Remove obsolete FIXME comment (vllm-project#14304) Signed-off-by: Nick Hill <[email protected]>
* Deprecate `best_of` Sampling Parameter in anticipation for vLLM V1 (vllm-project#13997) Signed-off-by: vincent-4 <[email protected]> Signed-off-by: Brayden Zhong <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Co-authored-by: Brayden Zhong <[email protected]> Co-authored-by: Harry Mellor <[email protected]>
* [V1][BugFix] Fix for mixed top_k batch (vllm-project#14301) Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Ye Cao <[email protected]>
* [misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env (vllm-project#14267)
* [V1][Easy] Add empty allowed_token_ids in the v1 sampler test (vllm-project#14308) Signed-off-by: Lu Fang <[email protected]>
* init Signed-off-by: Sage Moore <[email protected]>
* [Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch (vllm-project#14237) Signed-off-by: pyc96 <[email protected]>
* [Bugfix] Remove num_tokens_across_dp (vllm-project#14302) Signed-off-by: Tyler Michael Smith <[email protected]>
* [BugFix] Fix prefix caching V0 MLA (vllm-project#14255) Signed-off-by: Lucas Wilkinson <[email protected]> Co-authored-by: Ying Zhong <[email protected]>
* [CI/Build] Use spawn multiprocessing mode for V1 test pipeline (vllm-project#14243) Signed-off-by: Russell Bryant <[email protected]>
* Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (vllm-project#13917) Signed-off-by: mgoin <[email protected]>
* [Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation (vllm-project#13850) Signed-off-by: Yuan Tang <[email protected]>
* [BugFix] MLA + V1, illegal memory access and accuracy issues (vllm-project#14253) Signed-off-by: Lucas Wilkinson <[email protected]>
* [misc] Mention `ray list nodes` command to troubleshoot ray issues (vllm-project#14318) Signed-off-by: Rui Qiao <[email protected]>
* [Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 (vllm-project#14114)
* [V1] LoRA - Enable more V1 tests (vllm-project#14315) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]>
* [Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention (vllm-project#11301)
* [Hardware] Update the flash attn tag to support Blackwell (vllm-project#14244)
* [Model] Update Paligemma multimodal processing with PromptUpdate (vllm-project#14015) Signed-off-by: Kyle Huang <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* [V1][VLM][Pixtral-HF] Support Pixtral-HF on V1 (vllm-project#14275) Signed-off-by: Linkun Chen <[email protected]>
* [Core] Optimizing cross-attention `QKVParallelLinear` computation (vllm-project#12325) Signed-off-by: NickLucche <[email protected]> Signed-off-by: NickLucche <[email protected]> Co-authored-by: NickLucche <[email protected]>
* [Frontend][Docs] Transcription API streaming (vllm-project#13301) Signed-off-by: NickLucche <[email protected]>
* [Doc] Update reasoning with stream example to use OpenAI library (vllm-project#14077) Signed-off-by: liuyanyi <[email protected]>
* [Doc] Correct beam_search using in generative_models.md (vllm-project#14363)
* [Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (vllm-project#14152)
* [Bugfix][Core] fix abort_seq_group and memory leak when n>1 (vllm-project#14326) Signed-off-by: courage17340 <[email protected]>
* [Core] Don't use cache during multi-modal profiling (vllm-project#14336)
* [Doc] Fix date typo in README.md (vllm-project#14366) Signed-off-by: Jitse Klomp <[email protected]>
* [RLHF] use worker_extension_cls for compatibility with V0 and V1 (vllm-project#14185) Signed-off-by: youkaichao <[email protected]>
* Reinstate `best_of` for V0 (vllm-project#14356) Signed-off-by: Harry Mellor <[email protected]>
* Adding cpu inference with VXE ISA for s390x architecture (vllm-project#12613) Signed-off-by: Dilip Gowda Bhagavan <[email protected]> Signed-off-by: Rishika Kedia <[email protected]> Co-authored-by: Rishika Kedia <[email protected]>
* Add authors to license header. (vllm-project#14371) Signed-off-by: Thomas Parnell <[email protected]> Co-authored-by: Burkhard Ringlein <[email protected]> Co-authored-by: Jan van Lunteren <[email protected]>
* Fix mla prefill context performance (vllm-project#13897) Signed-off-by: ZhongYingMatrix <[email protected]>
* [V1] Do not detokenize if sampling param detokenize is False (vllm-project#14224) Signed-off-by: Himanshu Jaju <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]>
* [Distributed] Add enable_expert_parallel arg (vllm-project#14305) Signed-off-by: Tyler Michael Smith <[email protected]>
* [CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa (vllm-project#13569) Signed-off-by: mgoin <[email protected]>
* [CI] Disable spawn when running V1 Test (vllm-project#14345) Signed-off-by: Thomas Parnell <[email protected]>
* [Kernel] Add needs_fixed_stride_order tag to most GEMMs (vllm-project#14306) Signed-off-by: Tyler Michael Smith <[email protected]>
* [Bugfix] Fix use_direct_call condition in FusedMoE layer for (vllm-project#14382) Signed-off-by: Tyler Michael Smith <[email protected]>
* [Bug] Fix Attention when ignored in by quant_method (vllm-project#14313) Signed-off-by: mgoin <[email protected]>
* [V1][Bugfix] Standardize quantized kv cache rejection for attention backends (vllm-project#14221) Signed-off-by: mgoin <[email protected]>
* [Docs] Add nsight guide to profiling docs (vllm-project#14298) Signed-off-by: mgoin <[email protected]>
* cleanup boolean logic Signed-off-by: Sage Moore <[email protected]>
* [Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue (vllm-project#14310) Signed-off-by: Chengji Yao <[email protected]>
* [Doc] Fix a typo (vllm-project#14385)
* [Bugfix] Correctly call `cudaProfilerStop` in benchmarks script (vllm-project#14183) Signed-off-by: Brayden Zhong <[email protected]>
* [Perf] Reduce MLA CPU overheads in V1 (vllm-project#14384) Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]>
* [FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object (vllm-project#14390) Signed-off-by: luka <[email protected]>
* [BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs (vllm-project#14396)
* [Bugfix] Fix JambaForCausalLM LoRA (vllm-project#14370) Signed-off-by: Jee Jee Li <[email protected]>
* [Build] Add nightly wheel fallback when latest commit wheel unavailable (vllm-project#14358) Signed-off-by: Isotr0py <[email protected]>
* OpenVINO: added CPU-like conditions (vllm-project#14338) Signed-off-by: Ilya Lavrenov <[email protected]>
* [GH] Auto-apply multi-modality label to relevant PRs (vllm-project#14402) Signed-off-by: DarkLight1337 <[email protected]>
* correct wrong markdown syntax (vllm-project#14414) Signed-off-by: vincent-pli <[email protected]>
* [Bugfix] Further clean up LoRA test (vllm-project#14422) Signed-off-by: Jee Jee Li <[email protected]>
* [Bugfix] Clean up multi-modal processors (vllm-project#14417) Signed-off-by: DarkLight1337 <[email protected]>
* [Misc] Set default value of seed to None (vllm-project#14274) Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <[email protected]>
* [BUGFIX] Skip tokenization support for throughput benchmark (vllm-project#12712) Signed-off-by: root <[email protected]> Signed-off-by: Aleksandr Malyshev <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]>
* Fix missing `kv_caches` and `attn_metadata` in `OpenVINOCausalLM` (vllm-project#14271) Signed-off-by: Harry Mellor <[email protected]>
* Use the optimized block sizes after tuning the kernel. (vllm-project#14329)
* [V1][Core] Support for Structured Outputs (vllm-project#12388) Signed-off-by: Aaron Pham <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Russell Bryant <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Nick Hill <[email protected]>
* [Doc] Update prefix_caching.md to match the example image (vllm-project#14420)
* [Benchmarks] Make detokenization optional in benchmark scripts (vllm-project#11697) Signed-off-by: Jeremy Arnold <[email protected]>
* comments Signed-off-by: Sage Moore <[email protected]>
* [Kernel] optimize performance of gptq marlin kernel when n is small (vllm-project#14138) Signed-off-by: Jinzhen Lin <[email protected]>
* [Misc] Add Phi4-MM example (vllm-project#14343) Signed-off-by: Jee Jee Li <[email protected]>
* [v1] torch.compile integration explanation (vllm-project#14437) Signed-off-by: youkaichao <[email protected]>
* [V1] Eagerly remove finished requests from the batch (vllm-project#14388) Signed-off-by: Nick Hill <[email protected]>
* [V1][Metrics] Fix traceback with preemptions+LoRA (vllm-project#14220) Signed-off-by: Mark McLoughlin <[email protected]>
* [Bugfix] Fix torch_xla which can't handle None seed introduced in vllm-project#14274 (vllm-project#14459) Signed-off-by: Yarong Mu <[email protected]>
* [V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC (vllm-project#13949)
* [Bugfix][V1] Handle MLA in kv_cache_interface (vllm-project#14462) Signed-off-by: Tyler Michael Smith <[email protected]>
* Revert "[Perf] Reduce MLA CPU overheads in V1 (vllm-project#14384)" (vllm-project#14471)
* [Bugfix][Disaggregated] Add a check in send_kv_caches_and_hidden_states and fix the reshape of the KVCache (vllm-project#14369) Signed-off-by: Mathis Felardos <[email protected]>
* [MISC][V1] Register process killing handler only in the main thread (vllm-project#14380) Signed-off-by: Cody Yu <[email protected]>
* [core] add `extra_args` to `SamplingParams` (vllm-project#13300) Signed-off-by: Aviv Keshet <[email protected]>
* [CI/Build] refactor: set timezone of container to UTC (vllm-project#12888) Signed-off-by: Roger Meier <[email protected]>
* Default to `generation_config` from model (vllm-project#12622) Signed-off-by: Harry Mellor <[email protected]>
* [Doc]add doc for Qwen models tool calling (vllm-project#14478) Signed-off-by: WangErXiao <[email protected]>
* [Doc] Added QwQ-32B to the supported models list in the reasoning out… (vllm-project#14479) Signed-off-by: WangErXiao <[email protected]>
* [Bugfix] Make the deviceprofiler include LoRA memory. (vllm-project#14469) Signed-off-by: Jee Jee Li <[email protected]>
* Add training doc signposting to TRL (vllm-project#14439) Signed-off-by: Harry Mellor <[email protected]>
* [Build/BugFix] Fix hopper 12.8 build (vllm-project#14354) Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>
* Add RLHF document (vllm-project#14482) Signed-off-by: Harry Mellor <[email protected]>
* [CI/Build] Use a fixed seed to avoid flaky tests (vllm-project#14480) Signed-off-by: DarkLight1337 <[email protected]>
* [V1] TPU - Add tensor parallel support via Ray (vllm-project#13618) Signed-off-by: Alexander Matveev <[email protected]>
* [VLM] Add TP support for Phi-4-MM (vllm-project#14453) Signed-off-by: Isotr0py <[email protected]>
* [Misc] add `use_tqdm_on_load` to reduce logs (vllm-project#14407) Signed-off-by: Aaron Pham <[email protected]>
* [V1][Core] Fix memory issue with logits & sampling (vllm-project#13776) Signed-off-by: Roger Wang <[email protected]>
* [benchmarks] Add option to use unique jsonschema for each request (vllm-project#14457) Signed-off-by: Russell Bryant <[email protected]>
* [Misc] Don't run ruff at all on 3rd party libs (vllm-project#14493) Signed-off-by: DarkLight1337 <[email protected]>
* Move requirements into their own directory (vllm-project#12547) Signed-off-by: Harry Mellor <[email protected]>
* [Bugfix] DeepSeek Accuracy (vllm-project#14476) Signed-off-by: Lucas Wilkinson <[email protected]>
* [Bugfix] Fix profiling OOM and decouple encoder multimodal profiling (vllm-project#14361) Signed-off-by: Isotr0py <[email protected]>
* Update CODEOWNERS for structured output (vllm-project#14496) Signed-off-by: Russell Bryant <[email protected]>
* [Misc] Upgrade to Python 3.9 typing for additional directories (vllm-project#14492) Signed-off-by: DarkLight1337 <[email protected]>
* [V1] Support bad_words in sampler (vllm-project#13376) Signed-off-by: 22quinn <[email protected]> Co-authored-by: Nick Hill <[email protected]>
* Revert "[V1][Core] Fix memory issue with logits & sampling" (vllm-project#14504) Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]>
* [Attention] Default to FlashMLA backend for MLA (vllm-project#14451) Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>
* [V1][TPU] Remove unnecessary padding for running on TPU. (vllm-project#14467)
* [Feat] Support chunked prefill for LMCache connector (vllm-project#14505) Signed-off-by: YaoJiayi <[email protected]>
* [Bugfix] Fix tqdm progress bar when SamplingParams.n > 1 (vllm-project#12428) Signed-off-by: Yuchen Yan <[email protected]>
* [Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work (vllm-project#14498) Signed-off-by: Isotr0py <[email protected]>
* [Hardware][TPU] Fix the recompiling issue in logits processor after warmup (vllm-project#14510) Signed-off-by: Chengji Yao <[email protected]>
* [Misc] Ensure out-of-tree quantization method recognize by cli args (vllm-project#14328) Signed-off-by: liuyanyi <[email protected]>
* [Bugfix] Wrong requirements path - rocm (vllm-project#14527) Signed-off-by: Martin Hoyer <[email protected]>
* [Feature] Consolidate performance benchmark datasets (vllm-project#14036) Signed-off-by: Jennifer Zhao <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Jennifer Zhao <[email protected]> Co-authored-by: Roger Wang <[email protected]>
* [Misc] Add log information for handle_process_request. (vllm-project#14130) Signed-off-by: chaunceyjiang <[email protected]>
* [Docs] Mention `model_impl` arg when explaining Transformers fallback (vllm-project#14552) Signed-off-by: Harry Mellor <[email protected]>
* [Frontend] support image embeds (vllm-project#13955) Signed-off-by: chaunceyjiang <[email protected]>
* [Kernel] Add more dtype support for GGUF kernels (vllm-project#14043) Signed-off-by: SzymonOzog <[email protected]> Signed-off-by: SzymonOzog <[email protected]>
* [Doc] Update PaliGemma note to a warning (vllm-project#14565) Signed-off-by: DarkLight1337 <[email protected]>
* V1 rocm support (#469)
* Initial commit for V1 successfull compilation
* Small improvement for linear
* Small improvement for linear
* making use of forward_cuda for all except ROPE in llama
---------
Co-authored-by: maleksan85 <[email protected]>
* nightly_fixed_aiter_integration_final_20250305 README update (#470)
* nightly_fixed_aiter_integration_final_20250305 README update (perf results only)
* Update Docker Manifest git hash
* Update Docker Manifest and added nightly_fixed_aiter_integration_final_20250305
* some more updates
* Update AITER section with example
* Updated AITER command with larger batch size and model name
* Fixing typo
* Removed --max-model-len in AITER command
* Updating AITER instructions
* typo
* Another typo
* Whitespace
* modifying whats new section
* Another typo
---------
Co-authored-by: arakowsk-amd <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
---------
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Xiongfei Wei <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Divakar Verma <[email protected]>
Signed-off-by: Cody Yu <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Siyuan Liu <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Sage Moore <[email protected]>
Signed-off-by: Michael Goin <[email protected]>
Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: dangshunya <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Iacopo Poli <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Daivid Savernin-Frenk <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: vincent-4 <[email protected]>
Signed-off-by: Brayden Zhong <[email protected]>
Signed-off-by: pyc96 <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Yuan Tang <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Kyle Huang <[email protected]>
Signed-off-by: Linkun Chen <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: liuyanyi <[email protected]>
Signed-off-by: courage17340 <[email protected]>
Signed-off-by: Jitse Klomp <[email protected]>
Signed-off-by: Dilip Gowda Bhagavan <[email protected]>
Signed-off-by: Rishika Kedia <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: ZhongYingMatrix <[email protected]>
Signed-off-by: Himanshu Jaju <[email protected]>
Signed-off-by: Chengji Yao <[email protected]>
Signed-off-by: luka <[email protected]>
Signed-off-by: Ilya Lavrenov <[email protected]>
Signed-off-by: vincent-pli <[email protected]>
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <[email protected]>
Signed-off-by: root <[email protected]>
Signed-off-by: Aleksandr Malyshev <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Jeremy Arnold <[email protected]>
Signed-off-by: Jinzhen Lin <[email protected]>
Signed-off-by: Yarong Mu <[email protected]>
Signed-off-by: Mathis Felardos <[email protected]>
Signed-off-by: Aviv Keshet <[email protected]>
Signed-off-by: Roger Meier <[email protected]>
Signed-off-by: WangErXiao <[email protected]>
Signed-off-by: Alexander Matveev <[email protected]>
Signed-off-by: 22quinn <[email protected]>
Signed-off-by: YaoJiayi <[email protected]>
Signed-off-by: Yuchen Yan <[email protected]>
Signed-off-by: Martin Hoyer <[email protected]>
Signed-off-by: Jennifer Zhao <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: SzymonOzog <[email protected]>
Signed-off-by: SzymonOzog <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Qubitium-ModelCloud <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: iefgnoix <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Rui Qiao <[email protected]>
Co-authored-by: Zhanwen Chen <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: lkchen <[email protected]>
Co-authored-by: kushanam <[email protected]>
Co-authored-by: Siyuan Liu <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Sage Moore <[email protected]>
Co-authored-by: Nishidha <[email protected]>
Co-authored-by: rainkert <[email protected]>
Co-authored-by: dangshunya <[email protected]>
Co-authored-by: Congcong Chen <[email protected]>
Co-authored-by: Benjamin Chislett <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Iacopo Poli <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Zhe Zhang <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: DaividFrank <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Vincent <[email protected]>
Co-authored-by: Brayden Zhong <[email protected]>
Co-authored-by: Ye Cao <[email protected]>
Co-authored-by: Serena <[email protected]>
Co-authored-by: pyc96 <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Ying Zhong <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
Co-authored-by: Ce Gao <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: Pavani Majety <[email protected]>
Co-authored-by: kYLe <[email protected]>
Co-authored-by: NickLucche <[email protected]>
Co-authored-by: Yanyi Liu <[email protected]>
Co-authored-by: Irina Yuryeva <[email protected]>
Co-authored-by: Thomas Parnell <[email protected]>
Co-authored-by: courage17340 <[email protected]>
Co-authored-by: Jitse Klomp <[email protected]>
Co-authored-by: Dilip Gowda Bhagavan <[email protected]>
Co-authored-by: Rishika Kedia <[email protected]>
Co-authored-by: Burkhard Ringlein <[email protected]>
Co-authored-by: Jan van Lunteren <[email protected]>
Co-authored-by: Himanshu Jaju <[email protected]>
Co-authored-by: Chengji Yao <[email protected]>
Co-authored-by: Daniel Li <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Ilya Lavrenov <[email protected]>
Co-authored-by: Peng Li <[email protected]>
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <[email protected]>
Co-authored-by: Aleksandr Malyshev <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Aleksandr Malyshev <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: York-RDWang <[email protected]>
Co-authored-by: Jeremy Arnold <[email protected]>
Co-authored-by: Jinzhen Lin <[email protected]>
Co-authored-by: yarongmu-google <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Mathis Felardos <[email protected]>
Co-authored-by: Aviv Keshet <[email protected]>
Co-authored-by: Roger Meier <[email protected]>
Co-authored-by: Robin <[email protected]>
Co-authored-by: Alexander Matveev <[email protected]>
Co-authored-by: 22quinn <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Jiayi Yao <[email protected]>
Co-authored-by: Yuchen Yan <[email protected]>
Co-authored-by: Martin Hoyer <[email protected]>
Co-authored-by: Jennifer Zhao <[email protected]>
Co-authored-by: Jennifer Zhao <[email protected]>
Co-authored-by: Chauncey <[email protected]>
Co-authored-by: Szymon Ożóg <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Mcirino1 <[email protected]>
Co-authored-by: arakowsk-amd <[email protected]>
1 parent 5e31d5c commit a699a11

File tree

379 files changed: +19127 / -4444 lines changed

.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh

Lines changed: 1 addition & 1 deletion
@@ -426,7 +426,7 @@ main() {
 
 pip install -U transformers
 
-pip install -r requirements-dev.txt
+pip install -r requirements/dev.txt
 which genai-perf
 
 # check storage

.buildkite/run-amd-test.sh

Lines changed: 6 additions & 1 deletion
@@ -93,7 +93,12 @@ if [[ $commands == *" kernels "* ]]; then
 --ignore=kernels/test_rand.py \
 --ignore=kernels/test_sampler.py \
 --ignore=kernels/test_cascade_flash_attn.py \
---ignore=kernels/test_mamba_mixer2.py"
+--ignore=kernels/test_mamba_mixer2.py \
+--ignore=kernels/test_aqlm.py \
+--ignore=kernels/test_machete_mm.py \
+--ignore=kernels/test_mha_attn.py \
+--ignore=kernels/test_block_fp8.py \
+--ignore=kernels/test_permute_cols.py"
 fi
 
 #ignore certain Entrypoints tests

.buildkite/run-cpu-test.sh

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ function cpu_tests() {
 # Run basic model test
 docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
 set -e
-pip install -r vllm/requirements-test.txt
+pip install -r vllm/requirements/test.txt
 pytest -v -s tests/models/decoder_only/language -m cpu_model
 pytest -v -s tests/models/embedding/language -m cpu_model
 pytest -v -s tests/models/encoder_decoder/language -m cpu_model

.buildkite/test-pipeline.yaml

Lines changed: 8 additions & 3 deletions
@@ -35,7 +35,7 @@ steps:
   fast_check: true
   no_gpu: True
   commands:
-  - pip install -r requirements-docs.txt
+  - pip install -r ../../requirements/docs.txt
   - SPHINXOPTS=\"-W\" make html
   # Check API reference (if it fails, you may have missing mock imports)
   - grep \"sig sig-object py\" build/html/api/inference_params.html
@@ -78,6 +78,7 @@ steps:
   - tests/basic_correctness/test_preemption
   - tests/basic_correctness/test_cumem.py
   commands:
+  - export VLLM_WORKER_MULTIPROC_METHOD=spawn
   - pytest -v -s basic_correctness/test_cumem.py
   - pytest -v -s basic_correctness/test_basic_correctness.py
   - pytest -v -s basic_correctness/test_cpu_offload.py
@@ -115,6 +116,7 @@ steps:
   - tests/entrypoints/test_chat_utils
   - tests/entrypoints/offline_mode
   commands:
+  - export VLLM_WORKER_MULTIPROC_METHOD=spawn
   - pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py --ignore=entrypoints/llm/test_collective_rpc.py
   - pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
   - pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
@@ -146,8 +148,10 @@ steps:
   - pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py
   # TODO: create a dedicated test section for multi-GPU example tests
   # when we have multiple distributed example tests
-  - python3 ../examples/offline_inference/rlhf.py
-  - RAY_DEDUP_LOGS=0 python3 ../examples/offline_inference/rlhf_colocate.py
+  - pushd ../examples/offline_inference
+  - python3 rlhf.py
+  - RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
+  - popd
 
 - label: Metrics, Tracing Test # 10min
   num_gpus: 2
@@ -204,6 +208,7 @@ steps:
   - VLLM_USE_V1=1 pytest -v -s v1/engine
   - VLLM_USE_V1=1 pytest -v -s v1/sample
   - VLLM_USE_V1=1 pytest -v -s v1/worker
+  - VLLM_USE_V1=1 pytest -v -s v1/structured_output
   - VLLM_USE_V1=1 pytest -v -s v1/test_stats.py
   - VLLM_USE_V1=1 pytest -v -s v1/test_utils.py
   # TODO: accuracy does not match, whether setting
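The two `export VLLM_WORKER_MULTIPROC_METHOD=spawn` lines added above switch the affected test steps to Python's "spawn" start method: a forked worker inherits an already-initialized CUDA context, which is unsafe, while "spawn" starts each worker in a fresh interpreter. A minimal Python sketch of the mechanism the env var selects (illustrative only, not code from this commit):

import multiprocessing as mp

def worker(rank: int) -> None:
    # A spawned child starts in a fresh interpreter, so it is safe to
    # initialize CUDA here; a forked child would inherit the parent's context.
    print(f"worker {rank} running")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # the behavior VLLM_WORKER_MULTIPROC_METHOD=spawn requests
    procs = [ctx.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()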

.github/mergify.yml

Lines changed: 15 additions & 0 deletions
@@ -36,6 +36,21 @@ pull_request_rules:
       add:
         - frontend
 
+- name: label-multi-modality
+  description: Automatically apply multi-modality label
+  conditions:
+    - or:
+      - files~=^vllm/multimodal/
+      - files~=^tests/multimodal/
+      - files~=^tests/models/multimodal/
+      - files~=^tests/models/*/audio_language/
+      - files~=^tests/models/*/vision_language/
+      - files=tests/models/test_vision.py
+  actions:
+    label:
+      add:
+        - multi-modality
+
 - name: label-structured-output
   description: Automatically apply structured-output label
   conditions:
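The rule added above fires when any file touched by a PR matches one of the `files~=` regular expressions, and then applies the multi-modality label. A rough Python analogue of that matching, as an illustration rather than Mergify's actual implementation:

import re

# A subset of the patterns from the rule above, for illustration.
PATTERNS = [r"^vllm/multimodal/", r"^tests/multimodal/", r"^tests/models/multimodal/"]

def needs_multimodality_label(changed_files: list[str]) -> bool:
    # Mergify evaluates each files~= condition against every changed path.
    return any(re.match(p, path) for p in PATTERNS for path in changed_files)

print(needs_multimodality_label(["vllm/multimodal/processing.py"]))  # True
print(needs_multimodality_label(["docs/index.md"]))                  # False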

.github/workflows/scripts/build.sh

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ python_executable=python3
 
 # Update paths
 # Install requirements
-$python_executable -m pip install -r requirements-rocm.txt
+$python_executable -m pip install -r requirements/rocm.txt
 
 # Limit the number of parallel jobs to avoid OOM
 export MAX_JOBS=1

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -197,7 +197,7 @@ _build/
 hip_compat.h
 
 # Benchmark dataset
-benchmarks/*.json
+benchmarks/**/*.json
 
 # Linting
 actionlint
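The `.gitignore` change above widens a single-level glob to a recursive one: `benchmarks/*.json` only matched JSON files directly under `benchmarks/`, while `benchmarks/**/*.json` also matches files in nested result directories. A quick illustration using pathlib, which follows the same `*` vs `**` convention (assumes a local `benchmarks/` tree exists):

from pathlib import Path

root = Path("benchmarks")
top_level = sorted(str(p) for p in root.glob("*.json"))     # reach of the old pattern
recursive = sorted(str(p) for p in root.glob("**/*.json"))  # reach of the new pattern
print(top_level)
print(recursive)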

.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions
@@ -44,8 +44,8 @@ repos:
     rev: 0.6.2
     hooks:
       - id: pip-compile
-        args: [requirements-test.in, -o, requirements-test.txt]
-        files: ^requirements-test\.(in|txt)$
+        args: [requirements/test.in, -o, requirements/test.txt]
+        files: ^requirements/test\.(in|txt)$
   - repo: local
     hooks:
       - id: mypy-local

.readthedocs.yaml

Lines changed: 1 addition & 1 deletion
@@ -18,4 +18,4 @@ formats: []
 # Optionally declare the Python requirements required to build your docs
 python:
   install:
-    - requirements: docs/requirements-docs.txt
+    - requirements: requirements/docs.txt

CMakeLists.txt

Lines changed: 54 additions & 26 deletions
@@ -31,7 +31,7 @@ set(ignoreMe "${VLLM_PYTHON_PATH}")
 set(PYTHON_SUPPORTED_VERSIONS "3.9" "3.10" "3.11" "3.12")
 
 # Supported NVIDIA architectures.
-set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0")
+set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0")
 
 # Supported AMD GPU architectures.
 set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201")
@@ -312,7 +312,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   # Only build Marlin kernels if we are building for at least some compatible archs.
   # Keep building Marlin for 9.0 as there are some group sizes and shapes that
   # are not supported by Machete yet.
-  cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
+  cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0" "${CUDA_ARCHS}")
   if (MARLIN_ARCHS)
     set(MARLIN_SRCS
       "csrc/quantization/fp8/fp8_marlin.cu"
@@ -334,7 +334,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 
   # Only build AllSpark kernels if we are building for at least some compatible archs.
   cuda_archs_loose_intersection(ALLSPARK_ARCHS "8.0;8.6;8.7;8.9" "${CUDA_ARCHS}")
-  if (ALLSPARK_ARCHS)
+  if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND ALLSPARK_ARCHS)
     set(ALLSPARK_SRCS
       "csrc/quantization/gptq_allspark/allspark_repack.cu"
       "csrc/quantization/gptq_allspark/allspark_qgemm_w8a16.cu")
@@ -345,46 +345,74 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
     message(STATUS "Building AllSpark kernels for archs: ${ALLSPARK_ARCHS}")
   else()
     message(STATUS "Not building AllSpark kernels as no compatible archs found"
-            " in CUDA target architectures")
+            " in CUDA target architectures, or CUDA not >= 12.0")
   endif()
 
+
+  set(SCALED_MM_3X_ARCHS)
   # The cutlass_scaled_mm kernels for Hopper (c3x, i.e. CUTLASS 3.x) require
-  # CUDA 12.0 or later (and only work on Hopper, 9.0a for now).
-  cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0a" "${CUDA_ARCHS}")
-  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
+  # CUDA 12.0 or later
+  cuda_archs_loose_intersection(SCALED_MM_ARCHS "9.0a;" "${CUDA_ARCHS}")
+  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_ARCHS)
     set(SRCS
-      "csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu"
+      "csrc/quantization/cutlass_w8a8/scaled_mm_c3x_sm90.cu"
       "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm90_fp8.cu"
       "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm90_int8.cu"
       "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_azp_sm90_int8.cu"
       "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_blockwise_sm90_fp8.cu")
     set_gencode_flags_for_srcs(
       SRCS "${SRCS}"
-      CUDA_ARCHS "${SCALED_MM_3X_ARCHS}")
+      CUDA_ARCHS "${SCALED_MM_ARCHS}")
     list(APPEND VLLM_EXT_SRC "${SRCS}")
-    list(APPEND VLLM_GPU_FLAGS "-DENABLE_SCALED_MM_C3X=1")
-    message(STATUS "Building scaled_mm_c3x for archs: ${SCALED_MM_3X_ARCHS}")
+    list(APPEND VLLM_GPU_FLAGS "-DENABLE_SCALED_MM_SM90=1")
+    # Let scaled_mm_c2x know it doesn't need to build these arches
+    list(APPEND SCALED_MM_3X_ARCHS "${SCALED_MM_ARCHS}")
+    message(STATUS "Building scaled_mm_c3x_sm90 for archs: ${SCALED_MM_ARCHS}")
   else()
-    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
-      message(STATUS "Not building scaled_mm_c3x as CUDA Compiler version is "
+    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_ARCHS)
+      message(STATUS "Not building scaled_mm_c3x_sm90 as CUDA Compiler version is "
                      "not >= 12.0, we recommend upgrading to CUDA 12.0 or "
                      "later if you intend on running FP8 quantized models on "
                      "Hopper.")
    else()
-      message(STATUS "Not building scaled_mm_c3x as no compatible archs found "
+      message(STATUS "Not building scaled_mm_c3x_sm90 as no compatible archs found "
                      "in CUDA target architectures")
    endif()
+  endif()
 
-  # clear SCALED_MM_3X_ARCHS so the scaled_mm_c2x kernels know we didn't
-  # build any 3x kernels
-  set(SCALED_MM_3X_ARCHS)
+  # The cutlass_scaled_mm kernels for Blackwell (c3x, i.e. CUTLASS 3.x) require
+  # CUDA 12.8 or later
+  cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;12.0a" "${CUDA_ARCHS}")
+  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND SCALED_MM_ARCHS)
+    set(SRCS
+      "csrc/quantization/cutlass_w8a8/scaled_mm_c3x_sm100.cu"
+      "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm100_fp8.cu"
+    )
+    set_gencode_flags_for_srcs(
+      SRCS "${SRCS}"
+      CUDA_ARCHS "${SCALED_MM_ARCHS}")
+    list(APPEND VLLM_EXT_SRC "${SRCS}")
+    list(APPEND VLLM_GPU_FLAGS "-DENABLE_SCALED_MM_SM100=1")
+    # Let scaled_mm_c2x know it doesn't need to build these arches
+    list(APPEND SCALED_MM_3X_ARCHS "${SCALED_MM_ARCHS}")
+    message(STATUS "Building scaled_mm_c3x_sm100 for archs: ${SCALED_MM_ARCHS}")
+  else()
+    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND SCALED_MM_ARCHS)
+      message(STATUS "Not building scaled_mm_c3x_sm100 as CUDA Compiler version is "
+                     "not >= 12.8, we recommend upgrading to CUDA 12.8 or "
+                     "later if you intend on running FP8 quantized models on "
+                     "Blackwell.")
+    else()
+      message(STATUS "Not building scaled_mm_c3x_100 as no compatible archs found "
+                     "in CUDA target architectures")
+    endif()
   endif()
 
   #
   # For the cutlass_scaled_mm kernels we want to build the c2x (CUTLASS 2.x)
   # kernels for the remaining archs that are not already built for 3x.
   cuda_archs_loose_intersection(SCALED_MM_2X_ARCHS
-    "7.5;8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
+    "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0" "${CUDA_ARCHS}")
   # subtract out the archs that are already built for 3x
   list(REMOVE_ITEM SCALED_MM_2X_ARCHS ${SCALED_MM_3X_ARCHS})
   if (SCALED_MM_2X_ARCHS)
@@ -409,17 +437,17 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   # 2:4 Sparse Kernels
 
   # The 2:4 sparse kernels cutlass_scaled_sparse_mm and cutlass_compressor
-  # require CUDA 12.2 or later (and only work on Hopper, 9.0a for now).
-  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
+  # require CUDA 12.2 or later (and only work on Hopper and Blackwell).
+  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_ARCHS)
     set(SRCS "csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")
     set_gencode_flags_for_srcs(
       SRCS "${SRCS}"
-      CUDA_ARCHS "${SCALED_MM_3X_ARCHS}")
+      CUDA_ARCHS "${SCALED_MM_ARCHS}")
     list(APPEND VLLM_EXT_SRC "${SRCS}")
     list(APPEND VLLM_GPU_FLAGS "-DENABLE_SPARSE_SCALED_MM_C3X=1")
-    message(STATUS "Building sparse_scaled_mm_c3x for archs: ${SCALED_MM_3X_ARCHS}")
+    message(STATUS "Building sparse_scaled_mm_c3x for archs: ${SCALED_MM_ARCHS}")
   else()
-    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
+    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_ARCHS)
      message(STATUS "Not building sparse_scaled_mm_c3x kernels as CUDA Compiler version is "
                     "not >= 12.2, we recommend upgrading to CUDA 12.2 or later "
                     "if you intend on running FP8 sparse quantized models on Hopper.")
@@ -434,8 +462,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND FP4_ARCHS)
     set(SRCS
       "csrc/quantization/fp4/nvfp4_quant_kernels.cu"
-      "csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu"
-    )
+      "csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu")
     set_gencode_flags_for_srcs(
       SRCS "${SRCS}"
       CUDA_ARCHS "${FP4_ARCHS}")
@@ -534,6 +561,7 @@ define_gpu_extension_target(
   COMPILE_FLAGS ${VLLM_GPU_FLAGS}
   ARCHITECTURES ${VLLM_GPU_ARCHES}
   INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR}
+  INCLUDE_DIRECTORIES ${CUTLASS_TOOLS_UTIL_INCLUDE_DIR}
   USE_SABI 3
   WITH_SOABI)
 
@@ -557,7 +585,7 @@ set_gencode_flags_for_srcs(
   CUDA_ARCHS "${CUDA_ARCHS}")
 
 if(VLLM_GPU_LANG STREQUAL "CUDA")
-  cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
+  cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0" "${CUDA_ARCHS}")
   if (MARLIN_MOE_ARCHS)
     set(MARLIN_MOE_SRC
       "csrc/moe/marlin_kernels/marlin_moe_kernel.h"
