Commit 5976f48

Authored by gshtras, jeejeelee, heheda12345, ywang96, and DarkLight1337
Merge pull request #358 from ROCm/upstream_merge_25_01_13
This merge pulls in the following upstream commits:

- [Bugfix][V1] Fix molmo text-only inputs (vllm-project#11676)
- [Kernel] Move attn_type to Attention.__init__() (vllm-project#11690)
- [V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision (vllm-project#11685)
- [Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (vllm-project#11772)
- [Model] Future-proof Qwen2-Audio multi-modal processor (vllm-project#11776)
- [XPU] Make pp group initilized for pipeline-parallelism (vllm-project#11648)
- [Doc][3/N] Reorganize Serving section (vllm-project#11766)
- [Kernel][LoRA] Punica prefill kernels fusion (vllm-project#11234)
- [Bugfix] Update attention interface in `Whisper` (vllm-project#11784)
- [CI] Fix neuron CI and run offline tests (vllm-project#11779)
- fix init error for MessageQueue when n_local_reader is zero (vllm-project#11768)
- [Doc] Create a vulnerability management team (vllm-project#9925)
- [CI][CPU] adding build number to docker image name (vllm-project#11788)
- [V1][Doc] Update V1 support for `LLaVa-NeXT-Video` (vllm-project#11798)
- [Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation (vllm-project#11800)
- [doc] add doc to explain how to use uv (vllm-project#11773)
- [V1] Support audio language models on V1 (vllm-project#11733)
- [doc] update how pip can install nightly wheels (vllm-project#11806)
- [Doc] Add note to `gte-Qwen2` models (vllm-project#11808)
- [optimization] remove python function call for custom op (vllm-project#11750)
- [Bugfix] update the prefix for qwen2 (vllm-project#11795)
- [Doc] Add documentation for using EAGLE in vLLM (vllm-project#11417)
- [Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 (vllm-project#11794)
- [Doc] Group examples into categories (vllm-project#11782)
- [Bugfix] Fix image input for Pixtral-HF (vllm-project#11741)
- [Misc] sort torch profiler table by kernel timing (vllm-project#11813)
- Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… (vllm-project#11824)
- Fixed docker build for ppc64le (vllm-project#11518)
- [OpenVINO] Fixed Docker.openvino build (vllm-project#11732)
- [Bugfix] Add checks for LoRA and CPU offload (vllm-project#11810)
- [Docs] reorganize sponsorship page (vllm-project#11639)
- [Bug] Fix pickling of `ModelConfig` when RunAI Model Streamer is used (vllm-project#11825)
- [misc] improve memory profiling (vllm-project#11809)
- [doc] update wheels url (vllm-project#11830)
- [Docs] Update sponsor name: 'Novita' to 'Novita AI' (vllm-project#11833)
- [Hardware][Apple] Native support for macOS Apple Silicon (vllm-project#11696)
- [torch.compile] consider relevant code in compilation cache (vllm-project#11614)
- [VLM] Reorganize profiling/processing-related code (vllm-project#11812)
- [Doc] Move examples into categories (vllm-project#11840)
- [Doc][4/N] Reorganize API Reference (vllm-project#11843)
- [CI/Build][Bugfix] Fix CPU CI image clean up (vllm-project#11836)
- [Bugfix][XPU] fix silu_and_mul (vllm-project#11823)
- [Misc] Move some model utils into vision file (vllm-project#11848)
- [Doc] Expand Multimodal API Reference (vllm-project#11852)
- [Misc] add some explanations for BlockHashType (vllm-project#11847)
- [TPU][Quantization] TPU `W8A8` (vllm-project#11785)
- [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models (vllm-project#11698)
- [Docs] Add Google Cloud Meetup (vllm-project#11864)
- [CI] Turn on basic correctness tests for V1 (vllm-project#10864)
- treat do_lower_case in the same way as the sentence-transformers library (vllm-project#11815)
- [Doc] Recommend uv and python 3.12 for quickstart guide (vllm-project#11849)
- [Misc] Move `print_*_once` from utils to logger (vllm-project#11298)
- [Doc] Intended links Python multiprocessing library (vllm-project#11878)
- [perf] fix current stream (vllm-project#11870)
- [Bugfix] Override dunder methods of placeholder modules (vllm-project#11882)
- [Bugfix] fix beam search input errors and latency benchmark script (vllm-project#11875)
- [Doc] Add model development API Reference (vllm-project#11884)
- [platform] Allow platform specify attention backend (vllm-project#11609)
- [ci] try to fix flaky multi-step tests (vllm-project#11894)
- [Misc] Provide correct Pixtral-HF chat template (vllm-project#11891)
- [Docs] Add Modal to deployment frameworks (vllm-project#11907)
- [Doc][5/N] Move Community and API Reference to the bottom (vllm-project#11896)
- [VLM] Enable tokenized inputs for merged multi-modal processor (vllm-project#11900)
- [Doc] Show default pooling method in a table (vllm-project#11904)
- [torch.compile] Hide KV cache behind torch.compile boundary (vllm-project#11677)
- [Bugfix] Validate lora adapters to avoid crashing server (vllm-project#11727)
- [BUGFIX] Fix `UnspecifiedPlatform` package name (vllm-project#11916)
- [ci] fix gh200 tests (vllm-project#11919)
- [misc] remove python function call for custom activation op (vllm-project#11885)
- [platform] support pytorch custom op pluggable (vllm-project#11328)
- Replace "online inference" with "online serving" (vllm-project#11923)
- [ci] Fix sampler tests (vllm-project#11922)
- [Doc] [1/N] Initial guide for merged multi-modal processor (vllm-project#11925)
- [platform] support custom torch.compile backend key (vllm-project#11318)
- [Doc] Rename offline inference examples (vllm-project#11927)
- [Docs] Fix docstring in `get_ip` function (vllm-project#11932)
- Doc fix in `benchmark_long_document_qa_throughput.py` (vllm-project#11933)
- [Hardware][CPU] Support MOE models on x86 CPU (vllm-project#11831)
- [Misc] Clean up debug code in Deepseek-V3 (vllm-project#11930)
- [Misc] Update benchmark_prefix_caching.py fixed example usage (vllm-project#11920)
- [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (vllm-project#11939)
- [mypy] Fix mypy warnings in api_server.py (vllm-project#11941)
- [ci] fix broken distributed-tests-4-gpus (vllm-project#11937)
- [Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design (vllm-project#11672)
- [Bugfix] fused_experts_impl wrong compute type for float32 (vllm-project#11921)
- [CI/Build] Move model-specific multi-modal processing tests (vllm-project#11934)
- [Doc] Basic guide for writing unit tests for new models (vllm-project#11951)
- [Bugfix] Fix RobertaModel loading (vllm-project#11940)
- [Model] Add cogagent model support vLLM (vllm-project#11742)
- [V1] Avoid sending text prompt to core engine (vllm-project#11963)
- [CI/Build] Add markdown linter (vllm-project#11857)
- [Model] Initialize support for Deepseek-VL2 models (vllm-project#11578)
- [Hardware][CPU] Multi-LoRA implementation for the CPU backend (vllm-project#11100)
- [Hardware][TPU] workaround fix for MoE on TPU (vllm-project#11764)
- [V1][Core][1/n] Logging and Metrics (vllm-project#11962)
- [Model] Support GGUF models newly added in `transformers` 4.46.0 (vllm-project#9685)
- [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (vllm-project#11973)
- [MISC] fix typo in kv transfer send recv test (vllm-project#11983)
- [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (vllm-project#11979)
- [CI][Spec Decode] fix: broken test for EAGLE model (vllm-project#11972)
- [Misc] Fix Deepseek V2 fp8 kv-scale remapping (vllm-project#11947)
- [Misc] Minor Changes about Worker (vllm-project#11555)
- [platform] add ray_device_key (vllm-project#11948)
- Fix Max Token ID for Qwen-VL-Chat (vllm-project#11980)
- [Kernel] unified_attention for Attention.forward (vllm-project#11967)
- [Doc][V1] Update model implementation guide for V1 support (vllm-project#11998)
- [Doc] Organise installation documentation into categories and tabs (vllm-project#11935)
- [platform] add device_control env var (vllm-project#12009)
- [Platform] Move get_punica_wrapper() function to Platform (vllm-project#11516)
- bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (vllm-project#11982)
- Using list
- Revert "[misc] improve memory profiling (vllm-project#11809)" (reverts commit 889e662)
- Trying to make scales work with compileable attention
- Docs lint

Signed-off-by: Jee Jee Li; Chen Zhang; Roger Wang; DarkLight1337; yisheng; Abatom; Liangfu Chen; Russell Bryant; Yuan Zhou; youkaichao; Sourashis Roy; Harry Mellor; Nishidha Panpaliya; Ilya Lavrenov; simon-mo; Wallas Santos; jiang1.li; yan ma; Randall Smith; Max de Bayser; mgoin; Maxime Fournioux; Ye Qi; wangxiyuan; Mengqing Cao; Joe Runde; Kunshang Ji; Kuntai Du; Isotr0py; Ren MinMin; Travis Johnson; Fred Reiss; Sungjae Lee; shaochangxu.scx; NickLucche; Rafael Vasquez; Akshat Tripathi; Oleg Mosalov; Yida Wu; Chenguang Li; Alex-Brooks; Shanshan Shen; elijah

Co-authored-by: Jee Jee Li; Chen Zhang; Roger Wang; DarkLight1337; YiSheng5; Zhonghua Deng; Liangfu Chen; XiaobingZhang; Russell Bryant; Yuan; youkaichao; Cyrus Leung; jiangjiadi; jiadi.jjd; sroy745; Jie Fu (傅杰); Harry Mellor; Divakar Verma; WangErXiao; Nishidha; Ilya Lavrenov; Simon Mo; Wallas Henrique; Michael Goin; Li, Jiang; Yan Ma; Robert Shaw; Woosuk Kwon; rasmith; Tyler Michael Smith; Maximilien de Bayser; Maxime Fournioux; Guspan Tanadi; Ye (Charlotte) Qi; yeq; wangxiyuan; Mengqing Cao; Charles Frye; Joe Runde; Kunshang Ji; cennn; Kuntai Du; Isotr0py; minmin; Ren MinMin; Travis Johnson; Fred Reiss; Sungjae Lee; shaochangxu; shaochangxu.scx; Nicolò Lucchesi; sixgod; Rafael Vasquez; Akshat Tripathi; Oleg Mosalov; Avshalom Manevich; Yangcheng Li; Siyuan Li; Concurrensee; Chenguang Li; Alex Brooks; Shanshan Shen; elijah
Parents: 113274a + eb4abfd · Commit: 5976f48

File tree

483 files changed: 12,609 additions, 6,071 deletions


.buildkite/run-cpu-test.sh

Lines changed: 20 additions & 17 deletions
@@ -9,63 +9,60 @@ CORE_RANGE=${CORE_RANGE:-48-95}
 NUMA_NODE=${NUMA_NODE:-1}
 
 # Try building the docker image
-numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test -f Dockerfile.cpu .
-numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .
+numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test-"$BUILDKITE_BUILD_NUMBER" -f Dockerfile.cpu .
+numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 -f Dockerfile.cpu .
 
 # Setup cleanup
-remove_docker_container() { docker rm -f cpu-test-"$NUMA_NODE" cpu-test-avx2-"$NUMA_NODE" || true; }
+remove_docker_container() { set -e; docker rm -f cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" || true; }
 trap remove_docker_container EXIT
 remove_docker_container
 
 # Run the image, setting --shm-size=4g for tensor parallel.
 docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
-  --cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test
+  --cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"
 docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
-  --cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2-"$NUMA_NODE" cpu-test-avx2
+  --cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2
 
 function cpu_tests() {
   set -e
   export NUMA_NODE=$2
 
   # offline inference
-  docker exec cpu-test-avx2-"$NUMA_NODE" bash -c "
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" bash -c "
     set -e
-    python3 examples/offline_inference.py"
+    python3 examples/offline_inference/basic.py"
 
   # Run basic model test
-  docker exec cpu-test-"$NUMA_NODE" bash -c "
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
    set -e
-    pip install pytest pytest-asyncio \
-      decord einops librosa peft Pillow sentence-transformers soundfile \
-      transformers_stream_generator matplotlib datamodel_code_generator
-    pip install torchvision --index-url https://download.pytorch.org/whl/cpu
+    pip install -r vllm/requirements-test.txt
     pytest -v -s tests/models/decoder_only/language -m cpu_model
     pytest -v -s tests/models/embedding/language -m cpu_model
     pytest -v -s tests/models/encoder_decoder/language -m cpu_model
     pytest -v -s tests/models/decoder_only/audio_language -m cpu_model
     pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"
 
   # Run compressed-tensor test
-  docker exec cpu-test-"$NUMA_NODE" bash -c "
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
    set -e
    pytest -s -v \
      tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
      tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token"
 
   # Run AWQ test
-  docker exec cpu-test-"$NUMA_NODE" bash -c "
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
    set -e
    pytest -s -v \
      tests/quantization/test_ipex_quant.py"
 
   # Run chunked-prefill and prefix-cache test
-  docker exec cpu-test-"$NUMA_NODE" bash -c "
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
    set -e
    pytest -s -v -k cpu_model \
      tests/basic_correctness/test_chunked_prefill.py"
 
-  # online inference
-  docker exec cpu-test-"$NUMA_NODE" bash -c "
+  # online serving
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
    set -e
    export VLLM_CPU_KVCACHE_SPACE=10
    export VLLM_CPU_OMP_THREADS_BIND=$1
@@ -78,6 +75,12 @@ function cpu_tests() {
     --num-prompts 20 \
     --endpoint /v1/completions \
     --tokenizer facebook/opt-125m"
+
+  # Run multi-lora tests
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
+    set -e
+    pytest -s -v \
+      tests/lora/test_qwen2vl.py"
 }
 
 # All of CPU tests are expected to be finished less than 25 mins.
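The change above suffixes every image and container name with `$BUILDKITE_BUILD_NUMBER` and removes them in an `EXIT` trap, so concurrent CI builds on the same host cannot clobber each other's containers. A minimal standalone sketch of that naming-plus-cleanup scheme (the helper function names are illustrative, not from the repo):

```shell
#!/usr/bin/env bash
# Illustrative sketch: build-scoped Docker names so parallel CI runs never collide.
BUILDKITE_BUILD_NUMBER="${BUILDKITE_BUILD_NUMBER:-0}"
NUMA_NODE="${NUMA_NODE:-1}"

# Image tag carries the build number plus an optional variant suffix (e.g. avx2).
image_tag() {
  echo "cpu-test-${BUILDKITE_BUILD_NUMBER}${1:+-$1}"
}

# Container name additionally carries the NUMA node the test is pinned to.
container_name() {
  echo "$(image_tag "$1")-${NUMA_NODE}"
}

# Remove both containers on any exit; '|| true' keeps the trap from failing
# when a container was never started.
remove_docker_container() {
  docker rm -f "$(container_name)" "$(container_name avx2)" >/dev/null 2>&1 || true
}
trap remove_docker_container EXIT
```

With `BUILDKITE_BUILD_NUMBER=1234` and `NUMA_NODE=1`, `container_name avx2` yields `cpu-test-1234-avx2-1`, matching the names in the diff.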

.buildkite/run-gh200-test.sh

Lines changed: 1 addition & 1 deletion
@@ -24,5 +24,5 @@ remove_docker_container
 
 # Run the image and test offline inference
 docker run --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
-    python3 examples/offline_inference.py
+    python3 examples/offline_inference/basic.py
 '

.buildkite/run-hpu-test.sh

Lines changed: 1 addition & 1 deletion
@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
 remove_docker_container
 
 # Run the image and launch offline inference
-docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference.py
+docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic.py

.buildkite/run-neuron-test.sh

Lines changed: 27 additions & 26 deletions
@@ -3,6 +3,18 @@
 # This script build the Neuron docker image and run the API server inside the container.
 # It serves a sanity check for compilation and basic model usage.
 set -e
+set -v
+
+image_name="neuron/vllm-ci"
+container_name="neuron_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)"
+
+HF_CACHE="$(realpath ~)/huggingface"
+mkdir -p "${HF_CACHE}"
+HF_MOUNT="/root/.cache/huggingface"
+
+NEURON_COMPILE_CACHE_URL="$(realpath ~)/neuron_compile_cache"
+mkdir -p "${NEURON_COMPILE_CACHE_URL}"
+NEURON_COMPILE_CACHE_MOUNT="/root/.cache/neuron_compile_cache"
 
 # Try building the docker image
 aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com
@@ -13,41 +25,30 @@ if [ -f /tmp/neuron-docker-build-timestamp ]; then
     last_build=$(cat /tmp/neuron-docker-build-timestamp)
     current_time=$(date +%s)
     if [ $((current_time - last_build)) -gt 86400 ]; then
+        docker image prune -f
         docker system prune -f
+        rm -rf "${HF_MOUNT:?}/*"
+        rm -rf "${NEURON_COMPILE_CACHE_MOUNT:?}/*"
         echo "$current_time" > /tmp/neuron-docker-build-timestamp
     fi
 else
     date "+%s" > /tmp/neuron-docker-build-timestamp
 fi
 
-docker build -t neuron -f Dockerfile.neuron .
+docker build -t "${image_name}" -f Dockerfile.neuron .
 
 # Setup cleanup
-remove_docker_container() { docker rm -f neuron || true; }
+remove_docker_container() {
+    docker image rm -f "${image_name}" || true;
+}
 trap remove_docker_container EXIT
-remove_docker_container
 
 # Run the image
-docker run --device=/dev/neuron0 --device=/dev/neuron1 --network host --name neuron neuron python3 -m vllm.entrypoints.api_server \
-    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --max-num-seqs 8 --max-model-len 128 --block-size 128 --device neuron --tensor-parallel-size 2 &
-
-# Wait for the server to start
-wait_for_server_to_start() {
-    timeout=300
-    counter=0
-
-    while [ "$(curl -s -o /dev/null -w '%{http_code}' localhost:8000/health)" != "200" ]; do
-        sleep 1
-        counter=$((counter + 1))
-        if [ $counter -ge $timeout ]; then
-            echo "Timeout after $timeout seconds"
-            break
-        fi
-    done
-}
-wait_for_server_to_start
-
-# Test a simple prompt
-curl -X POST -H "Content-Type: application/json" \
-    localhost:8000/generate \
-    -d '{"prompt": "San Francisco is a"}'
+docker run --rm -it --device=/dev/neuron0 --device=/dev/neuron1 --network host \
+    -v "${HF_CACHE}:${HF_MOUNT}" \
+    -e "HF_HOME=${HF_MOUNT}" \
+    -v "${NEURON_COMPILE_CACHE_URL}:${NEURON_COMPILE_CACHE_MOUNT}" \
+    -e "NEURON_COMPILE_CACHE_URL=${NEURON_COMPILE_CACHE_MOUNT}" \
+    --name "${container_name}" \
+    ${image_name} \
+    /bin/bash -c "python3 /workspace/vllm/examples/offline_inference/neuron.py"
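The rewrite above deletes the ad-hoc `wait_for_server_to_start` polling loop along with the API-server smoke test. That poll-until-healthy pattern generalizes to any readiness probe; a self-contained sketch of the same logic (the `wait_for` name is my own, not from the repo):

```shell
#!/usr/bin/env bash
# Generic form of the deleted wait_for_server_to_start loop:
# retry a probe command once per second until it succeeds or we time out.
wait_for() {  # usage: wait_for <timeout_seconds> <probe-command...>
  local timeout=$1 counter=0
  shift
  until "$@"; do
    sleep 1
    counter=$((counter + 1))
    if [ "$counter" -ge "$timeout" ]; then
      echo "Timeout after $timeout seconds" >&2
      return 1
    fi
  done
}

# The original probe was an HTTP health check, roughly:
#   wait_for 300 curl -sf localhost:8000/health
```

Note the original loop only `break`s on timeout and then posts the prompt anyway; returning non-zero, as here, lets `set -e` abort the script instead.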

.buildkite/run-openvino-test.sh

Lines changed: 1 addition & 1 deletion
@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
 remove_docker_container
 
 # Run the image and launch offline inference
-docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference.py
+docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference/basic.py

.buildkite/run-tpu-test.sh

Lines changed: 10 additions & 1 deletion
@@ -14,4 +14,13 @@ remove_docker_container
 # For HF_TOKEN.
 source /etc/environment
 # Run a simple end-to-end example.
-docker run --privileged --net host --shm-size=16G -it -e "HF_TOKEN=$HF_TOKEN" --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 -m pip install pytest && python3 -m pip install lm_eval[api]==0.4.4 && pytest -v -s /workspace/vllm/tests/entrypoints/openai/test_accuracy.py && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference_tpu.py"
+docker run --privileged --net host --shm-size=16G -it \
+    -e "HF_TOKEN=$HF_TOKEN" --name tpu-test \
+    vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git \
+    && python3 -m pip install pytest \
+    && python3 -m pip install lm_eval[api]==0.4.4 \
+    && pytest -v -s /workspace/vllm/tests/entrypoints/openai/test_accuracy.py \
+    && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py \
+    && python3 /workspace/vllm/tests/tpu/test_compilation.py \
+    && python3 /workspace/vllm/tests/tpu/test_quantization_accuracy.py \
+    && python3 /workspace/vllm/examples/offline_inference/tpu.py"
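The reformatted `bash -c` command above chains every step with `&&`, so the container exits non-zero at the first failing step instead of silently continuing. The same chain can also be assembled from an array, which keeps long step lists diff-friendly (a sketch with made-up step names, not the repo's code):

```shell
#!/usr/bin/env bash
# Build an '&&'-joined command string from an array of steps.
steps=(
  "python3 -m pip install pytest"
  "pytest -v -s tests/example_test.py"
)
joined=$(printf ' && %s' "${steps[@]}")
joined=${joined# && }   # drop the leading separator

# The result would then be handed to the container shell, e.g.:
#   docker run ... vllm-tpu /bin/bash -c "$joined"
echo "$joined"
```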

.buildkite/run-xpu-test.sh

Lines changed: 2 additions & 2 deletions
@@ -14,6 +14,6 @@ remove_docker_container
 
 # Run the image and test offline inference/tensor parallel
 docker run --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test sh -c '
-    python3 examples/offline_inference.py
-    python3 examples/offline_inference_cli.py -tp 2
+    python3 examples/offline_inference/basic.py
+    python3 examples/offline_inference/cli.py -tp 2
 '

.buildkite/test-pipeline.yaml

Lines changed: 22 additions & 16 deletions
@@ -38,7 +38,7 @@ steps:
   - pip install -r requirements-docs.txt
   - SPHINXOPTS=\"-W\" make html
   # Check API reference (if it fails, you may have missing mock imports)
-  - grep \"sig sig-object py\" build/html/dev/sampling_params.html
+  - grep \"sig sig-object py\" build/html/api/inference_params.html
 
 - label: Async Engine, Inputs, Utils, Worker Test # 24min
   fast_check: true
@@ -52,6 +52,7 @@ steps:
   - tests/worker
   - tests/standalone_tests/lazy_torch_compile.py
   commands:
+  - pip install git+https://github.com/Isotr0py/DeepSeek-VL2.git # Used by multimoda processing test
   - python3 standalone_tests/lazy_torch_compile.py
   - pytest -v -s mq_llm_engine # MQLLMEngine
   - pytest -v -s async_engine # AsyncLLMEngine
@@ -187,19 +188,19 @@ steps:
   - examples/
   commands:
   - pip install tensorizer # for tensorizer test
-  - python3 offline_inference.py
-  - python3 cpu_offload.py
-  - python3 offline_inference_chat.py
-  - python3 offline_inference_with_prefix.py
-  - python3 llm_engine_example.py
-  - python3 offline_inference_vision_language.py
-  - python3 offline_inference_vision_language_multi_image.py
-  - python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
-  - python3 offline_inference_encoder_decoder.py
-  - python3 offline_inference_classification.py
-  - python3 offline_inference_embedding.py
-  - python3 offline_inference_scoring.py
-  - python3 offline_profile.py --model facebook/opt-125m run_num_steps --num-steps 2
+  - python3 offline_inference/basic.py
+  - python3 offline_inference/cpu_offload.py
+  - python3 offline_inference/chat.py
+  - python3 offline_inference/prefix_caching.py
+  - python3 offline_inference/llm_engine_example.py
+  - python3 offline_inference/vision_language.py
+  - python3 offline_inference/vision_language_multi_image.py
+  - python3 other/tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 other/tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
+  - python3 offline_inference/encoder_decoder.py
+  - python3 offline_inference/classification.py
+  - python3 offline_inference/embedding.py
+  - python3 offline_inference/scoring.py
+  - python3 offline_inference/profiling.py --model facebook/opt-125m run_num_steps --num-steps 2
 
 - label: Prefix Caching Test # 9min
   mirror_hardwares: [amd]
@@ -214,6 +215,7 @@ steps:
   - vllm/model_executor/layers
   - vllm/sampling_metadata.py
   - tests/samplers
+  - tests/conftest.py
   commands:
   - pytest -v -s samplers
   - VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
@@ -229,20 +231,22 @@ steps:
   - pytest -v -s test_logits_processor.py
   - pytest -v -s model_executor/test_guided_processors.py
 
-- label: Speculative decoding tests # 30min
+- label: Speculative decoding tests # 40min
   source_file_dependencies:
   - vllm/spec_decode
   - tests/spec_decode
+  - vllm/model_executor/models/eagle.py
   commands:
   - pytest -v -s spec_decode/e2e/test_multistep_correctness.py
   - VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py
+  - pytest -v -s spec_decode/e2e/test_eagle_correctness.py
 
 - label: LoRA Test %N # 15min each
   mirror_hardwares: [amd]
   source_file_dependencies:
   - vllm/lora
   - tests/lora
-  command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py
+  command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_minicpmv_tp.py
   parallelism: 4
 
 - label: "PyTorch Fullgraph Smoke Test" # 9min
@@ -367,6 +371,7 @@ steps:
   - tests/models/encoder_decoder/vision_language
   commands:
   - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
+  - pytest -v -s models/multimodal
   - pytest -v -s models/decoder_only/audio_language -m 'core_model or quant_model'
   - pytest -v -s --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'core_model or quant_model'
   - pytest -v -s models/embedding/vision_language -m core_model
@@ -535,6 +540,7 @@ steps:
   # requires multi-GPU testing for validation.
   - pytest -v -s -x lora/test_chatglm3_tp.py
   - pytest -v -s -x lora/test_llama_tp.py
+  - pytest -v -s -x lora/test_minicpmv_tp.py
 
 
- label: Weight Loading Multiple GPU Test # 33min
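
The LoRA step in the pipeline above fans its pytest suite out across parallel Buildkite jobs via `--shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT`. A minimal sketch of that kind of deterministic sharding (the function name is illustrative, not the actual pytest-shard implementation):

```python
# Illustrative sketch of deterministic test sharding, in the spirit of the
# --shard-id / --num-shards flags used by the LoRA CI step. Each shard takes
# every num_shards-th test from a stable ordering, so the parallel jobs
# partition the suite without coordinating with each other.
def select_shard(tests, shard_id, num_shards):
    """Return the subset of tests assigned to this shard."""
    ordered = sorted(tests)  # stable order so every job agrees on the split
    return [t for i, t in enumerate(ordered) if i % num_shards == shard_id]

if __name__ == "__main__":
    tests = ["test_a", "test_b", "test_c", "test_d", "test_e"]
    shards = [select_shard(tests, s, 4) for s in range(4)]
    # Every test runs exactly once across the 4 parallel jobs.
    assert sorted(t for shard in shards for t in shard) == sorted(tests)
    print(shards)
```

With `parallelism: 4`, Buildkite sets `BUILDKITE_PARALLEL_JOB` to 0–3 in four concurrent jobs, so each job runs a disjoint quarter of the suite.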

.github/workflows/sphinx-lint.yml renamed to .github/workflows/doc-lint.yml

Lines changed: 2 additions & 2 deletions
@@ -13,7 +13,7 @@ on:
       - "docs/**"
 
 jobs:
-  sphinx-lint:
+  doc-lint:
     runs-on: ubuntu-latest
     strategy:
       matrix:
@@ -29,4 +29,4 @@ jobs:
         python -m pip install --upgrade pip
         pip install -r requirements-lint.txt
     - name: Linting docs
-      run: tools/sphinx-lint.sh
+      run: tools/doc-lint.sh

.gitignore

Lines changed: 1 addition & 4 deletions
@@ -79,10 +79,7 @@ instance/
 
 # Sphinx documentation
 docs/_build/
-docs/source/getting_started/examples/*.rst
-!**/*.template.rst
-docs/source/getting_started/examples/*.md
-!**/*.template.md
+docs/source/getting_started/examples/
 
 # PyBuilder
 .pybuilder/

Dockerfile

Lines changed: 3 additions & 3 deletions
@@ -2,8 +2,8 @@
 # to run the OpenAI compatible server.
 
 # Please update any changes made here to
-# docs/source/dev/dockerfile/dockerfile.md and
-# docs/source/assets/dev/dockerfile-stages-dependency.png
+# docs/source/contributing/dockerfile/dockerfile.md and
+# docs/source/assets/contributing/dockerfile-stages-dependency.png
 
 ARG CUDA_VERSION=12.4.1
 #################### BASE BUILD IMAGE ####################
@@ -250,7 +250,7 @@ ENV VLLM_USAGE_SOURCE production-docker-image
 # define sagemaker first, so it is not default from `docker build`
 FROM vllm-openai-base AS vllm-sagemaker
 
-COPY examples/sagemaker-entrypoint.sh .
+COPY examples/online_serving/sagemaker-entrypoint.sh .
 RUN chmod +x sagemaker-entrypoint.sh
 ENTRYPOINT ["./sagemaker-entrypoint.sh"]

Dockerfile.neuron

Lines changed: 6 additions & 2 deletions
@@ -15,8 +15,8 @@ RUN apt-get update && \
     ffmpeg libsm6 libxext6 libgl1
 
 ### Mount Point ###
-# When launching the container, mount the code directory to /app
-ARG APP_MOUNT=/app
+# When launching the container, mount the code directory to /workspace
+ARG APP_MOUNT=/workspace
 VOLUME [ ${APP_MOUNT} ]
 WORKDIR ${APP_MOUNT}/vllm
 
@@ -25,6 +25,7 @@ RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
 RUN python3 -m pip install sentencepiece transformers==4.45.2 -U
 RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
 RUN python3 -m pip install neuronx-cc==2.16.345.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
+RUN python3 -m pip install pytest
 
 COPY . .
 ARG GIT_REPO_CHECK=0
@@ -42,4 +43,7 @@ RUN --mount=type=bind,source=.git,target=.git \
 # install development dependencies (for testing)
 RUN python3 -m pip install -e tests/vllm_test_utils
 
+# overwrite entrypoint to run bash script
+RUN echo "import subprocess; import sys; subprocess.check_call(sys.argv[1:])" > /usr/local/bin/dockerd-entrypoint.py
+
 CMD ["/bin/bash"]

Dockerfile.openvino

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ ARG GIT_REPO_CHECK=0
 RUN --mount=type=bind,source=.git,target=.git \
     if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
 
+RUN python3 -m pip install -U pip
 # install build requirements
 RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/requirements-build.txt
 # build vLLM with OpenVINO backend
