
Commit 9ba3dbb

hmellor authored and rasmith committed
[Doc] Move examples into categories (vllm-project#11840)
Signed-off-by: Harry Mellor <[email protected]>
1 parent 106d379 commit 9ba3dbb
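
The diffs below all follow from one reorganization: the flat `examples/` directory is split into `offline_inference/`, `online_serving/`, and `other/` category folders, and every reference to a moved file is updated. As a quick orientation, here is a hedged sketch of the relocation, using only paths that appear in the hunks below (not an exhaustive list of the 116 files touched by this commit):

```python
# Illustrative old-path -> new-path pairs taken from the diffs in this commit.
RELOCATED_EXAMPLES = {
    "examples/offline_inference.py": "examples/offline_inference/offline_inference.py",
    "examples/multilora_inference.py": "examples/offline_inference/multilora_inference.py",
    "examples/llm_engine_example.py": "examples/offline_inference/llm_engine_example.py",
    "examples/openai_completion_client.py": "examples/online_serving/openai_completion_client.py",
    "examples/chart-helm": "examples/online_serving/chart-helm",
    "examples/sagemaker-entrypoint.sh": "examples/online_serving/sagemaker-entrypoint.sh",
    "examples/tensorize_vllm_model.py": "examples/other/tensorize_vllm_model.py",
    "examples/fp8/README.md": "examples/other/fp8/README.md",
}
```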


116 files changed, +153 −124 lines


.buildkite/run-cpu-test.sh

Lines changed: 1 addition & 1 deletion

@@ -30,7 +30,7 @@ function cpu_tests() {
  # offline inference
  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" bash -c "
    set -e
-   python3 examples/offline_inference.py"
+   python3 examples/offline_inference/offline_inference.py"

  # Run basic model test
  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "

.buildkite/run-gh200-test.sh

Lines changed: 1 addition & 1 deletion

@@ -24,5 +24,5 @@ remove_docker_container

  # Run the image and test offline inference
  docker run --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
-   python3 examples/offline_inference.py
+   python3 examples/offline_inference/offline_inference.py
  '

.buildkite/run-hpu-test.sh

Lines changed: 1 addition & 1 deletion

@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
  remove_docker_container

  # Run the image and launch offline inference
- docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference.py
+ docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/offline_inference.py

.buildkite/run-neuron-test.sh

Lines changed: 1 addition & 1 deletion

@@ -51,4 +51,4 @@ docker run --rm -it --device=/dev/neuron0 --device=/dev/neuron1 --network host \
  -e "NEURON_COMPILE_CACHE_URL=${NEURON_COMPILE_CACHE_MOUNT}" \
  --name "${container_name}" \
  ${image_name} \
- /bin/bash -c "python3 /workspace/vllm/examples/offline_inference_neuron.py"
+ /bin/bash -c "python3 /workspace/vllm/examples/offline_inference/offline_inference_neuron.py"

.buildkite/run-openvino-test.sh

Lines changed: 1 addition & 1 deletion

@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
  remove_docker_container

  # Run the image and launch offline inference
- docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference.py
+ docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference/offline_inference.py

.buildkite/run-tpu-test.sh

Lines changed: 1 addition & 1 deletion

@@ -14,4 +14,4 @@ remove_docker_container
  # For HF_TOKEN.
  source /etc/environment
  # Run a simple end-to-end example.
- docker run --privileged --net host --shm-size=16G -it -e "HF_TOKEN=$HF_TOKEN" --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 -m pip install pytest && python3 -m pip install lm_eval[api]==0.4.4 && pytest -v -s /workspace/vllm/tests/entrypoints/openai/test_accuracy.py && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference_tpu.py"
+ docker run --privileged --net host --shm-size=16G -it -e "HF_TOKEN=$HF_TOKEN" --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 -m pip install pytest && python3 -m pip install lm_eval[api]==0.4.4 && pytest -v -s /workspace/vllm/tests/entrypoints/openai/test_accuracy.py && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference/offline_inference_tpu.py"

.buildkite/run-xpu-test.sh

Lines changed: 2 additions & 2 deletions

@@ -14,6 +14,6 @@ remove_docker_container

  # Run the image and test offline inference/tensor parallel
  docker run --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test sh -c '
-   python3 examples/offline_inference.py
-   python3 examples/offline_inference_cli.py -tp 2
+   python3 examples/offline_inference/offline_inference.py
+   python3 examples/offline_inference/offline_inference_cli.py -tp 2
  '

.buildkite/test-pipeline.yaml

Lines changed: 13 additions & 13 deletions

@@ -187,19 +187,19 @@ steps:
    - examples/
  commands:
  - pip install tensorizer # for tensorizer test
- - python3 offline_inference.py
- - python3 cpu_offload.py
- - python3 offline_inference_chat.py
- - python3 offline_inference_with_prefix.py
- - python3 llm_engine_example.py
- - python3 offline_inference_vision_language.py
- - python3 offline_inference_vision_language_multi_image.py
- - python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- - python3 offline_inference_encoder_decoder.py
- - python3 offline_inference_classification.py
- - python3 offline_inference_embedding.py
- - python3 offline_inference_scoring.py
- - python3 offline_profile.py --model facebook/opt-125m run_num_steps --num-steps 2
+ - python3 offline_inference/offline_inference.py
+ - python3 offline_inference/cpu_offload.py
+ - python3 offline_inference/offline_inference_chat.py
+ - python3 offline_inference/offline_inference_with_prefix.py
+ - python3 offline_inference/llm_engine_example.py
+ - python3 offline_inference/offline_inference_vision_language.py
+ - python3 offline_inference/offline_inference_vision_language_multi_image.py
+ - python3 other/tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 other/tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
+ - python3 offline_inference/offline_inference_encoder_decoder.py
+ - python3 offline_inference/offline_inference_classification.py
+ - python3 offline_inference/offline_inference_embedding.py
+ - python3 offline_inference/offline_inference_scoring.py
+ - python3 offline_inference/offline_profile.py --model facebook/opt-125m run_num_steps --num-steps 2

  - label: Prefix Caching Test # 9min
    mirror_hardwares: [amd]

.github/workflows/lint-and-deploy.yaml

Lines changed: 2 additions & 2 deletions

@@ -27,7 +27,7 @@ jobs:
        version: v3.10.1

      - name: Run chart-testing (lint)
-       run: ct lint --target-branch ${{ github.event.repository.default_branch }} --chart-dirs examples/chart-helm --charts examples/chart-helm
+       run: ct lint --target-branch ${{ github.event.repository.default_branch }} --chart-dirs examples/online_serving/chart-helm --charts examples/online_serving/chart-helm

      - name: Setup minio
        run: |

@@ -64,7 +64,7 @@ jobs:
        run: |
          export AWS_ACCESS_KEY_ID=minioadmin
          export AWS_SECRET_ACCESS_KEY=minioadmin
-         helm install --wait --wait-for-jobs --timeout 5m0s --debug --create-namespace --namespace=ns-vllm test-vllm examples/chart-helm -f examples/chart-helm/values.yaml --set secrets.s3endpoint=http://minio:9000 --set secrets.s3bucketname=testbucket --set secrets.s3accesskeyid=$AWS_ACCESS_KEY_ID --set secrets.s3accesskey=$AWS_SECRET_ACCESS_KEY --set resources.requests.cpu=1 --set resources.requests.memory=4Gi --set resources.limits.cpu=2 --set resources.limits.memory=5Gi --set image.env[0].name=VLLM_CPU_KVCACHE_SPACE --set image.env[1].name=VLLM_LOGGING_LEVEL --set-string image.env[0].value="1" --set-string image.env[1].value="DEBUG" --set-string extraInit.s3modelpath="opt-125m/" --set-string 'resources.limits.nvidia\.com/gpu=0' --set-string 'resources.requests.nvidia\.com/gpu=0' --set-string image.repository="vllm-cpu-env"
+         helm install --wait --wait-for-jobs --timeout 5m0s --debug --create-namespace --namespace=ns-vllm test-vllm examples/online_serving/chart-helm -f examples/online_serving/chart-helm/values.yaml --set secrets.s3endpoint=http://minio:9000 --set secrets.s3bucketname=testbucket --set secrets.s3accesskeyid=$AWS_ACCESS_KEY_ID --set secrets.s3accesskey=$AWS_SECRET_ACCESS_KEY --set resources.requests.cpu=1 --set resources.requests.memory=4Gi --set resources.limits.cpu=2 --set resources.limits.memory=5Gi --set image.env[0].name=VLLM_CPU_KVCACHE_SPACE --set image.env[1].name=VLLM_LOGGING_LEVEL --set-string image.env[0].value="1" --set-string image.env[1].value="DEBUG" --set-string extraInit.s3modelpath="opt-125m/" --set-string 'resources.limits.nvidia\.com/gpu=0' --set-string 'resources.requests.nvidia\.com/gpu=0' --set-string image.repository="vllm-cpu-env"

      - name: curl test
        run: |

Dockerfile

Lines changed: 1 addition & 1 deletion

@@ -250,7 +250,7 @@ ENV VLLM_USAGE_SOURCE production-docker-image
  # define sagemaker first, so it is not default from `docker build`
  FROM vllm-openai-base AS vllm-sagemaker

- COPY examples/sagemaker-entrypoint.sh .
+ COPY examples/online_serving/sagemaker-entrypoint.sh .
  RUN chmod +x sagemaker-entrypoint.sh
  ENTRYPOINT ["./sagemaker-entrypoint.sh"]

docs/source/contributing/profiling/profiling_index.md

Lines changed: 1 addition & 1 deletion

@@ -26,7 +26,7 @@ Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the serve

  ### Offline Inference

- Refer to <gh-file:examples/offline_inference_with_profiler.py> for an example.
+ Refer to <gh-file:examples/offline_inference/offline_inference_with_profiler.py> for an example.

  ### OpenAI Server

docs/source/deployment/frameworks/skypilot.md

Lines changed: 2 additions & 2 deletions

@@ -61,7 +61,7 @@ run: |

    echo 'Starting gradio server...'
    git clone https://github.com/vllm-project/vllm.git || true
-   python vllm/examples/gradio_openai_chatbot_webserver.py \
+   python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
      -m $MODEL_NAME \
      --port 8811 \
      --model-url http://localhost:8081/v1 \

@@ -321,7 +321,7 @@ run: |

    echo 'Starting gradio server...'
    git clone https://github.com/vllm-project/vllm.git || true
-   python vllm/examples/gradio_openai_chatbot_webserver.py \
+   python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
      -m $MODEL_NAME \
      --port 8811 \
      --model-url http://$ENDPOINT/v1 \

docs/source/features/disagg_prefill.md

Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@ Disaggregated prefill DOES NOT improve throughput.

  ## Usage example

- Please refer to `examples/disaggregated_prefill.sh` for the example usage of disaggregated prefilling.
+ Please refer to `examples/online_serving/disaggregated_prefill.sh` for the example usage of disaggregated prefilling.

  ## Benchmarks

docs/source/features/lora.md

Lines changed: 1 addition & 1 deletion

@@ -47,7 +47,7 @@ outputs = llm.generate(
  )
  ```

- Check out <gh-file:examples/multilora_inference.py> for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
+ Check out <gh-file:examples/offline_inference/multilora_inference.py> for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.

  ## Serving LoRA Adapters

docs/source/features/quantization/auto_awq.md

Lines changed: 1 addition & 1 deletion

@@ -47,7 +47,7 @@ print(f'Model is quantized and saved at "{quant_path}"')
  To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:

  ```console
- $ python examples/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
+ $ python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
  ```

  AWQ models are also supported directly through the LLM entrypoint:

docs/source/features/quantization/fp8_e4m3_kvcache.md

Lines changed: 1 addition & 1 deletion

@@ -28,7 +28,7 @@ Here is an example of how to enable this feature:

  ```python
  # two float8_e4m3fn kv cache scaling factor files are provided under tests/fp8_kv, please refer to
- # https://github.com/vllm-project/vllm/blob/main/examples/fp8/README.md to generate kv_cache_scales.json of your own.
+ # https://github.com/vllm-project/vllm/blob/main/examples/other/fp8/README.md to generate kv_cache_scales.json of your own.

  from vllm import LLM, SamplingParams
  sampling_params = SamplingParams(temperature=1.3, top_p=0.8)

docs/source/features/structured_outputs.md

Lines changed: 2 additions & 2 deletions

@@ -131,7 +131,7 @@ completion = client.chat.completions.create(
  print(completion.choices[0].message.content)
  ```

- Full example: <gh-file:examples/openai_chat_completion_structured_outputs.py>
+ Full example: <gh-file:examples/online_serving/openai_chat_completion_structured_outputs.py>

  ## Experimental Automatic Parsing (OpenAI API)

@@ -257,4 +257,4 @@ outputs = llm.generate(
  print(outputs[0].outputs[0].text)
  ```

- Full example: <gh-file:examples/offline_inference_structured_outputs.py>
+ Full example: <gh-file:examples/offline_inference/offline_inference_structured_outputs.py>

docs/source/generate_examples.py

Lines changed: 25 additions & 20 deletions

@@ -12,6 +12,7 @@
  def fix_case(text: str) -> str:
      subs = {
          "api": "API",
+         "Cli": "CLI",
          "cpu": "CPU",
          "llm": "LLM",
          "tpu": "TPU",

@@ -58,7 +59,7 @@ def generate(self) -> str:
      content = f"# {self.title}\n\n{self.description}\n\n"
      content += "```{toctree}\n"
      content += f":caption: {self.caption}\n:maxdepth: {self.maxdepth}\n"
-     content += "\n".join(sorted(self.documents)) + "\n```\n"
+     content += "\n".join(self.documents) + "\n```\n"
      return content


@@ -131,11 +132,14 @@ def generate(self) -> str:
                               ROOT_DIR)

      content = f"Source <gh-file:{self.path.relative_to(ROOT_DIR)}>.\n\n"
-     if self.main_file.suffix == ".py":
-         content += f"# {self.title}\n\n"
      include = "include" if self.main_file.suffix == ".md" else \
          "literalinclude"
-     content += f":::{{{include}}} {make_relative(self.main_file)}\n:::\n\n"
+     if include == "literalinclude":
+         content += f"# {self.title}\n\n"
+     content += f":::{{{include}}} {make_relative(self.main_file)}\n"
+     if include == "literalinclude":
+         content += f":language: {self.main_file.suffix[1:]}\n"
+     content += ":::\n\n"

      if not self.other_files:
          return content

@@ -163,14 +167,16 @@ def generate_examples():
          description=
          "A collection of examples demonstrating usage of vLLM.\nAll documented examples are autogenerated using <gh-file:docs/source/generate_examples.py> from examples found in <gh-file:examples>.", # noqa: E501
          caption="Examples",
-         maxdepth=1) # TODO change to 2 when examples start being categorised
+         maxdepth=2)
+     # Category indices stored in reverse order because they are inserted into
+     # examples_index.documents at index 0 in order
      category_indices = {
-         "offline_inference":
+         "other":
          Index(
-             path=EXAMPLE_DOC_DIR / "examples_offline_inference_index.md",
-             title="Offline Inference",
+             path=EXAMPLE_DOC_DIR / "examples_other_index.md",
+             title="Other",
              description=
-             "Offline inference examples demonstrate how to use vLLM in an offline setting, where the model is queried for predictions in batches.", # noqa: E501
+             "Other examples that don't strongly fit into the online or offline serving categories.", # noqa: E501
              caption="Examples",
          ),
          "online_serving":

@@ -181,31 +187,30 @@ def generate_examples():
              "Online serving examples demonstrate how to use vLLM in an online setting, where the model is queried for predictions in real-time.", # noqa: E501
              caption="Examples",
          ),
-         "other":
+         "offline_inference":
          Index(
-             path=EXAMPLE_DOC_DIR / "examples_other_index.md",
-             title="Other",
+             path=EXAMPLE_DOC_DIR / "examples_offline_inference_index.md",
+             title="Offline Inference",
              description=
-             "Other examples that don't strongly fit into the online or offline serving categories.", # noqa: E501
+             "Offline inference examples demonstrate how to use vLLM in an offline setting, where the model is queried for predictions in batches.", # noqa: E501
              caption="Examples",
          ),
      }

      examples = []
+     glob_patterns = ["*.py", "*.md", "*.sh"]
      # Find categorised examples
      for category in category_indices:
          category_dir = EXAMPLE_DIR / category
-         py = category_dir.glob("*.py")
-         md = category_dir.glob("*.md")
-         for path in itertools.chain(py, md):
+         globs = [category_dir.glob(pattern) for pattern in glob_patterns]
+         for path in itertools.chain(*globs):
              examples.append(Example(path, category))
          # Find examples in subdirectories
          for path in category_dir.glob("*/*.md"):
              examples.append(Example(path.parent, category))
      # Find uncategorised examples
-     py = EXAMPLE_DIR.glob("*.py")
-     md = EXAMPLE_DIR.glob("*.md")
-     for path in itertools.chain(py, md):
+     globs = [EXAMPLE_DIR.glob(pattern) for pattern in glob_patterns]
+     for path in itertools.chain(*globs):
          examples.append(Example(path))
      # Find examples in subdirectories
      for path in EXAMPLE_DIR.glob("*/*.md"):

@@ -215,7 +220,7 @@ def generate_examples():
          examples.append(Example(path.parent))

      # Generate the example documentation
-     for example in examples:
+     for example in sorted(examples, key=lambda e: e.path.stem):
          doc_path = EXAMPLE_DOC_DIR / f"{example.path.stem}.md"
          with open(doc_path, "w+") as f:
              f.write(example.generate())
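
To make the effect of the `generate_examples.py` changes above easier to follow, here is a minimal standalone sketch (simplified, hypothetical names) of the new discovery and ordering behaviour: shell scripts are now globbed alongside Python and Markdown examples, and documents are sorted once by file stem at generation time rather than inside each index.

```python
# Minimal sketch of the discovery logic after this change (simplified; the real
# script also builds Example/Index objects and handles sub-directory examples).
import itertools
from pathlib import Path

EXAMPLE_DIR = Path("examples")              # assumed repository layout
GLOB_PATTERNS = ["*.py", "*.md", "*.sh"]    # *.sh examples are now picked up too
CATEGORIES = ["offline_inference", "online_serving", "other"]

def find_examples(directory: Path) -> list[Path]:
    """Collect example entry points matching any of the glob patterns."""
    globs = [directory.glob(pattern) for pattern in GLOB_PATTERNS]
    return list(itertools.chain(*globs))

examples: list[Path] = []
for category in CATEGORIES:
    examples.extend(find_examples(EXAMPLE_DIR / category))
examples.extend(find_examples(EXAMPLE_DIR))  # uncategorised examples

# Ordering now happens here, by file stem, instead of sorting inside Index.generate().
for path in sorted(examples, key=lambda p: p.stem):
    print(path.stem)
```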

docs/source/getting_started/installation/cpu-x86.md

Lines changed: 2 additions & 2 deletions

@@ -95,7 +95,7 @@ $ VLLM_TARGET_DEVICE=cpu python setup.py install
  $ sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
  $ find / -name *libtcmalloc* # find the dynamic link library path
  $ export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
- $ python examples/offline_inference.py # run vLLM
+ $ python examples/offline_inference/offline_inference.py # run vLLM
  ```

  - When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP:

@@ -132,7 +132,7 @@ CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ

  # On this platform, it is recommend to only bind openMP threads on logical CPU cores 0-7 or 8-15
  $ export VLLM_CPU_OMP_THREADS_BIND=0-7
- $ python examples/offline_inference.py
+ $ python examples/offline_inference/offline_inference.py
  ```

  - If using vLLM CPU backend on a multi-socket machine with NUMA, be aware to set CPU cores using `VLLM_CPU_OMP_THREADS_BIND` to avoid cross NUMA node memory access.

docs/source/getting_started/installation/xpu.md

Lines changed: 1 addition & 1 deletion

@@ -71,4 +71,4 @@ $ --pipeline-parallel-size=2 \
  $ -tp=8
  ```

- By default, a ray instance will be launched automatically if no existing one is detected in system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/run_cluster.sh> helper script.
+ By default, a ray instance will be launched automatically if no existing one is detected in system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.

docs/source/getting_started/quickstart.md

Lines changed: 2 additions & 2 deletions

@@ -31,7 +31,7 @@ For non-CUDA platforms, please refer [here](#installation-index) for specific in

  ## Offline Batched Inference

- With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference.py>
+ With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/offline_inference.py>

  The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`:

@@ -133,7 +133,7 @@ completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
  print("Completion result:", completion)
  ```

- A more detailed client example can be found here: <gh-file:examples/openai_completion_client.py>
+ A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py>

  ### OpenAI Chat Completions API with vLLM
