
Commit fc6d0c2

reidliu41 authored
[Misc] improve docs (#18734)
Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>
1 parent 753944f commit fc6d0c2

2 files changed: +59, -43 lines

examples/offline_inference/neuron_eagle.py

Lines changed: 43 additions & 37 deletions
@@ -15,40 +15,46 @@
     "What is annapurna labs?",
 ]

-# Create a sampling params object.
-sampling_params = SamplingParams(top_k=1, max_tokens=500, ignore_eos=True)
-
-# Create an LLM.
-llm = LLM(
-    model="/home/ubuntu/model_hf/Meta-Llama-3.1-70B-Instruct",
-    speculative_config={
-        "model": "/home/ubuntu/model_hf/Llama-3.1-70B-Instruct-EAGLE-Draft",
-        "num_speculative_tokens": 5,
-        "max_model_len": 2048,
-    },
-    max_num_seqs=4,
-    # The max_model_len and block_size arguments are required to be same as
-    # max sequence length when targeting neuron device.
-    # Currently, this is a known limitation in continuous batching support
-    # in neuronx-distributed-inference.
-    max_model_len=2048,
-    block_size=2048,
-    # The device can be automatically detected when AWS Neuron SDK is installed.
-    # The device argument can be either unspecified for automated detection,
-    # or explicitly assigned.
-    device="neuron",
-    tensor_parallel_size=32,
-    override_neuron_config={
-        "enable_eagle_speculation": True,
-        "enable_fused_speculation": True,
-    },
-)
-
-# Generate texts from the prompts. The output is a list of RequestOutput objects
-# that contain the prompt, generated text, and other information.
-outputs = llm.generate(prompts, sampling_params)
-# Print the outputs.
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, \n\n\n\ Generated text: {generated_text!r}")
+
+
+def main():
+    # Create a sampling params object.
+    sampling_params = SamplingParams(top_k=1, max_tokens=500, ignore_eos=True)
+
+    # Create an LLM.
+    llm = LLM(
+        model="/home/ubuntu/model_hf/Meta-Llama-3.1-70B-Instruct",
+        speculative_config={
+            "model": "/home/ubuntu/model_hf/Llama-3.1-70B-Instruct-EAGLE-Draft",
+            "num_speculative_tokens": 5,
+            "max_model_len": 2048,
+        },
+        max_num_seqs=4,
+        # The max_model_len and block_size arguments are required to be same as
+        # max sequence length when targeting neuron device.
+        # Currently, this is a known limitation in continuous batching support
+        # in neuronx-distributed-inference.
+        max_model_len=2048,
+        block_size=2048,
+        # The device can be automatically detected when AWS Neuron SDK is installed.
+        # The device argument can be either unspecified for automated detection,
+        # or explicitly assigned.
+        device="neuron",
+        tensor_parallel_size=32,
+        override_neuron_config={
+            "enable_eagle_speculation": True,
+            "enable_fused_speculation": True,
+        },
+    )
+
+    # Generate texts from the prompts. The output is a list of RequestOutput objects
+    # that contain the prompt, generated text, and other information.
+    outputs = llm.generate(prompts, sampling_params)
+    # Print the outputs.
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, \n\n\n\ Generated text: {generated_text!r}")
+
+
+if __name__ == "__main__":
+    main()
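The change above only wraps the existing example in a `main()` function behind an `if __name__ == "__main__"` guard, so importing the module no longer kicks off inference; behavior when run as a script is unchanged. As a rough sketch (assuming a host with the AWS Neuron SDK installed and the model checkpoints available at the paths hard-coded in the example), it is still invoked the same way:

```bash
# Run the Neuron EAGLE speculative-decoding example directly.
# Assumes the Meta-Llama-3.1-70B-Instruct target and the EAGLE draft
# checkpoints exist at the paths hard-coded in the script.
python examples/offline_inference/neuron_eagle.py
```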

examples/offline_inference/qwen2_5_omni/README.md

Lines changed: 16 additions & 6 deletions
@@ -6,14 +6,19 @@ This folder provides several example scripts on how to inference Qwen2.5-Omni of

 ```bash
 # Audio + image + video
-python examples/offline_inference/qwen2_5_omni/only_thinker.py -q mixed_modalities
+python examples/offline_inference/qwen2_5_omni/only_thinker.py \
+    -q mixed_modalities

 # Read vision and audio inputs from a single video file
 # NOTE: V1 engine does not support interleaved modalities yet.
-VLLM_USE_V1=0 python examples/offline_inference/qwen2_5_omni/only_thinker.py -q use_audio_in_video
+VLLM_USE_V1=0 \
+python examples/offline_inference/qwen2_5_omni/only_thinker.py \
+    -q use_audio_in_video

 # Multiple audios
-VLLM_USE_V1=0 python examples/offline_inference/qwen2_5_omni/only_thinker.py -q multi_audios
+VLLM_USE_V1=0 \
+python examples/offline_inference/qwen2_5_omni/only_thinker.py \
+    -q multi_audios
 ```

 This script will run the thinker part of Qwen2.5-Omni, and generate text response.

@@ -22,11 +27,16 @@ You can also test Qwen2.5-Omni on a single modality:

 ```bash
 # Process audio inputs
-python examples/offline_inference/audio_language.py --model-type qwen2_5_omni
+python examples/offline_inference/audio_language.py \
+    --model-type qwen2_5_omni

 # Process image inputs
-python examples/offline_inference/vision_language.py --modality image --model-type qwen2_5_omni
+python examples/offline_inference/vision_language.py \
+    --modality image \
+    --model-type qwen2_5_omni

 # Process video inputs
-python examples/offline_inference/vision_language.py --modality video --model-type qwen2_5_omni
+python examples/offline_inference/vision_language.py \
+    --modality video \
+    --model-type qwen2_5_omni
 ```
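One note on the reformatted commands: a trailing backslash only continues the command onto the next line, and an environment assignment such as `VLLM_USE_V1=0 \` on its own line still applies only to the single command it prefixes. The multi-line form is therefore equivalent to the original one-liner, for example:

```bash
# Equivalent single-line form of the command above; VLLM_USE_V1=0
# is scoped to this one invocation either way.
VLLM_USE_V1=0 python examples/offline_inference/qwen2_5_omni/only_thinker.py -q use_audio_in_video
```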
