[Sampler] Vectorized sampler #820
Conversation
Updated benchmark to also apply top_p.
1.31x improvement.
I have also compared argmax outputs of the benchmark on the ShareGPT dataset, and they are identical between master and this PR (meaning that there is no impact on correctness).
Using the modified benchmark and Llama-2-7B on an A100-80GB:
1.47x improvement. As expected, the improvement is bigger the smaller the model (since relatively more time is spent in sampling).
Benchmarked Llama-70B with 4x A100-40GB, 512 context length, 128 generated tokens, 500 requests.
@Yard1, thanks for submitting the PR. Before I go deeper into the PR, could you please update the branch with the latest commit?
@@ -78,15 +73,19 @@ def run_vllm(
)

# Add the requests to the engine.
do_sample = False
A dumb question: What is this variable used for?
It allows us to alternate between greedy and stochastic sampling across requests; see the sketch below.
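For illustration, here is a minimal sketch of how such a toggle could drive per-request sampling parameters in the benchmark script. This is not the PR's exact code; the request tuples, model-agnostic parameter values, and the `requests` list are placeholders, and only the public `SamplingParams` API is assumed.

```python
from vllm import SamplingParams  # assumes the vLLM Python API

# Placeholder request list: (prompt, prompt_len, output_len) tuples,
# similar to what the dataset sampling in benchmark_throughput.py produces.
requests = [("Hello, my name is", 5, 16), ("The capital of France is", 6, 16)]

do_sample = False
sampling_params_per_request = []
for _prompt, _prompt_len, output_len in requests:
    sampling_params_per_request.append(
        SamplingParams(
            n=1,
            # temperature=0.0 means greedy decoding; 1.0 samples stochastically.
            temperature=1.0 if do_sample else 0.0,
            top_p=1.0,
            max_tokens=output_len,
        ))
    do_sample = not do_sample  # alternate greedy / stochastic per request
```

The alternation ensures that a single benchmark run exercises both the greedy and the stochastic code paths of the sampler.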
outputs = llm._run_engine(use_tqdm=True)
end = time.time()
with open("output.txt", "w") as f:
    for output in outputs:
        f.write(output.__repr__() + "\n")
I guess this is to check if vLLM generated valid outputs, right? If so, I believe this ought to be done by our test code, rather than in the benchmarking code.
Yeah, I will revert the changes to the benchmarking script later.
Also, consider adding a test?
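For reference, a rough sketch of what such a correctness test could look like, assuming the public `LLM`/`SamplingParams` API; the model, prompts, and the `load_reference_outputs` helper are hypothetical stand-ins rather than anything from this PR.

```python
from vllm import LLM, SamplingParams

def test_greedy_sampling_matches_reference():
    # Greedy decoding (temperature=0) is deterministic, so the vectorized
    # sampler can be compared token-for-token against outputs from master.
    llm = LLM(model="facebook/opt-125m")
    prompts = ["Hello, my name is", "The capital of France is"]
    greedy = SamplingParams(temperature=0.0, max_tokens=32)

    outputs = llm.generate(prompts, greedy)
    texts = [output.outputs[0].text for output in outputs]

    # Hypothetical helper: loads texts previously generated on master.
    reference_texts = load_reference_outputs()
    assert texts == reference_texts
```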
lgtm
@Yard1 I got an error when running:
@WoosukKwon Good catch! Just pushed a fix.
@Yard1 Thanks for the PR. It seems the performance improvement is quite significant! However, I'm a bit worried that this PR 1) includes unnecessary changes, and 2) complicates the sampler logic.
For 1), could you revert the unnecessary changes? I pointed out a few places in the code.
For 2), do you have any idea how to simplify the code? I think the complexity mainly comes from the many tensor/list indexing operations and sort-gathers. Can you somehow reduce their use?
p = torch.tensor(top_ps,
                 dtype=logits.dtype,
                 device=logits.device)
k = torch.tensor(top_ks, dtype=torch.int, device=logits.device)
_apply_top_p_top_k_in_place(non_greedy_logits, p, k)
What is the benefit of creating p and k outside the function? It seems the change has no effect.
Right now it's necessary for the p's and k's to be indexed correctly, i.e. each row of p and k has to line up with the corresponding row of the logits.
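For context, here is an illustrative batched top-p/top-k masking function; it is not the PR's `_apply_top_p_top_k_in_place` (in particular, it returns a new tensor rather than masking in place), but it shows why row i of p and k must correspond to row i of the logits when a single per-batch sort is reused for both filters.

```python
import torch

def apply_top_p_top_k(logits: torch.Tensor,
                      p: torch.Tensor,
                      k: torch.Tensor) -> torch.Tensor:
    """Illustrative batched top-p/top-k masking (sketch, not the PR's code).

    logits: [num_seqs, vocab_size]; p, k: [num_seqs]. Row i of p/k must
    correspond to row i of logits, which is why the caller builds the
    tensors where that correspondence is known.
    """
    probs = logits.softmax(dim=-1)
    probs_sorted, idx_sorted = probs.sort(dim=-1, descending=True)

    # Top-p: drop a token once the cumulative probability *before* it
    # already exceeds p (this always keeps at least the top token).
    cum_before = probs_sorted.cumsum(dim=-1) - probs_sorted
    top_p_mask = cum_before > p.unsqueeze(-1)

    # Top-k: drop everything past the k-th sorted token in each row.
    ranks = torch.arange(logits.size(-1), device=logits.device)
    top_k_mask = ranks.unsqueeze(0) >= k.unsqueeze(-1)

    # Mask in sorted order, then scatter back to the original token order.
    logits_sorted = logits.gather(-1, idx_sorted)
    logits_sorted[top_p_mask | top_k_mask] = float("-inf")
    return torch.full_like(logits, float("-inf")).scatter(-1, idx_sorted, logits_sorted)
```

Because the same sorted indices are reused for both filters and for the scatter back, any mismatch between the row order of p/k and the row order of the logits would silently apply the wrong thresholds to the wrong sequences.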
@WoosukKwon Thanks for the feedback; let me see if I can reduce the unnecessary changes. As for the complexity, the current code is optimized for memory and speed, which sacrifices simplicity. I am not sure it's possible to reduce the complexity without impacting performance, but I am happy to add more comments and restructure the code for readability.
Closed in favor of #1048.
Ready for initial review.
Using the modified throughput benchmark with
python benchmarks/benchmark_throughput.py --backend vllm --dataset "./ShareGPT_V3_unfiltered_cleaned_split.json" --model meta-llama/Llama-2-13b-chat-hf --tokenizer meta-llama/Llama-2-13b-chat-hf --num-prompts=1000
and one A100-80GB, I get results amounting to a 1.28x improvement.
The argmax outputs between this PR and master match. Random sampling is different, but that is expected (the text still makes sense and doesn't differ much from master).