
[Usage]: Adaptive Batching and number of concurrent requests #10269

Closed
1 task done
Leon-Sander opened this issue Nov 12, 2024 · 4 comments
Labels
usage How to use vllm

Comments

Leon-Sander commented Nov 12, 2024

Your current environment

I am using the async engine, or Docker with the OpenAI completions endpoint.

How would you like to use vllm

Let's say I set max num batched tokens to 50k. It tells me I can run 15 concurrent requests. As far as I understand adaptive batching, when one request finishes computing, the batch is refilled with the next request, ensuring consistent throughput.

So can I send 1000 requests at once, and it would continuously process around 15 requests at a time by refilling the batch?
Or should I only send 15 requests at a time and wait until they are processed before sending the next 15?

Is there some limit beyond which the system would be overwhelmed?

Edit:
When sending a certain number of requests at once, the engine kept crashing, which confused me and prompted this post. It turns out the engine crashed because of a timeout, as mentioned in #10002.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Leon-Sander added the usage (How to use vllm) label Nov 12, 2024
mgoin (Member) commented Nov 12, 2024

In most cases, you should simply submit all of your requests at once, and the scheduler in vLLM will do its best to batch together as many requests as possible based on the available KV cache. The "Maximum concurrency for 32k tokens per request: 15.1x" message describes the worst case, where every request uses the full context length of the model. In practical use you can often get much higher batching.
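For reference, here is a rough back-of-the-envelope sketch of where a log line like "Maximum concurrency for 32k tokens per request: 15.1x" comes from: the worst case divides the total KV-cache token capacity by the full model context length. The block size and block count below are made-up illustrative numbers, not values from this issue.

```python
# Illustrative numbers only (assumptions, not taken from this issue).
block_size = 16            # tokens stored per KV-cache block
num_gpu_blocks = 30_966    # KV-cache blocks that fit in GPU memory
max_model_len = 32_768     # worst case: every request uses the full context

kv_cache_tokens = num_gpu_blocks * block_size             # ~495k tokens of KV cache
worst_case_concurrency = kv_cache_tokens / max_model_len  # ~15.1
print(f"Maximum concurrency for {max_model_len} tokens per request: "
      f"{worst_case_concurrency:.1f}x")

# Typical requests use far fewer tokens than max_model_len, so in practice
# the scheduler can batch many more requests than this estimate suggests.
```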

The one exception is that sending tens of thousands of requests at once may overload the server if it can only finish them slowly. This shouldn't cause a crash, though.
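As a concrete illustration of "send everything at once, but don't drown the server", here is a minimal client-side sketch that caps the number of in-flight requests with a semaphore. It assumes a vLLM OpenAI-compatible server at http://localhost:8000/v1 and a placeholder model name; both are assumptions, not values from this issue.

```python
import asyncio
from openai import AsyncOpenAI

# Assumptions: a vLLM OpenAI-compatible server at this base URL and
# "my-model" as the served model name (placeholders, not from this issue).
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Client-side cap on in-flight requests; the vLLM scheduler still decides
# how many of these it actually batches together based on free KV cache.
MAX_IN_FLIGHT = 64
sem = asyncio.Semaphore(MAX_IN_FLIGHT)

async def one_request(prompt: str) -> str:
    async with sem:
        resp = await client.completions.create(
            model="my-model",
            prompt=prompt,
            max_tokens=128,
        )
        return resp.choices[0].text

async def main() -> None:
    prompts = [f"Prompt {i}" for i in range(1000)]
    results = await asyncio.gather(*(one_request(p) for p in prompts))
    print(f"received {len(results)} completions")

asyncio.run(main())
```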

Leon-Sander (Author) commented:

Thanks. If it does not crash, what would the result of overloading be, and is there a way to calculate the point at which overloading starts?

Leon-Sander (Author) commented:

Closing, since the initial question was answered. It would still be nice if anyone has an idea regarding the overloading.

mgoin (Member) commented Nov 13, 2024

Overloading without crashing would just mean potentially poor generation performance from having many requests in flight. This is less of an issue now that we have separated the server from the scheduler and the scheduler from the engine with multiprocessing in >=0.6.
