Your current environment
I am using the async engine, or Docker with the OpenAI completions endpoint.
How would you like to use vllm
Let's say I set max_num_batched_tokens to 50k. It tells me I can run 15 concurrent requests. As far as I understand continuous (adaptive) batching, once one request finishes computing, the batch is refilled with the next request, ensuring consistent throughput.
So can I send 1000 requests at once and have it continuously process around 15 requests at a time by refilling the batch?
Or should I only send 15 requests at a time and wait until they are processed before sending the next 15?
Is there a limit beyond which I would overwhelm the system?
Edit:
When sending a certain number of requests at once, the engine kept crashing, which confused me and prompted this post. It turns out the engine crashed because of a timeout, as mentioned in #10002.
In most cases, you should simply provide all of your requests at once, and the vLLM scheduler will do its best to batch together as many requests as the available KV cache allows. The Maximum concurrency for 32k tokens per request: 15.1x message describes the worst case, where every request uses the model's full context length; in practice you can often get much higher batching.
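For illustration, a minimal client-side sketch of the "send everything at once" approach against the OpenAI-compatible completions endpoint might look like the following. The base URL, model name, prompt set, and request parameters are assumptions, not values from this issue; substitute whatever your server was actually started with.

```python
# Minimal sketch (assumptions: server at http://localhost:8000/v1, model name below
# matches the one the vLLM server was launched with).
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def complete(prompt: str) -> str:
    # Each call is one request; the vLLM scheduler decides how many run per step.
    resp = await client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # assumption: replace with your served model
        prompt=prompt,
        max_tokens=128,
    )
    return resp.choices[0].text

async def main() -> None:
    prompts = [f"Write a haiku about item {i}." for i in range(1000)]
    # Fire everything at once; the server queues whatever it cannot batch immediately.
    results = await asyncio.gather(*(complete(p) for p in prompts))
    print(len(results), "completions received")

asyncio.run(main())
```

As rough arithmetic on the worst-case figure: 15.1 × 32k is on the order of 490k tokens of KV cache, so requests much shorter than the full context length leave room for many more than 15 of them in a batch.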
The one exception is that sending tens of thousands of requests at once may overload the server if it can only finish them slowly. That shouldn't cause a crash, though.
Overloading without crashing would just mean potentially poor generation performance from having many requests in flight. This is less of an issue now that the server is separated from the scheduler, and the scheduler from the engine, via multiprocessing in >=0.6.
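If you do need to push tens of thousands of requests, one way to avoid keeping them all open against the server at once is to cap in-flight requests on the client side, for example with an asyncio.Semaphore. This is only a client-side throttling sketch, not something vLLM requires; the limit of 256 and the model name are arbitrary assumptions.

```python
# Hedged sketch: cap client-side concurrency so a very large prompt set does not
# hold tens of thousands of HTTP requests open against the server simultaneously.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
semaphore = asyncio.Semaphore(256)  # arbitrary assumption; tune for your deployment

async def complete(prompt: str) -> str:
    async with semaphore:  # at most 256 requests in flight at any time
        resp = await client.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",  # assumption: your served model
            prompt=prompt,
            max_tokens=128,
        )
        return resp.choices[0].text

async def main() -> None:
    prompts = [f"Summarize document {i}." for i in range(50_000)]
    results = await asyncio.gather(*(complete(p) for p in prompts))
    print(len(results), "completions received")

asyncio.run(main())
```

The server still performs continuous batching internally; the semaphore only bounds how many requests the client has outstanding at once.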