
[Usage]: Adaptive Batching and number of concurrent requests #10269

Closed
1 task done
Leon-Sander opened this issue Nov 12, 2024 · 4 comments
Labels
usage How to use vllm

Comments

Leon-Sander commented Nov 12, 2024

Your current environment

I am using the async engine, or Docker with the OpenAI completions endpoint.

How would you like to use vllm

Let's say I set max num batched tokens to 50k. It tells me I can run 15 concurrent requests. As far as I understand adaptive batching, when one request finishes computing, the batch is refilled with the next request, ensuring consistent throughput.

So can I send 1000 requests at once, and it would continuously process around 15 requests at a time by refilling the batch?
Or should I only send 15 requests at a time and wait until they are processed before sending the next 15?

Is there some limit beyond which the system would be overwhelmed?

Edit:
When sending a certain number of requests at once, the engine kept crashing, which confused me and prompted this post. It turns out the engine crashed because of a timeout, as mentioned in #10002.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Leon-Sander added the usage (How to use vllm) label Nov 12, 2024
mgoin (Member) commented Nov 12, 2024

In most cases, you should simply submit all of your requests at once, and the scheduler in vLLM will do its best to batch together as many requests as possible based on the available KV cache. The "Maximum concurrency for 32k tokens per request: 15.1x" message describes the worst case, where every request uses the full context length of the model. In practical use you can often get much higher batching.
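For reference, here is a rough back-of-the-envelope sketch of where a log line like "Maximum concurrency for 32k tokens per request: 15.1x" comes from: the worst case divides the total KV-cache token capacity by the full model context length. The block size and block count below are made-up illustrative numbers, not values from this issue.

```python
# Illustrative numbers only (assumptions, not taken from this issue).
block_size = 16            # tokens stored per KV-cache block
num_gpu_blocks = 30_966    # KV-cache blocks that fit in GPU memory
max_model_len = 32_768     # worst case: every request uses the full context

kv_cache_tokens = num_gpu_blocks * block_size             # ~495k tokens of KV cache
worst_case_concurrency = kv_cache_tokens / max_model_len  # ~15.1
print(f"Maximum concurrency for {max_model_len} tokens per request: "
      f"{worst_case_concurrency:.1f}x")

# Typical requests use far fewer tokens than max_model_len, so in practice
# the scheduler can batch many more requests than this estimate suggests.
```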

The one exception is that sending tens of thousands of requests at once may overload the server if it can only finish them slowly. This shouldn't cause a crash, though.
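As a concrete illustration of "send everything at once, but don't drown the server", here is a minimal client-side sketch that caps the number of in-flight requests with a semaphore. It assumes a vLLM OpenAI-compatible server at http://localhost:8000/v1 and a placeholder model name; both are assumptions, not values from this issue.

```python
import asyncio
from openai import AsyncOpenAI

# Assumptions: a vLLM OpenAI-compatible server at this base URL and
# "my-model" as the served model name (placeholders, not from this issue).
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Client-side cap on in-flight requests; the vLLM scheduler still decides
# how many of these it actually batches together based on free KV cache.
MAX_IN_FLIGHT = 64
sem = asyncio.Semaphore(MAX_IN_FLIGHT)

async def one_request(prompt: str) -> str:
    async with sem:
        resp = await client.completions.create(
            model="my-model",
            prompt=prompt,
            max_tokens=128,
        )
        return resp.choices[0].text

async def main() -> None:
    prompts = [f"Prompt {i}" for i in range(1000)]
    results = await asyncio.gather(*(one_request(p) for p in prompts))
    print(f"received {len(results)} completions")

asyncio.run(main())
```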

Leon-Sander (Author) commented:

Thanks. If it does not crash, what would the result of overloading be, and is there a way to calculate the point at which overloading starts?

Leon-Sander (Author) commented:

Closing, since the initial question was answered. It would still be nice if anyone has an idea regarding the overloading.

mgoin (Member) commented Nov 13, 2024

Overloading without crashing would just mean potentially poor generation performance from having many requests in flight. This is less of an issue now that we have separated the server from the scheduler and the scheduler from the engine with multiprocessing in >=0.6.
