docs/source/serving/distributed_serving.md (1 addition, 1 deletion)
@@ -22,7 +22,7 @@ There is one edge case: if the model fits in a single node with multiple GPUs, b
vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or Python native multiprocessing. Multiprocessing can be used when deploying on a single node; multi-node inference currently requires Ray.
-Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured {code}`tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the {code}`LLM` class {code}`distributed-executor-backend` argument or {code}`--distributed-executor-backend` API server argument. Set it to {code}`mp` for multiprocessing or {code}`ray` for Ray. It's not required for Ray to be installed for the multiprocessing case.
+Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured {code}`tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the {code}`LLM` class {code}`distributed_executor_backend` argument or {code}`--distributed-executor-backend` API server argument. Set it to {code}`mp` for multiprocessing or {code}`ray` for Ray. It's not required for Ray to be installed for the multiprocessing case.
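For illustration, a minimal sketch of overriding the backend from the offline API (the model name and GPU count are placeholders; the same override is available on the API server via `--distributed-executor-backend`):

```python
from vllm import LLM

# Force the multiprocessing backend instead of letting vLLM choose;
# pass "ray" here to require Ray instead.
llm = LLM(
    model="facebook/opt-13b",
    tensor_parallel_size=2,
    distributed_executor_backend="mp",
)
```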
To run multi-GPU inference with the {code}`LLM` class, set the {code}`tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
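A minimal sketch of what that call looks like (the model name and prompt are placeholders):

```python
from vllm import LLM

# Shard the model across 4 GPUs with tensor parallelism.
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
```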
By default, the vLLM scheduler prioritizes prefills and doesn't batch prefill and decode requests in the same batch.
@@ -49,13 +49,12 @@ This policy has two benefits:
- It improves ITL and generation decode because decode requests are prioritized.
- It helps achieve better GPU utilization by placing compute-bound (prefill) and memory-bound (decode) requests in the same batch.
-You can tune the performance by changing `max_num_batched_tokens`.
-By default, it is set to 512, which has the best ITL on A100 in the initial benchmark (llama 70B and mixtral 8x22B).
+You can tune the performance by changing `max_num_batched_tokens`. By default, it is set to 2048.
Smaller `max_num_batched_tokens` achieves better ITL because there are fewer prefills interrupting decodes.
Higher `max_num_batched_tokens` achieves better TTFT because you can put more prefill tokens in a batch.
- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost equivalent to the default scheduling policy (except that it still prioritizes decodes).
-- Note that the default value (512) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
+- Note that the default value (2048) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
We recommend you set `max_num_batched_tokens > 2048` for throughput.
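As a rough sketch of where this knob lives (assuming the offline `LLM` API with chunked prefill enabled; the model name and value are placeholders):

```python
from vllm import LLM

# Raise max_num_batched_tokens above the 2048 default to trade some ITL
# for better TTFT and throughput.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_chunked_prefill=True,
    max_num_batched_tokens=4096,
)
```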
+# Define the prefix for environment variables to look for
+PREFIX="SM_VLLM_"
+ARG_PREFIX="--"
+
+# Initialize an array for storing the arguments
+# port 8080 required by sagemaker, https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-container-response
+ARGS=(--port 8080)
+
+# Loop through all environment variables
+while IFS='=' read -r key value; do
+    # Remove the prefix from the key, convert to lowercase, and replace underscores with dashes