Commit 5372efb

Enhancements to LLM Instance Gateway: Scheduling Logic and Documentation Updates (#78)
Squashed commits:

* modify filter for LoRA affinity
* update llm service and llm server pool yaml, readme
* remove unused method from metrics.go
* add flowchart image
* update size of flowchart image
* remove image name
* update queueingThresholdLoRA to 50
* roll back manifest changes
* update filter and scheduler based on comments
* rename filters
* update filter names and comments
* fix readme
* fix comment
* modify flowchart
* add comment to lowLoRACostPredicate explaining when it can be useful
1 parent 83f701b commit 5372efb

File tree

5 files changed: +70 -12 lines changed

docs/schedular-flowchart.png

400 KB

examples/poc/manifests/vllm/vllm-lora-deployment.yaml

+1 -1

```diff
@@ -136,4 +136,4 @@ spec:
         emptyDir:
           medium: Memory
       - name: adapters
-        emptyDir: {}
+        emptyDir: {}
```

pkg/README.md

+15 -4

````diff
@@ -7,7 +7,12 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
 
 1. **Deploy Sample vLLM Application**
 
-   A sample vLLM deployment with the proper protocol to work with LLM Instance Gateway can be found [here](https://github.com/kubernetes-sigs/llm-instance-gateway/blob/6f9869d6595d2d0f8e6febcbec0f348cb44a3012/examples/poc/manifests/samples/vllm-lora-deployment.yaml#L18).
+   A sample vLLM deployment with the proper protocol to work with LLM Instance Gateway can be found [here](https://github.com/kubernetes-sigs/llm-instance-gateway/tree/main/examples/poc/manifests/vllm/vllm-lora-deployment.yaml#L18).
+
+1. **Deploy LLM Service and LLMServerPool**
+
+   You can find a sample LLM service and LLMServerPool configuration, based on the vLLM deployments mentioned above, [here](https://github.com/kubernetes-sigs/llm-instance-gateway/tree/main/examples/poc/manifests/llmservice.yaml).
+
 
 1. **Update Envoy Gateway Config to enable Patch Policy**
 
@@ -32,14 +37,13 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
    kubectl apply -f ./manifests/ext_proc.yaml
    kubectl apply -f ./manifests/patch_policy.yaml
    ```
-   **NOTE**: Ensure the `instance-gateway-ext-proc` deployment is updated with the pod names and internal IP addresses of the vLLM replicas. This step is crucial for the correct routing of requests based on headers. This won't be needed once we make ext proc dynamically read the pods.
 
 1. **Try it out**
 
    Wait until the gateway is ready.
 
    ```bash
-   IP=$(kubectl get gateway/llm-gateway -o jsonpath='{.status.addresses[0].value}')
+   IP=$(kubectl get gateway/instance-gateway -o jsonpath='{.status.addresses[0].value}')
    PORT=8081
 
    curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
@@ -48,4 +52,11 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
      "max_tokens": 100,
      "temperature": 0
    }'
-   ```
+   ```
+
+
+## Scheduling Package in Ext Proc
+The scheduling package implements request scheduling algorithms for load balancing requests across backend pods in an inference gateway. The scheduler ensures efficient resource utilization while maintaining low latency and prioritizing critical requests. It applies a series of filters based on metrics and heuristics to select the best pod for a given request.
+
+# Flowchart
+<img src="../docs/schedular-flowchart.png" alt="Scheduling Algorithm" width="400" />
````
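The new README section describes the scheduler as a chain of filters. As a rough, self-contained Go sketch of that idea (the simplified types here are invented for illustration; the repo's real `filter` in `pkg/ext-proc/scheduling` also has a `nextOnSuccessOrFailure` branch, request context, and error returns), the traversal might look like:

```go
package main

import "fmt"

// PodMetrics is a simplified stand-in for the repo's backend.PodMetrics.
type PodMetrics struct {
	Name             string
	WaitingQueueSize int
}

// filterFunc narrows the candidate pods; an empty result counts as failure.
type filterFunc func(pods []PodMetrics) []PodMetrics

// filter is one node of the decision tree: on success descend into
// nextOnSuccess with the narrowed set, on failure retry the original
// candidates on nextOnFailure.
type filter struct {
	name          string
	filter        filterFunc
	nextOnSuccess *filter
	nextOnFailure *filter
}

// Filter runs this node and recurses down the matching branch.
func (f *filter) Filter(pods []PodMetrics) []PodMetrics {
	filtered := f.filter(pods)
	if len(filtered) > 0 {
		if f.nextOnSuccess != nil {
			return f.nextOnSuccess.Filter(filtered)
		}
		return filtered
	}
	if f.nextOnFailure != nil {
		return f.nextOnFailure.Filter(pods)
	}
	return pods // leaf with no survivors: fall back to all candidates
}

func main() {
	// A single-node chain: keep pods whose queue is below a threshold.
	lowQueue := &filter{
		name: "low queueing",
		filter: func(pods []PodMetrics) []PodMetrics {
			var out []PodMetrics
			for _, p := range pods {
				if p.WaitingQueueSize < 50 { // mirrors queueingThresholdLoRA
					out = append(out, p)
				}
			}
			return out
		},
	}
	pods := []PodMetrics{{Name: "pod-a", WaitingQueueSize: 3}, {Name: "pod-b", WaitingQueueSize: 120}}
	fmt.Println(lowQueue.Filter(pods)) // [{pod-a 3}]
}
```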

pkg/ext-proc/scheduling/filter.go

+18 -1

```diff
@@ -121,6 +121,10 @@ func leastQueuingFilterFunc(req *LLMRequest, pods []*backend.PodMetrics) ([]*bac
 	return filtered, nil
 }
 
+func lowQueueingPodPredicate(_ *LLMRequest, pod *backend.PodMetrics) bool {
+	return pod.WaitingQueueSize < queueingThresholdLoRA
+}
+
 // leastKVCacheFilterFunc finds the max and min KV cache of all pods, divides the whole range
 // (max-min) by the number of pods, and finds the pods that fall into the first range.
 // The intuition is that if there are multiple pods that share similar KV cache in the low range, we
@@ -153,12 +157,25 @@ func leastKVCacheFilterFunc(req *LLMRequest, pods []*backend.PodMetrics) ([]*bac
 type podPredicate func(req *LLMRequest, pod *backend.PodMetrics) bool
 
 // We consider serving an adapter low cost if the adapter is active in the model server, or the
-// model server has room to load the adapter
+// model server has room to load the adapter. The lowLoRACostPredicate ensures weak affinity by spreading the
+// load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to a single pod.
+// This gave good performance in our initial benchmarking results in the scenario where # of LoRA slots > # of LoRA adapters.
 func lowLoRACostPredicate(req *LLMRequest, pod *backend.PodMetrics) bool {
 	_, ok := pod.ActiveModels[req.ResolvedTargetModel]
 	return ok || len(pod.ActiveModels) < pod.MaxActiveModels
 }
 
+// loRAAffinityPredicate is a filter function to check whether a pod has affinity to the requested LoRA adapter.
+func loRAAffinityPredicate(req *LLMRequest, pod *backend.PodMetrics) bool {
+	_, ok := pod.ActiveModels[req.ResolvedTargetModel]
+	return ok
+}
+
+// canAcceptNewLoraPredicate is a filter function to check whether a pod has room to load the adapter.
+func canAcceptNewLoraPredicate(req *LLMRequest, pod *backend.PodMetrics) bool {
+	return len(pod.ActiveModels) < pod.MaxActiveModels
+}
+
 func criticalRequestPredicate(req *LLMRequest, pod *backend.PodMetrics) bool {
 	return req.Critical
 }
```
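Taken together, these predicates form a small algebra: `lowLoRACostPredicate` is exactly the disjunction of `loRAAffinityPredicate` and `canAcceptNewLoraPredicate`. A self-contained sketch makes that relationship explicit (note `ActiveModels` is simplified here to a `map[string]bool`; the actual `backend.PodMetrics` shape may differ):

```go
package main

import "fmt"

// Simplified stand-ins for the repo's types (assumptions for illustration).
type LLMRequest struct{ ResolvedTargetModel string }

type PodMetrics struct {
	ActiveModels    map[string]bool // adapters currently loaded on the pod
	MaxActiveModels int             // adapter slots available on the pod
}

// loRAAffinityPredicate: the requested adapter is already loaded.
func loRAAffinityPredicate(req *LLMRequest, pod *PodMetrics) bool {
	return pod.ActiveModels[req.ResolvedTargetModel]
}

// canAcceptNewLoraPredicate: the pod has a free adapter slot.
func canAcceptNewLoraPredicate(_ *LLMRequest, pod *PodMetrics) bool {
	return len(pod.ActiveModels) < pod.MaxActiveModels
}

// lowLoRACostPredicate: serving is "cheap" if either of the above holds.
func lowLoRACostPredicate(req *LLMRequest, pod *PodMetrics) bool {
	return loRAAffinityPredicate(req, pod) || canAcceptNewLoraPredicate(req, pod)
}

func main() {
	req := &LLMRequest{ResolvedTargetModel: "sql-lora"}
	pod := &PodMetrics{ActiveModels: map[string]bool{"tweet-lora": true}, MaxActiveModels: 2}
	fmt.Println(loRAAffinityPredicate(req, pod))     // false: adapter not loaded yet
	fmt.Println(canAcceptNewLoraPredicate(req, pod)) // true: one slot still free
	fmt.Println(lowLoRACostPredicate(req, pod))      // true: low cost via the free slot
}
```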

pkg/ext-proc/scheduling/scheduler.go

+36 -6

```diff
@@ -16,7 +16,11 @@ const (
 	// TODO(https://github.com/kubernetes-sigs/llm-instance-gateway/issues/16) Make this configurable.
 	kvCacheThreshold = 0.8
 	// TODO(https://github.com/kubernetes-sigs/llm-instance-gateway/issues/16) Make this configurable.
-	queueThreshold = 5
+	queueThresholdCritical = 5
+	// TODO(https://github.com/kubernetes-sigs/llm-instance-gateway/issues/16) Make this configurable.
+	// the threshold for queued requests to be considered low, below which we can prioritize LoRA affinity.
+	// The value of 50 was arrived at heuristically based on experiments.
+	queueingThresholdLoRA = 50
 )
 
 var (
@@ -27,9 +31,8 @@ var (
 		nextOnFailure: sheddableRequestFilter,
 	}
 
-	// lowLatencyFilter tries to minimize the latency. The heuristic is to pick a server with lower
-	// cost to load an adapter and has low KV cache, which typically yields lower latency.
-	lowLatencyFilter = &filter{
+	// queueLoRAAndKVCacheFilter applies the least queuing -> low cost LoRA -> least KV cache filters.
+	queueLoRAAndKVCacheFilter = &filter{
 		name:   "least queuing",
 		filter: leastQueuingFilterFunc,
 		nextOnSuccessOrFailure: &filter{
@@ -42,13 +45,39 @@ var (
 		},
 	}
 
+	// queueAndKVCacheFilter applies the least queuing filter followed by the least KV cache filter.
+	queueAndKVCacheFilter = &filter{
+		name:   "least queuing",
+		filter: leastQueuingFilterFunc,
+		nextOnSuccessOrFailure: &filter{
+			name:   "least KV cache percent",
+			filter: leastKVCacheFilterFunc,
+		},
+	}
+
+	lowLatencyFilter = &filter{
+		name:   "low queueing filter",
+		filter: toFilterFunc(lowQueueingPodPredicate),
+		nextOnSuccess: &filter{
+			name:          "affinity LoRA",
+			filter:        toFilterFunc(loRAAffinityPredicate),
+			nextOnSuccess: queueAndKVCacheFilter,
+			nextOnFailure: &filter{
+				name:                   "can accept LoRA Adapter",
+				filter:                 toFilterFunc(canAcceptNewLoraPredicate),
+				nextOnSuccessOrFailure: queueAndKVCacheFilter,
+			},
+		},
+		nextOnFailure: queueLoRAAndKVCacheFilter,
+	}
+
 	sheddableRequestFilter = &filter{
 		// When there is at least one model server that's not queuing requests, and still has KV
 		// cache below a certain threshold, we consider this model server has capacity to handle
 		// a sheddable request without impacting critical requests.
 		name:          "has capacity for sheddable requests",
-		filter:        toFilterFunc(noQueueAndLessThanKVCacheThresholdPredicate(queueThreshold, kvCacheThreshold)),
-		nextOnSuccess: lowLatencyFilter,
+		filter:        toFilterFunc(noQueueAndLessThanKVCacheThresholdPredicate(queueThresholdCritical, kvCacheThreshold)),
+		nextOnSuccess: queueLoRAAndKVCacheFilter,
 		// If all pods are queuing or running above the KVCache threshold, we drop the sheddable
 		// request to make room for critical requests.
 		nextOnFailure: &filter{
@@ -62,6 +91,7 @@ var (
 )
 
 func NewScheduler(pmp PodMetricsProvider) *Scheduler {
+
 	return &Scheduler{
 		podMetricsProvider: pmp,
 		filter:             defaultFilter,
```
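Read as plain control flow, the new `lowLatencyFilter` tree says: if any pod's queue is below `queueingThresholdLoRA`, prefer pods that already have the adapter loaded, then pods with a free adapter slot; otherwise fall back to the general `queueLoRAAndKVCacheFilter` chain. The hypothetical sketch below traces which branch a request would take (types and branch labels are invented for illustration, not the repo's API):

```go
package main

import "fmt"

// Simplified stand-ins for illustration; not the repo's types.
type pod struct {
	name            string
	waitingQueue    int
	activeModels    map[string]bool
	maxActiveModels int
}

const queueingThresholdLoRA = 50 // mirrors the constant added in scheduler.go

// routeBranch mirrors lowLatencyFilter's decision tree as plain control flow.
func routeBranch(model string, pods []pod) string {
	lowQueue := false
	for _, p := range pods {
		if p.waitingQueue < queueingThresholdLoRA {
			lowQueue = true
			break
		}
	}
	if !lowQueue {
		// nextOnFailure: every pod is busy, use the general chain.
		return "queueLoRAAndKVCacheFilter"
	}
	for _, p := range pods {
		if p.activeModels[model] {
			// nextOnSuccess of "affinity LoRA": adapter already loaded somewhere.
			return "affinity LoRA -> queueAndKVCacheFilter"
		}
	}
	// "can accept LoRA Adapter" uses nextOnSuccessOrFailure, so either way
	// the request proceeds to the queue/KV-cache chain next.
	return "can accept LoRA Adapter -> queueAndKVCacheFilter"
}

func main() {
	pods := []pod{
		{name: "pod-a", waitingQueue: 3, activeModels: map[string]bool{"sql-lora": true}, maxActiveModels: 2},
		{name: "pod-b", waitingQueue: 7, activeModels: map[string]bool{}, maxActiveModels: 2},
	}
	fmt.Println(routeBranch("sql-lora", pods)) // affinity branch wins
}
```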
