scheduling changes for lora affinity load balancing (#423)
* scheduling changes for lora affinity load balancing
* refactor unit tests, address comments
* restore vllm deployment manifest
* update README for model server protocol to add waiting lora adapters
* remove unused variables
* removed unused func
* fix model protocol readme
* fix hermetic test for select active lora, low queue
* update comment in metrics.go in vllm backend
* add filter test TestLoRASoftAffinityDistribution
* restore vllm manifest
* update unit test
pkg/epp/scheduling/scheduler.go
+9 -11
@@ -36,8 +36,11 @@ const (
 	queueThresholdCritical = 5
 	// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16) Make this configurable.
 	// the threshold for queued requests to be considered low below which we can prioritize LoRA affinity.
-	// The value of 50 is arrived heuristicically based on experiments.
-	queueingThresholdLoRA = 50
+	// The value of 128 is arrived heuristicically based on experiments.
+	queueingThresholdLoRA = 128
+	// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16) Make this configurable.
+	// loraAffinityThreshold indicates the probability with which we prefer a pod with LoRA affinity over a pod without but having room to fit more LoRA adapters.
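
The two comments in this hunk describe a two-stage idea: first keep only pods whose queue is below `queueingThresholdLoRA`, then, with probability `loraAffinityThreshold`, prefer pods that already hold the requested adapter over pods that merely have room for it. The sketch below illustrates that idea only and is not the code from this commit. The `Pod` type, its fields, the helpers `lowQueue` and `softAffinity`, the strict `<` comparison, and the value `0.9` for `loraAffinityThreshold` are all assumptions made for illustration; the excerpt shows neither the real value nor the real filter.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Pod is a hypothetical, simplified view of a model server replica.
type Pod struct {
	Name        string
	QueueSize   int  // requests waiting on the pod
	HasAffinity bool // requested adapter is already active or waiting there
	HasRoom     bool // pod can still load another adapter
}

const (
	queueingThresholdLoRA = 128 // value taken from the diff
	loraAffinityThreshold = 0.9 // placeholder: the real value is not in the excerpt
)

// lowQueue keeps pods whose queue is short enough for LoRA affinity to matter.
// Using a strict "<" is an assumption based on the wording "below which".
func lowQueue(pods []Pod) []Pod {
	var out []Pod
	for _, p := range pods {
		if p.QueueSize < queueingThresholdLoRA {
			out = append(out, p)
		}
	}
	return out
}

// softAffinity splits pods into an affinity group and a has-room group and
// returns one of them. draw is a uniform sample in [0, 1); it is passed in
// so the choice can be tested deterministically.
func softAffinity(pods []Pod, draw float64) []Pod {
	var affinity, room []Pod
	for _, p := range pods {
		switch {
		case p.HasAffinity:
			affinity = append(affinity, p)
		case p.HasRoom:
			room = append(room, p)
		}
	}
	// If only one group has candidates, there is nothing to choose between.
	if len(affinity) == 0 {
		return room
	}
	if len(room) == 0 {
		return affinity
	}
	// Prefer affinity with probability loraAffinityThreshold.
	if draw < loraAffinityThreshold {
		return affinity
	}
	return room
}

func main() {
	pods := []Pod{
		{Name: "a", QueueSize: 3, HasAffinity: true},
		{Name: "b", QueueSize: 10, HasRoom: true},
		{Name: "c", QueueSize: 200, HasAffinity: true}, // dropped: queue too long
	}
	chosen := softAffinity(lowQueue(pods), rand.Float64())
	fmt.Println(chosen)
}
```

Keeping the choice probabilistic rather than always picking the affinity group lets a small share of traffic reach pods with spare adapter capacity, so requests for a popular adapter do not all pile onto the same replicas.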