cc @ahg-g We discussed this earlier. I put together my thoughts.
Background
A solid understanding of the runtime state of the model servers (mainly load information) is a prerequisite for making intelligent request scheduling decisions that optimize for different goals. Today EPP periodically scrapes model server load metrics (referred to as the async probing method below) to implement the LoRA affinity and latency-optimized scheduling algorithms.
Here I discuss various solutions for getting a fresh view of model server state.
Solution 1: Model Server State Reporting
State reporting refers to having the model server report its actual state (mainly load).
1. Async probing
The async probing method is straightforward to implement and very effective in many scenarios. However, its effectiveness depends heavily on the probing frequency. For example, if a fresh pool receives a traffic spike, LoRA affinity is less effective because EPP may have already spread the adapters before the next refresh of model server metrics allows it to “pin” the adapters to servers.
Pros:
Cons:
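To illustrate why the probing interval matters, here is a minimal sketch (in Python, with hypothetical names such as `MetricsCache` and `refresh_once`; the real EPP is not structured this way) of a periodic scrape feeding a per-pod cache whose entries go stale between refreshes:

```python
# Hypothetical sketch of async probing: a refresh loop updates a per-pod
# metrics cache; readers see data that is up to one probing interval old.
class MetricsCache:
    def __init__(self):
        self.metrics = {}  # pod -> (load_info, scrape_timestamp)

    def update(self, pod, load_info, now):
        self.metrics[pod] = (load_info, now)

    def get(self, pod, now, max_age):
        # Return cached load info, or None if it is older than max_age,
        # in which case the scheduler must fall back to a default policy.
        entry = self.metrics.get(pod)
        if entry is None or now - entry[1] > max_age:
            return None
        return entry[0]

def refresh_once(cache, pods, scrape, now):
    # One tick of the probing loop; in a real system this runs on a timer
    # every probing interval, and `scrape` hits the pod's metrics endpoint.
    for pod in pods:
        cache.update(pod, scrape(pod), now)
```

Any scheduling decision made between two ticks of `refresh_once` is based on data up to one interval old, which is exactly the window in which a traffic spike can spread adapters before affinity kicks in.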
2. ORCA response header
ORCA defines a protocol for reporting load information in response headers. This is a low-cost way to collect load information; however, it works best in high-QPS, low-latency scenarios.
Pros:
Cons:
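For context, a simplified parser for an ORCA-style text load report is sketched below. The `TEXT key=value,...` shape follows ORCA's text format in spirit, but the exact header names and encodings (text, JSON, binary protobuf) are defined by the ORCA spec, not by this sketch:

```python
def parse_load_header(value):
    # Parse a simplified ORCA-style text load report of the form
    # "TEXT key1=v1,key2=v2" into a dict of float metrics.
    fmt, _, body = value.partition(" ")
    if fmt != "TEXT":
        raise ValueError("unsupported report format: " + fmt)
    out = {}
    for pair in body.split(","):
        key, _, val = pair.partition("=")
        out[key.strip()] = float(val)
    return out
```

Since the report rides on responses, its freshness is bounded by how often a given server happens to answer a request, which is why the approach degrades at low QPS.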
3. Synchronous probing
EPP may synchronously probe (a subset of) the model servers to get more up-to-date information, provided that the probing endpoint is fast.
Pros:
Cons:
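A minimal sketch of what such a probe might look like, assuming a fast per-pod probe function (the names and structure here are illustrative, not the actual EPP implementation): probe only the candidate subset in parallel, bound the wait by a deadline, and use whatever came back in time:

```python
import concurrent.futures as cf

def probe_candidates(pods, probe_fn, timeout_s):
    # Synchronously probe a small candidate subset in parallel, bounded
    # by a deadline; return whatever fresh load info arrived in time.
    results = {}
    with cf.ThreadPoolExecutor(max_workers=max(len(pods), 1)) as ex:
        futures = {ex.submit(probe_fn, p): p for p in pods}
        try:
            for fut in cf.as_completed(futures, timeout=timeout_s):
                try:
                    results[futures[fut]] = fut.result()
                except Exception:
                    pass  # a failed probe just means no fresh data for that pod
        except cf.TimeoutError:
            pass  # deadline hit; proceed with partial results
    return results
```

The deadline is the key design point: the probe sits on the request critical path, so its cost must be bounded even when a server is slow to answer.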
Solution 2: Approximation/Prediction
EPP can approximate or predict model server state. The proposed prefix affinity routing essentially approximates the prefix cache state of the model servers by tracking request history.
Pros:
Cons:
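To make the approximation idea concrete, here is a hedged sketch of tracking request history for prefix affinity: remember which pod recently served each prompt prefix so future requests with the same prefix can be routed there. Class and parameter names (`ApproxPrefixIndex`, `capacity`, `prefix_len`) are hypothetical, not the proposed design:

```python
import hashlib
from collections import OrderedDict

class ApproxPrefixIndex:
    # Approximate the servers' prefix cache state on the EPP side by
    # remembering which pod recently served each prompt prefix.
    # Bounded LRU keeps memory small; eviction mirrors the fact that
    # server-side caches also forget old prefixes.
    def __init__(self, capacity=1024, prefix_len=64):
        self.capacity = capacity
        self.prefix_len = prefix_len
        self.index = OrderedDict()  # prefix-hash -> pod

    def _key(self, prompt):
        return hashlib.sha256(prompt[: self.prefix_len].encode()).hexdigest()

    def record(self, prompt, pod):
        key = self._key(prompt)
        self.index[key] = pod
        self.index.move_to_end(key)
        if len(self.index) > self.capacity:
            self.index.popitem(last=False)  # evict least recently used

    def lookup(self, prompt):
        # Returns the pod that likely has this prefix cached, or None.
        return self.index.get(self._key(prompt))
```

The index is only ever a guess: it can't see evictions on the server side, which is the accuracy gap that motivates combining it with reported state (Solution 3).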
Solution 3: Reporting + Approximation/Prediction
We can combine the async probing results with prediction. For example, we can use the prediction method for LoRA affinity while a model server's state is fresh (no adapters loaded yet), and then use the reported adapter state once that's available. Similarly with prefix cache, we can have model servers report their currently cached indexes, and maintain a very small approximate cache (e.g., of the last 100ms, 2x the probing interval) on the EPP to compensate for the loss of accuracy due to the probing interval.
Pros:
Cons:
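The LoRA example above can be sketched in a few lines: take the adapter set from the last metrics scrape and union in any adapter the EPP itself routed within a short recent window (the function name and parameters are illustrative):

```python
def effective_adapters(reported, recent, now, window_s):
    # Combine reported state with a short-lived local approximation:
    # `reported` is the adapter set from the last metrics scrape, and
    # `recent` maps adapter -> time the EPP last routed a request for it.
    # Anything routed within `window_s` (e.g. ~2x the probing interval)
    # is assumed loaded even if the scrape hasn't observed it yet.
    assumed = {a for a, t in recent.items() if now - t <= window_s}
    return set(reported) | assumed
```

Keeping the window at roughly twice the probing interval means the local approximation only ever covers the blind spot between scrapes, after which the reported state takes over as the source of truth.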
Recommendation