Effective Model Server State Probing in EPP #678

liu-cong · 2025-04-10T23:45:26Z

liu-cong
Apr 10, 2025

Background

A solid understanding of the runtime state of the model servers (mainly load information) is the prerequisite for making intelligent request scheduling decisions to optimize for different goals. Today EPP periodically scrapes model server load metrics (referred asyn probing method below)to implement LoRA affinity and latency optimized scheduling algorithms.

Here I discuss various solutions to get a fresh state of the model servers.

Solution 1: Model Server State Reporting

State Reporting refers to the method of having the model server report the actual state (mainly load).

1. Async probing

The async probing method is straightforward to implement and very effective in many scenarios. However its effectiveness highly depends on the probing frequency. For example, if a fresh pool receives a traffic spike, LoRA affinity is less effective because EPP may have spread the adapters before getting the next refresh of model server metrics to “pin” the adapters to servers.

Pros:

Easy to implement
Tunable probing interval to trade off between accuracy and cost

Cons:

Highly dependent on probing frequency. While increasing frequency increases accuracy, it incurs more network cost, and there will be eventually a limit.

2. ORCA response header

ORCA defines a protocol to report load information in response headers. This is a low cost way to collect load information. However, it works best in high QPS and low latency scenarios.

Pros:

Low cost

Cons:

Limited to high QPS and low latency requests

3. Synchronous probing

EPP may do a synchronous probing to (a subset of) model servers to get more up to date information, provided that the probing endpoint is fast.

Pros:

More up to date information than async probing
Less probing overhead at low QPS (however this can become a con in high QPS)

Cons:

Adds probing latency to the request path

Solution 2: Approximation/Prediction

EPP can approximate or predict model server state. The proposed prefix affinity routing basically approximates prefix caching state in model servers, by tracking request history.

Pros:

No probing cost
Ability to react based on predicated future state

Cons:

Usually need to maintain a state
Usually not 100% accurate
Accuracy drops if model server gets traffic outside of EPP

Solution 3: Reporting + Approximation/Prediction

We can combine the async probing result with prediction. For example, we can use the prediction method for LoRA affinity, if the model server's state is fresh (don’t have any adapters loaded yet). And then use the reported adapter state once that’s available. Similarly with prefix cache, we can have model servers report their currently cached indexes, and maintain a very small approximate cache (e.g, of last 100ms, 2X of probing interval) on the EPP to compensate for the loss of accuracy due to probing interval.

Pros:

Provides the best accuracy

Cons:

Combining 2 sources of states can be complex (e.g., conflict resolution)

Recommendation

Async probing is the mostly widely used method currently. Tune the probing interval to meet your traffic needs. Generally you need lower interval for high QPS per model server.
Consider making the probing interval adaptive based on QPS data.
Consider Solution 3 if none of the above solves the problem.

liu-cong · 2025-04-10T23:46:19Z

liu-cong
Apr 10, 2025
Author

cc @ahg-g We discussed this earlier. I put together my thoughts.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Effective Model Server State Probing in EPP #678

{{title}}

Replies: 1 comment

{{title}}

Select a reply

Effective Model Server State Probing in EPP #678

liu-cong Apr 10, 2025

Background

Solution 1: Model Server State Reporting

1. Async probing

2. ORCA response header

3. Synchronous probing

Solution 2: Approximation/Prediction

Solution 3: Reporting + Approximation/Prediction

Recommendation

Replies: 1 comment

liu-cong Apr 10, 2025 Author

liu-cong
Apr 10, 2025

liu-cong
Apr 10, 2025
Author