The Problem
As of writing this document, the inference gateway API proposes to let users explicitly set a latency objective named DesiredAveragePerOutputTokenLatencyAtP95OverMultipleRequests. This is meant to capture the user's high-level serving intent. However, setting aside the absurdly long name, which we can obviously change, committing to a specific latency objective is challenging:
Its semantics may confuse users: some may assume it is an SLO that the inference gateway can defend, which is not the case.
We currently use it to decide whether a request is critical or sheddable; however, is a latency objective really the right way to define priority? This further underscores how confusing the parameter's semantics are.
The routing algorithm can provide significant efficiency wins without needing to know this intent: it can route based on server load metrics (kv-cache utilization), LoRA affinity, bucketization, and queueing, as sketched below.
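To make that last point concrete, here is a minimal, purely illustrative Go sketch of routing on load metrics and LoRA affinity alone, with no latency objective in sight. The struct fields, weights, and function names are assumptions for this example, not the extension's actual implementation.

```go
package main

import "fmt"

// Endpoint holds the server-load signals mentioned above; the field names
// here are hypothetical, not the extension's actual metric names.
type Endpoint struct {
	Name               string
	KVCacheUtilization float64         // fraction of KV cache in use, 0.0-1.0
	QueueDepth         int             // requests waiting on this server
	LoadedAdapters     map[string]bool // LoRA adapters already resident
}

// pickEndpoint scores endpoints using only load metrics and LoRA affinity,
// with no knowledge of any per-request latency objective.
func pickEndpoint(endpoints []Endpoint, adapter string) Endpoint {
	best, bestScore := endpoints[0], -1.0
	for _, ep := range endpoints {
		// Lower KV-cache utilization and shorter queues score higher;
		// the 0.1 weight is arbitrary for this sketch.
		score := (1.0 - ep.KVCacheUtilization) - 0.1*float64(ep.QueueDepth)
		// Prefer servers that already have the requested adapter loaded.
		if adapter != "" && ep.LoadedAdapters[adapter] {
			score += 0.5
		}
		if score > bestScore {
			best, bestScore = ep, score
		}
	}
	return best
}

func main() {
	eps := []Endpoint{
		{Name: "pod-a", KVCacheUtilization: 0.9, QueueDepth: 4},
		{Name: "pod-b", KVCacheUtilization: 0.3, QueueDepth: 1,
			LoadedAdapters: map[string]bool{"sql-lora": true}},
	}
	fmt.Println(pickEndpoint(eps, "sql-lora").Name) // prints "pod-b"
}
```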
Proposal
I propose to drop the latency objective for now and instead add a Criticality parameter that explicitly captures priority. We can start with two values, Critical and Sheddable (sketched below). The semantics of this parameter are easier for users to reason about: use Critical for production workloads and Sheddable for test/dev workloads.
This leaves the door open to adding a latency objective for Critical workloads in the future.
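For illustration only, a rough Go sketch of how such a parameter could be expressed as a Kubernetes-style API type; the type name, field placement, and default are assumptions, not a committed API shape.

```go
package v1alpha1

// Criticality captures the priority of a workload. This is a hypothetical
// sketch of the proposed field; the real shape would be decided in the
// inference gateway project itself.
// +kubebuilder:validation:Enum=Critical;Sheddable
type Criticality string

const (
	// Critical marks production traffic that should not be dropped under load.
	Critical Criticality = "Critical"
	// Sheddable marks test/dev traffic that may be shed when capacity is tight.
	Sheddable Criticality = "Sheddable"
)

// InferenceModelSpec shows where the field could live; only Criticality is
// relevant to this proposal, the rest is placeholder.
type InferenceModelSpec struct {
	ModelName string `json:"modelName"`

	// Criticality defaults to Sheddable when unset (an assumption for this sketch).
	// +optional
	Criticality *Criticality `json:"criticality,omitempty"`
}
```

Modeling this as a string enum rather than a boolean keeps the type open-ended, which is what leaves room to attach a latency objective to Critical workloads later.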