The Problem
As of writing this document, the inference gateway API proposes to let users explicitly set a latency objective named DesiredAveragePerOutputTokenLatencyAtP95OverMultipleRequests. This is meant to capture the user's high-level serving intent. However, setting aside the absurdly long name, which we can obviously change, committing to a specific latency objective is challenging:
Its semantics may confuse users: some may assume it is an SLO that the inference gateway can defend, which is not the case.
We currently use it to decide whether a request is critical or sheddable; however, is a latency objective really the right way to define priority? This further underscores how confusing the parameter's semantics are.
The routing algorithm can provide significant efficiency wins without needing to know this intent: it can route based on server load metrics (kv-cache utilization), LoRA affinity, bucketization, and queueing, as sketched below.
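To make that last point concrete, here is a minimal, purely illustrative Go sketch of routing on load metrics and LoRA affinity alone, with no latency objective in sight. The struct fields, weights, and function names are assumptions for this example, not the extension's actual implementation.

```go
package main

import "fmt"

// Endpoint holds the server-load signals mentioned above; the field names
// here are hypothetical, not the extension's actual metric names.
type Endpoint struct {
	Name               string
	KVCacheUtilization float64         // fraction of KV cache in use, 0.0-1.0
	QueueDepth         int             // requests waiting on this server
	LoadedAdapters     map[string]bool // LoRA adapters already resident
}

// pickEndpoint scores endpoints using only load metrics and LoRA affinity,
// with no knowledge of any per-request latency objective.
func pickEndpoint(endpoints []Endpoint, adapter string) Endpoint {
	best, bestScore := endpoints[0], -1.0
	for _, ep := range endpoints {
		// Lower KV-cache utilization and shorter queues score higher;
		// the 0.1 weight is arbitrary for this sketch.
		score := (1.0 - ep.KVCacheUtilization) - 0.1*float64(ep.QueueDepth)
		// Prefer servers that already have the requested adapter loaded.
		if adapter != "" && ep.LoadedAdapters[adapter] {
			score += 0.5
		}
		if score > bestScore {
			best, bestScore = ep, score
		}
	}
	return best
}

func main() {
	eps := []Endpoint{
		{Name: "pod-a", KVCacheUtilization: 0.9, QueueDepth: 4},
		{Name: "pod-b", KVCacheUtilization: 0.3, QueueDepth: 1,
			LoadedAdapters: map[string]bool{"sql-lora": true}},
	}
	fmt.Println(pickEndpoint(eps, "sql-lora").Name) // prints "pod-b"
}
```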
Proposal
I propose to drop the latency objective for now and instead add a Criticality parameter that explicitly captures priority. We can start with two values, Critical and Sheddable (sketched below). The semantics of this parameter are easier for users to reason about: use Critical for production workloads and Sheddable for test/dev workloads.
This leaves the door open to adding a latency objective for Critical workloads in the future.
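For illustration only, a rough Go sketch of how such a parameter could be expressed as a Kubernetes-style API type; the type name, field placement, and default are assumptions, not a committed API shape.

```go
package v1alpha1

// Criticality captures the priority of a workload. This is a hypothetical
// sketch of the proposed field; the real shape would be decided in the
// inference gateway project itself.
// +kubebuilder:validation:Enum=Critical;Sheddable
type Criticality string

const (
	// Critical marks production traffic that should not be dropped under load.
	Critical Criticality = "Critical"
	// Sheddable marks test/dev traffic that may be shed when capacity is tight.
	Sheddable Criticality = "Sheddable"
)

// InferenceModelSpec shows where the field could live; only Criticality is
// relevant to this proposal, the rest is placeholder.
type InferenceModelSpec struct {
	ModelName string `json:"modelName"`

	// Criticality defaults to Sheddable when unset (an assumption for this sketch).
	// +optional
	Criticality *Criticality `json:"criticality,omitempty"`
}
```

Modeling this as a string enum rather than a boolean keeps the type open-ended, which is what leaves room to attach a latency objective to Critical workloads later.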