Revisiting the Latency Objective #69

Closed
Tracked by #73
ahg-g opened this issue Dec 6, 2024 · 1 comment

ahg-g commented Dec 6, 2024

The Problem

As of this writing, the inference gateway API proposes letting users explicitly set a latency objective named DesiredAveragePerOutputTokenLatencyAtP95OverMultipleRequests. This is meant to capture the user’s high-level serving intent. Setting aside the absurdly long name, which we can obviously change, committing to a specific latency objective is challenging:

- Its semantics may confuse users: some may take it for an SLO that the inference gateway can defend, which is not true.
- We currently use it to decide whether a request is critical or sheddable, but is this really the right way to define priority? This further underscores how confusing the parameter’s semantics are.
- The routing algorithm can deliver significant efficiency wins without knowing this intent; it can route based on server load metrics (kv-cache utilization), LoRA affinity, bucketization and queueing.
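
To make the last point concrete, here is a minimal sketch of routing on load signals alone, with no latency objective in the picture. Everything in it is hypothetical and for illustration only: the `Server` fields, the `pick_server` helper, and the ordering of the signals (LoRA affinity first, then queue length, then kv-cache utilization) are assumptions, not the gateway's actual algorithm. Bucketization is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical snapshot of one model server's load metrics."""
    name: str
    kv_cache_utilization: float  # fraction in [0.0, 1.0]
    queue_length: int            # requests currently waiting on this server
    loaded_adapters: set = field(default_factory=set)  # LoRA adapters already loaded


def pick_server(servers, adapter=None):
    """Pick a server from load metrics only; no latency objective is consulted.

    Assumed ordering (illustrative): servers that already hold the requested
    LoRA adapter come first, then the shortest queue, then the lowest
    kv-cache utilization.
    """
    if not servers:
        raise ValueError("no servers available")

    def rank(server):
        has_adapter = adapter is not None and adapter in server.loaded_adapters
        # False sorts before True, so negate to put affinity matches first.
        return (not has_adapter, server.queue_length, server.kv_cache_utilization)

    return min(servers, key=rank)
```

For example, given `Server("pod-a", 0.90, 4, {"lora-x"})` and `Server("pod-b", 0.30, 0)`, a request for `lora-x` goes to `pod-a` because of adapter affinity, while a request with no adapter goes to the less loaded `pod-b`.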

Proposal

I propose dropping the latency objective for now and adding a Criticality parameter instead to capture priority explicitly. We can start with two values: Critical and Sheddable. The semantics of this parameter are easier for users to reason about: use Critical for production workloads and Sheddable for test/dev workloads.

This leaves the door open to adding a latency objective for Critical workloads in the future.
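
As a rough sketch of how a two-valued Criticality could drive shedding: the enum below mirrors the two proposed values, while the `should_admit` helper, its use of kv-cache utilization as the saturation signal, and the 0.8 threshold are all assumptions made up for illustration, not part of the proposal.

```python
from enum import Enum


class Criticality(Enum):
    """The two values proposed above."""
    CRITICAL = "Critical"    # production workloads
    SHEDDABLE = "Sheddable"  # test/dev workloads


def should_admit(criticality, kv_cache_utilization, shed_threshold=0.8):
    """Decide whether to admit a request (hypothetical policy).

    Critical requests are always admitted. Sheddable requests are dropped
    once kv-cache utilization reaches shed_threshold; the 0.8 default is an
    arbitrary illustrative value.
    """
    if criticality is Criticality.CRITICAL:
        return True
    return kv_cache_utilization < shed_threshold
```

Under this policy a Sheddable request is rejected at 95% utilization, while a Critical request at the same load is still admitted. A later latency objective could then refine how Critical requests are handled without changing the meaning of Sheddable.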

ahg-g commented Dec 6, 2024

/assign @ahg-g
