
Commit e121718

proposed repo structure + copy of initial proposal

1 parent 37a8575 commit e121718

File tree

7 files changed: +211 −0 lines changed

Diff for: docs/charter.md

@@ -0,0 +1 @@
TODO: describe the LLM Instance Gateway charter, and determine whether this repo is the best place for it.

@@ -0,0 +1,209 @@
# LLM Instance Gateway

<!-- toc -->

- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Gateway](#gateway)
    - [CRDs](#crds)
    - [Envoy Solution](#envoy-solution)
    - [Model Server Protocol](#model-server-protocol)
- [PoC Design Details](#poc-design-details)
  - [Overview](#overview)
  - [Request Flow](#request-flow)
  - [Pod selection algorithm in PoC](#pod-selection-algorithm-in-poc)
- [Artifacts](#artifacts)

<!-- /toc -->

## Summary

As presented in the [demo](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458), and building further upon the [joint proposal](https://docs.google.com/document/d/1BkwDlgFxSKKPHhM9kS28CdDIyJ3Xkdue3Iw1INaUkGw/edit?tab=t.0#heading=h.ajlsibmfh8wr), we propose that a gateway focused on multiplexing use cases onto shared hardware has distinct advantages in enabling efficient and fair use of a shared pool of compute across multiple use cases.

## Motivation

Advancements in fine-tuning like [LoRA](https://arxiv.org/abs/2106.09685) and [Multi-LoRA](https://arxiv.org/abs/2310.18547) have enabled multiple distinct use cases to share accelerators. As these techniques are adopted, Day 1/Day 2 operational concerns quickly become pressing.

Kubernetes has long been a standard for easing and automating the operational tasks of workloads. A mechanism (a gateway) within the K8s ecosystem is a reasonable and expected way for users to support multiple LLM use cases on shared accelerators.

### Goals

#### Proposal Goals

- Create an Inference Gateway project group for wg-serving collaboration, including a chat channel and a dedicated repo (sponsored by sig-network)

#### Gateway Goals

- Fast reconfiguration - New use cases (including LoRA adapters or client configuration) can be rolled out to or rolled back from clients in seconds, without waiting for a new model server to start.
- Efficient accelerator sharing - Use cases can use less than an accelerator, or temporarily burst, without needing to start a new model server, leading to fewer wasted accelerators and better pooling of shared capacity.
- Operational resilience - Use cases share available accelerators fairly and can have distinct priorities, latency objectives, and failure policies.
- Standardized LoRA - Simple recommended patterns for deploying and loading LoRA adapters into model servers across a wide range of Kubernetes environments.
- Composability - The approach should be composable with:
  - K8s Gateway API
  - Other gateway features and projects, including high-level LLM gateways
  - Existing deployment tools like kserve or kaito
  - Different model servers

### Non-Goals

#### Proposal Non-Goals

- Creation of a fully realized KEP

#### Gateway Non-Goals

- Replacing the features of pre-existing gateways
- Defining how serving workloads must be deployed

## Proposal

### Gateway

#### CRD(s)

To adequately achieve the above goals, we propose the addition of one or more CRDs to express:

- The boundaries of a compute pool that shares a base model
  - Including the deployment of a routing solution (PoC details below)
- A specific use case upon one or more backend pools
  - The objectives that this use case needs to achieve

The example API we showed in our demo looked like:

```
kind: LLMRoute
apiVersion: inference.x-k8s.io/v1alpha1
metadata:
  name: assistant
spec:
  parentRefs:
  - name: ai-gw
  backendRefs:
  - name: assistant
  adapter:
    name: sentiment
  priority: 100
  objectives:
  - type: OutputTokenLatency
    latency:
      value: 2s
      quantile:
        numerator: 99
  metrics:
    format: Prometheus
```

#### Envoy Solution

Any gateway solution *must* be compatible with Envoy Proxy, and have a plan for integrating these features into the Envoy ecosystem over the long term.

#### Model Server Protocol

In the PoC investigation we discovered the need for certain control and data to be exposed by the model server. For a model server to work properly with this LLM Instance Gateway, it would need to implement this protocol.

Key requirements would roughly look like:

- A method, or set of methods, to dynamically update the available LoRA catalog on a model server
- Metrics, shared via a networking-friendly mechanism (as a header on response data, or something similarly lightweight, but not in the body), for data like:
  - Adapter state
  - Available catalog
  - Queue data (per adapter)

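To make this concrete, below is a minimal sketch of how the gateway side might read such metrics from response headers. The header names (`X-Lora-Adapters`, `X-Lora-Pending`) and the Go types are illustrative assumptions; the actual protocol is still to be defined.

```go
package protocol

import (
	"net/http"
	"strconv"
	"strings"
)

// ModelServerMetrics holds the per-pod state the gateway would cache.
type ModelServerMetrics struct {
	ActiveAdapters  []string       // adapters currently loaded on the pod
	PendingRequests map[string]int // queue depth per adapter
}

// parseMetricsHeaders extracts metrics from a model server response,
// assuming hypothetical headers like:
//
//	X-Lora-Adapters: sentiment,summarize
//	X-Lora-Pending: sentiment=3,summarize=0
func parseMetricsHeaders(h http.Header) ModelServerMetrics {
	m := ModelServerMetrics{PendingRequests: map[string]int{}}
	if v := h.Get("X-Lora-Adapters"); v != "" {
		m.ActiveAdapters = strings.Split(v, ",")
	}
	for _, pair := range strings.Split(h.Get("X-Lora-Pending"), ",") {
		if name, count, ok := strings.Cut(pair, "="); ok {
			if n, err := strconv.Atoi(count); err == nil {
				m.PendingRequests[name] = n
			}
		}
	}
	return m
}
```
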
## PoC Design Details

From the proof of concept, we believe the following architecture is a starting point for this proposal:

- Envoy Proxy
  - An OSS starting point that is generally accepted and used
- Ext proc
  - A necessary tool to extend the capabilities of Envoy to allow routing based on the OpenAI `model` field (within the request body); a minimal sketch of that extraction follows this list
  - An agile tool for development of novel LLM Instance Gateway features
- CRD/K8s API interface
- Model server modifications
  - Necessary to extend existing tooling to provide the proper routing data to Envoy
  - Potentially extend further to support [ORCA](https://github.com/envoyproxy/envoy/issues/6614) headers as a method of metrics transfer

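As referenced above, ext proc is necessary because the routing key lives inside the JSON request body rather than in a header. Below is a minimal sketch of that extraction; the `model` field name follows the OpenAI API, while the surrounding function is illustrative.

```go
package extproc

import (
	"encoding/json"
	"fmt"
)

// openAIRequest captures only the field needed for routing from an
// OpenAI-style completion request body.
type openAIRequest struct {
	Model string `json:"model"`
}

// modelFromBody returns the target model/adapter name from a buffered
// request body so the gateway can select a backend pod for it.
func modelFromBody(body []byte) (string, error) {
	var req openAIRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return "", fmt.Errorf("parsing request body: %w", err)
	}
	if req.Model == "" {
		return "", fmt.Errorf("request body has no model field")
	}
	return req.Model, nil
}
```
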
### Overview

Our very high-level diagram of how this looked:

![high level design](./images/high_level_design.png)

To briefly describe how the components work together:

- When an `LLMRoute` is defined, our gateway recognizes this new service and allows traffic for the specified adapter to be admitted to the backend pool.
  - We support and expect the OpenAI API spec as the default when reading the adapter.
- Incoming traffic for a validated service is then routed to ext proc, where routing and fairness decisions are made.
- We attempt to route to a model server that has the adapter already loaded, so long as there is batch capacity.

### Request Flow

Below is an example of the life of a request using the described design:

![request flow](./images/flow_diagram.png)

> Notes:
>
> 1. Ext Proc: External processing calls an external gRPC service to
> process HTTP requests and responses.
>
> 2. Original Dst: An original destination cluster can be used when incoming
> connections are redirected to Envoy either via an iptables REDIRECT or
> TPROXY target or with Proxy Protocol. In these cases, requests routed to an
> original destination cluster are forwarded to upstream hosts as addressed by
> the redirection metadata, without any explicit host configuration or
> upstream host discovery. We implemented this using the bootstrap feature of
> Envoy Gateway.

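For reference, the ext proc service in note 1 is a bidirectional-streaming gRPC server. Below is a skeleton of what such a server looks like, assuming Envoy's `ext_proc` v3 API from go-control-plane and a processing mode that only sends request headers and bodies; all routing logic is elided.

```go
package extproc

import (
	"io"

	extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
)

// server implements Envoy's ExternalProcessor service. Envoy streams each
// phase of the HTTP request to Process and expects one response per message.
type server struct{}

func (s *server) Process(srv extProcPb.ExternalProcessor_ProcessServer) error {
	for {
		req, err := srv.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}

		resp := &extProcPb.ProcessingResponse{}
		switch req.Request.(type) {
		case *extProcPb.ProcessingRequest_RequestBody:
			// This is where the model field is parsed, the metrics cache is
			// consulted, and a header mutation would be returned that steers
			// the original-destination cluster at the selected pod.
			resp.Response = &extProcPb.ProcessingResponse_RequestBody{
				RequestBody: &extProcPb.BodyResponse{},
			}
		default: // request headers
			resp.Response = &extProcPb.ProcessingResponse_RequestHeaders{
				RequestHeaders: &extProcPb.HeadersResponse{},
			}
		}
		if err := srv.Send(resp); err != nil {
			return err
		}
	}
}
```
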
### Pod selection algorithm in PoC

Metrics stored in the ext proc cache:

- Active adapters on each pod
- Number of pending requests for each adapter on each pod

Given a request, read the relevant metrics from the cache and find which pods have that LoRA adapter loaded (the full selection logic is sketched below):

1. Out of the set of pods that (a) have the LoRA adapter loaded and (b) have a pending-request count for that adapter below a threshold, pick the one with the most pending requests (we pick the most to prevent flopping).
2. If no pod satisfies both (a) and (b), then pick a pod by (in the following priority):
   1. Least number of active adapters
   2. Least total pending requests

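A sketch of this selection logic in Go; the types are illustrative stand-ins for the ext proc cache state described above.

```go
package extproc

// podMetrics is the per-pod state kept in the ext proc cache.
type podMetrics struct {
	Name           string
	ActiveAdapters map[string]int // adapter name -> pending requests
	TotalPending   int
}

// pickPod prefers a pod that already has the adapter loaded and is under the
// queue threshold, choosing the fullest such pod to prevent flopping; if none
// qualifies, it falls back to the least-loaded pod.
func pickPod(pods []podMetrics, adapter string, threshold int) *podMetrics {
	var best *podMetrics
	for i := range pods {
		p := &pods[i]
		pending, loaded := p.ActiveAdapters[adapter]
		if !loaded || pending >= threshold {
			continue
		}
		// Most pending requests below the threshold wins (criterion 1).
		if best == nil || pending > best.ActiveAdapters[adapter] {
			best = p
		}
	}
	if best != nil {
		return best
	}
	// Fallback (criterion 2): fewest active adapters, then fewest total
	// pending requests.
	for i := range pods {
		p := &pods[i]
		if best == nil ||
			len(p.ActiveAdapters) < len(best.ActiveAdapters) ||
			(len(p.ActiveAdapters) == len(best.ActiveAdapters) &&
				p.TotalPending < best.TotalPending) {
			best = p
		}
	}
	return best
}
```
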
### Artifacts

- [Ext-proc/Envoy/Benchmarking repo](https://github.com/tomatillo-and-multiverse/lora-inference-gateway)
  - Repo we used to develop the ext proc image used in the PoC
  - Also contains the manifests required to deploy the gateway
- [vLLM fork](https://github.com/kaushikmitr/vllm)
- Presentation:
  - [Slides](https://docs.google.com/presentation/d/1I1XDf6fQQEtHxJtZxFdIaUcUA3lLBC7neW823diWS78/edit?usp=sharing)
  - [Recording](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458)
- [PoC Design & Experimentation data](https://docs.google.com/document/d/17wB0BgeV8JrGtccxZqkOqFyNC4gPBNqdKg8Oe9xMkio/edit#heading=h.eeeqp85g68qy)

Diff for: docs/protocols/model-server/protocol-desc.md

@@ -0,0 +1 @@
TODO: describe here the model server protocol that a model server should implement in order to integrate with the LLM Instance Gateway.

Diff for: examples/poc/README.md

Whitespace-only changes.

Diff for: examples/poc/ext-proc/go-code.go

Whitespace-only changes.
