
[RFC]: TPU V1 Sampler planning #16268

Open
3 of 11 tasks
NickLucche opened this issue Apr 8, 2025 · 6 comments

@NickLucche
Contributor

NickLucche commented Apr 8, 2025

Motivation.

I'd like to gather some input on how to move forward with sampling support, and also provide a brief recap of the current state and the planned support.

At a high level, the current design splits the model forward pass and sampling into two separate graphs.
As of now (f2ebb6f54), only temperature and min_p have been intentionally enabled.
As more techniques are added, the sampling graph will grow in size (vertically, with sequential ops), and performance may need monitoring, since we're simply evaluating more operations at runtime.
To clarify, even when an option is not enabled, we still evaluate a no-op version that goes through the same ops in the graph (e.g. top-p with p=1).
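
As a rough illustration (apply_top_p here is an illustrative sketch, not the actual vLLM kernel), a shape-stable top-p op still pays for its sort/cumsum/scatter even when p=1 disables it:

```python
import torch

def apply_top_p(logits: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    # logits: [num_reqs, vocab]; p: [num_reqs], with p = 1.0 meaning "disabled".
    probs = logits.softmax(dim=-1)
    probs_sorted, idx = probs.sort(dim=-1, descending=True)
    cum_probs = probs_sorted.cumsum(dim=-1)
    # Drop tokens whose preceding cumulative mass already exceeds p. With
    # p = 1.0 nothing is ever dropped, yet the sort/cumsum/scatter still run
    # on every decoding step.
    drop_sorted = (cum_probs - probs_sorted) > p.unsqueeze(-1)
    drop = torch.zeros_like(drop_sorted).scatter(-1, idx, drop_sorted)
    return logits.masked_fill(drop, float("-inf"))

# e.g. p = torch.ones(num_reqs) disables top-p, but the compiled graph is identical.
```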

Proposed Change.

Following #15489, a few concerns have been raised regarding performance when enabling top-k, which adds the very first op to the initial sampling graph, so I'd like to re-evaluate the current approach.
At the opposite end of the spectrum, one could ideally provide a sampling graph for each combination of parameters.
While this is unfeasible due to the number of parameters sampling needs to support, a middle-ground approach is to pre-compile a set of common sampling-param combinations and route requests to the "correct" graph.
The main issue I see here is batching: since every request may specify different sampling params, either we identify the superset for the current batch and route to the corresponding graph, or each request is executed on a separate graph, which I believe would hurt performance even more. That said, I still think most requests will fall into the temperature-only "bucket", followed by the top-k/top-p one, so one could implement only the most popular routes, as sketched below. I have no production data to back this assertion, though, so don't quote me on that.
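
To make the routing idea concrete, here is a hypothetical sketch (SamplingBucket and pick_bucket are made-up names, not vLLM APIs) of computing the per-batch superset of sampling features and dispatching on it:

```python
from enum import IntFlag, auto

class SamplingBucket(IntFlag):
    GREEDY = 0
    TEMPERATURE = auto()
    TOP_K_TOP_P = auto()
    PENALTIES = auto()

def pick_bucket(requests) -> SamplingBucket:
    # Superset of the sampling features needed by any request in the batch.
    bucket = SamplingBucket.GREEDY
    for req in requests:
        if req.temperature > 0.0:
            bucket |= SamplingBucket.TEMPERATURE
        if req.top_k > 0 or req.top_p < 1.0:
            bucket |= SamplingBucket.TOP_K_TOP_P
        if req.frequency_penalty or req.presence_penalty or req.repetition_penalty != 1.0:
            bucket |= SamplingBucket.PENALTIES
    return bucket

# compiled_samplers: dict[SamplingBucket, Callable], pre-built for the most
# popular combinations; anything else falls back to the full sampling graph:
#   sampler = compiled_samplers.get(pick_bucket(batch), full_graph_sampler)
```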

I think this is the main point to clarify before moving on and expanding the number of supported parameters.

Please note the above is based on the assumption that latency is indeed going up. To clear up any doubts, I think PR #16022 will go a long way toward allowing easy benchmarking of sampling parameters.


Moving forward, I've compiled a list of parameters to support, along with the effort needed to implement them and my own suggestions.
We can also use it to track progress.

  • temperature/min_p
  • top-k/top-p: we already have an implementation for top-k in [V1][TPU] Enable Top K #15489.
  • logprobs/sampling_metadata.max_num_logprobs: similarly to what we do elsewhere, we need to compile for different max_num_logprobs values, as the output is Bxmax_num_logprobs (plus there's a torch.topk call), unless we fix it to some arbitrary value. [TPU][V1] Add support for top-logprobs #17072
  • sampling_metadata.prompt_token_ids for penalties. This should be fine as-is, given we're already compiling for different (padded) input sizes. It just can't be optional: None vs. a value count as different inputs.
  • sampling_metadata.output_token_ids for penalties. It's already converted into a padded tensor.
  • penalties:
    • get_token_bin_counts_and_mask uses a scatter_add op that may be slow on TPU
    • the *penalties tensors are already of shape num_seqs, which we pre-compile, so they're fine.
    • there are multiple lines where a tensor is sliced in a value-dependent way (recompilation risk), e.g. logits[logits > 0]. We can probably replace this with a masked, shape-stable op (see the sketch after this list).
  • sampling_metadata.min_tokens penalty must be re-implemented and vectorized (it currently uses a for loop over an input dict, so the graph would be dynamic). I am less familiar with this implementation, so TBD.
  • sampling_metadata.logit_bias: the current interface needs to be rethought because it can introduce dynamism. We could create a BxV matrix (B padded and pre-compiled) to pack the preferences from the list[dict] (see the packing sketch after this list). This would work, but the expansion factor can obviously be quite big (e.g. downgrading a single token would materialize a whole BxV matrix). Alternatively, we could provide different pre-compiled values for V (2, 4, ...) at the cost of increased complexity and longer compilation time. Also, the current CUDA impl is highly unoptimized.
  • sampling_metadata.allowed_token_ids_mask is fine as-is, no effort required IMO. It just can't be None on a single graph.
  • sampling_metadata.bad_words_token_ids: probably better to support the more general logit_bias option instead.
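
For the penalties item above, a minimal sketch of what a shape-stable rewrite could look like (torch.where is used here in place of the masked_fill mentioned above; apply_repetition_penalty and seen_mask are illustrative names, not vLLM code):

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             penalties: torch.Tensor,
                             seen_mask: torch.Tensor) -> torch.Tensor:
    # logits: [num_reqs, vocab]; penalties: [num_reqs]; seen_mask: bool [num_reqs, vocab].
    p = torch.where(seen_mask, penalties.unsqueeze(-1), torch.ones_like(logits))
    # Value-dependent form (recompilation risk on XLA):
    #   logits[logits > 0] /= p[logits > 0]; logits[logits <= 0] *= p[logits <= 0]
    # Shape-stable form, same [num_reqs, vocab] output shape:
    return torch.where(logits > 0, logits / p, logits * p)
```

And for logit_bias, a rough sketch (pack_logit_bias is a hypothetical helper) of packing the per-request dicts into a dense BxV tensor on the host, so the graph only ever sees one fixed-shape input:

```python
def pack_logit_bias(logit_bias, padded_num_reqs: int, vocab_size: int) -> torch.Tensor:
    # logit_bias: list of Optional[dict[token_id, bias]]; runs on CPU, outside
    # the compiled graph, so the Python loop is not a tracing concern.
    bias = torch.zeros(padded_num_reqs, vocab_size)
    for i, per_req in enumerate(logit_bias):
        if per_req:
            ids = torch.tensor(list(per_req.keys()), dtype=torch.long)
            bias[i, ids] = torch.tensor(list(per_req.values()), dtype=torch.float32)
    return bias  # later added in-graph: logits = logits + bias.to(logits.device)
```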

Feedback Period.

No response

CC List.

@robertgshaw2-redhat @yaochengji @alexm-redhat @mgoin @bvrockwell @hyeygit @lsy323

Any Other Things.

UPDATE: this is very much related to #13360.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
NickLucche added the RFC label Apr 8, 2025
@yaochengji
Collaborator

Thanks @NickLucche for the RFC!

cc @Chenyaaang who is working on TPU structured decoding.

@yaochengji
Collaborator

we need to compile for different max_num_logprobs values

Is simple max padding enough for this case?

@yaochengji
Collaborator

each request is executed on a separate graph

Executing each request separately might not hurt that much, if there's no recompilation.

@NickLucche
Contributor Author

| Is simple max padding enough for this case?

For returning the prob of the sampled token, yes. But this is more about gathering the top-K logprobs. Either we do a topk with some fixed/maximum K on TPU to cut down the vocab dimension and then select the actual K requested on CPU, or we compile for multiple Ks.
I am fine with the former option.
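
A minimal sketch of the former option (MAX_LOGPROBS is an assumed compile-time constant, not an existing vLLM symbol): the device graph always computes the same K so its output shape stays static, and each request's actual k is applied on the host.

```python
import torch

MAX_LOGPROBS = 20  # assumed fixed at compile time

def device_top_logprobs(logits: torch.Tensor):
    # Runs inside the compiled graph; logits: [num_reqs, vocab].
    logprobs = torch.log_softmax(logits, dim=-1)
    # Always the same K, so the output shape is static: [num_reqs, MAX_LOGPROBS].
    return logprobs.topk(MAX_LOGPROBS, dim=-1)

def host_select(values, indices, requested_ks):
    # Runs on CPU after the device-to-host transfer; per-request k is just a slice.
    return [(values[i, :k], indices[i, :k]) for i, k in enumerate(requested_ks)]
```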

| Executing each request separately might not hurt that much, if there's no recompilation.

I think we'd pay the overhead of launching hundreds (num_seqs) of graphs, plus gathering results, and we'd lose arithmetic intensity on all those ops that execute on BxV shapes (now B times 1xV). Although this last point might be ~invalid if dispatching is optimized on TPU.
Would it still be viable?

@yaochengji
Collaborator

yaochengji commented Apr 9, 2025

I think we'd pay the overhead of launching hundreds (num_seqs) of graphs, plus gathering results, and we'd lose arithmetic intensity on all those ops

It depends. We have a benchmark of the performance of a transformer-like model executed op by op, and we can still get ~40% of the performance of executing it as an entire graph. As long as the sampler is not a performance bottleneck, flexibility is more important than performance.

BxV

Why are there two dimensions? I think the dimension we might iterate over should be the num_reqs dimension.

@NickLucche
Contributor Author

Why are there two dimensions? I think the dimension we might iterate over should be the num_reqs dimension.

Yes, bad naming on my side, sorry. Currently we have a single num_reqs x V tensor. We would need to iterate over the first dim and launch num_reqs graphs (see the toy sketch below).
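
For reference, a toy sketch of the difference (sampler_fn and sampler_fns are placeholders): the batched path runs one graph over the whole num_reqs x V tensor, while the per-request path would slice the first dim and launch num_reqs graphs.

```python
import torch

def sample_batched(sampler_fn, logits: torch.Tensor) -> torch.Tensor:
    # One compiled graph over the full [num_reqs, vocab] tensor.
    return sampler_fn(logits)

def sample_per_request(sampler_fns, logits: torch.Tensor) -> torch.Tensor:
    # One (pre-compiled, per-request-params) graph per [1, vocab] slice; the
    # dispatch overhead and lost arithmetic intensity are the concern above.
    return torch.cat([fn(logits[i:i + 1]) for i, fn in enumerate(sampler_fns)])
```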
