
2025/03/10/sampling #4

Open
utterances-bot opened this issue Mar 11, 2025 · 2 comments

Comments

utterances-bot commented Mar 11, 2025

Sorting-Free GPU Kernels for LLM Sampling | FlashInfer

Background

https://flashinfer.ai/2025/03/10/sampling.html

@platypus1989 commented

Thanks a lot for the great introduction; this is super helpful! A quick question about the curve of sampling latency scaling with batch size: why is there a bump in latency from around 130 to 140 for PyTorch?

yzh119 (Collaborator) commented Apr 4, 2025

Hi @platypus1989, it might be because of a change in kernel choice (e.g. the grid size bumping from 128 to 256) at different batch sizes. Maybe @xslingcn can provide the raw trace file so we can check the kernel configuration under different batch sizes.
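As a minimal sketch of what that check could look like: if the profiler trace is exported as a Chrome-trace JSON file (e.g. via `torch.profiler`'s `export_chrome_trace`), the launch geometry of each kernel event is recorded in its `args`, so the grid/block configurations used at each batch size can be grouped per kernel name. The `kernel_configs` helper below is hypothetical, not part of FlashInfer or PyTorch; the `"grid"`/`"block"` field names assume the Chrome-trace format that `torch.profiler` emits.

```python
import json
from collections import defaultdict

def kernel_configs(trace_path):
    """Group kernel launches in a Chrome-trace JSON export by kernel name,
    collecting the (grid, block) configurations seen for each.

    A change in the set of configurations between two traces (e.g. taken at
    batch size 130 vs. 140) would point at a kernel-choice change as the
    cause of a latency bump.
    """
    with open(trace_path) as f:
        trace = json.load(f)
    configs = defaultdict(set)
    for ev in trace.get("traceEvents", []):
        args = ev.get("args") or {}
        # Kernel launch events carry their geometry in args["grid"]/["block"].
        if "grid" in args and "block" in args:
            configs[ev.get("name", "?")].add(
                (tuple(args["grid"]), tuple(args["block"]))
            )
    return configs
```

Running this over one trace per batch size and diffing the resulting sets would show whether the 130-to-140 bump coincides with a different grid size being selected.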

3 participants