
2025/03/10/sampling #4

Open
utterances-bot opened this issue Mar 11, 2025 · 2 comments

Comments

utterances-bot commented Mar 11, 2025

Sorting-Free GPU Kernels for LLM Sampling | FlashInfer

Background

https://flashinfer.ai/2025/03/10/sampling.html

@platypus1989 commented

Thanks a lot for the great introduction; this is super helpful! A quick question about the curve of sampling latency scaling with batch size: why is there a bump in latency from around 130 to 140 for PyTorch?

yzh119 (Collaborator) commented Apr 4, 2025

Hi @platypus1989, it might be because of a change in kernel choice (e.g. the grid size bumping from 128 to 256) at different batch sizes. Maybe @xslingcn can provide the raw trace file so we can check the kernel configuration under different batch sizes.
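As a minimal sketch of what that check could look like: if the profiler trace is exported as a Chrome-trace JSON file (e.g. via `torch.profiler`'s `export_chrome_trace`), the launch geometry of each kernel event is recorded in its `args`, so the grid/block configurations used at each batch size can be grouped per kernel name. The `kernel_configs` helper below is hypothetical, not part of FlashInfer or PyTorch; the `"grid"`/`"block"` field names assume the Chrome-trace format that `torch.profiler` emits.

```python
import json
from collections import defaultdict

def kernel_configs(trace_path):
    """Group kernel launches in a Chrome-trace JSON export by kernel name,
    collecting the (grid, block) configurations seen for each.

    A change in the set of configurations between two traces (e.g. taken at
    batch size 130 vs. 140) would point at a kernel-choice change as the
    cause of a latency bump.
    """
    with open(trace_path) as f:
        trace = json.load(f)
    configs = defaultdict(set)
    for ev in trace.get("traceEvents", []):
        args = ev.get("args") or {}
        # Kernel launch events carry their geometry in args["grid"]/["block"].
        if "grid" in args and "block" in args:
            configs[ev.get("name", "?")].add(
                (tuple(args["grid"]), tuple(args["block"]))
            )
    return configs
```

Running this over one trace per batch size and diffing the resulting sets would show whether the 130-to-140 bump coincides with a different grid size being selected.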

3 participants