perf: refactor fa2 prefill template #776

Merged
merged 12 commits into from
Feb 4, 2025

yzh119
Collaborator

@yzh119 yzh119 commented Feb 3, 2025

This PR refactors the FA2-based prefill template, including the following changes:

  1. Using KernelTraits for all constexpr values and data types.
  2. Using a SharedStorage class for a cleaner shared-memory management interface.
  3. Unlocking CTA_TILE_Q=32.
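
As a rough illustration of the first two items, the sketch below bundles tile sizes and data types into one traits struct and derives the shared-memory layout from it. This is not the code from this PR: the member names, the `SmemBytes` helper, the tile sizes in `Traits32`, and the use of `uint16_t` as a host-side stand-in for a 16-bit float type are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical traits bundle: every compile-time constant and data type
// the kernel needs lives in one place. Names are illustrative only.
template <uint32_t CTA_TILE_Q_, uint32_t CTA_TILE_KV_, uint32_t HEAD_DIM_,
          typename DTypeQ_, typename DTypeKV_>
struct KernelTraits {
  using DTypeQ = DTypeQ_;
  using DTypeKV = DTypeKV_;
  static constexpr uint32_t CTA_TILE_Q = CTA_TILE_Q_;
  static constexpr uint32_t CTA_TILE_KV = CTA_TILE_KV_;
  static constexpr uint32_t HEAD_DIM = HEAD_DIM_;
};

// Hypothetical shared-memory layout derived from the traits. In a real
// kernel, a struct like this would typically be overlaid on the dynamic
// shared-memory buffer instead of computing raw byte offsets by hand.
template <typename KTraits>
struct SharedStorage {
  alignas(16) typename KTraits::DTypeQ q_smem[KTraits::CTA_TILE_Q * KTraits::HEAD_DIM];
  alignas(16) typename KTraits::DTypeKV k_smem[KTraits::CTA_TILE_KV * KTraits::HEAD_DIM];
  alignas(16) typename KTraits::DTypeKV v_smem[KTraits::CTA_TILE_KV * KTraits::HEAD_DIM];
};

// Shared-memory footprint in bytes, as it would be passed at kernel launch.
template <typename KTraits>
constexpr std::size_t SmemBytes() {
  return sizeof(SharedStorage<KTraits>);
}

// Example instantiation with CTA_TILE_Q = 32; the other sizes are assumed.
// uint16_t stands in for a 16-bit float type so this compiles on the host.
using Traits32 = KernelTraits<32, 64, 128, uint16_t, uint16_t>;
```

With this kind of layout, supporting a new tile size such as CTA_TILE_Q=32 mostly amounts to adding a new traits instantiation, since the buffer sizes follow from the template arguments.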

We also tried CTA_TILE_Q=8 (the half-mma optimization for GQA decoding with a low group ratio, <=8). However, the performance improvement was marginal (<1%) and it complicated the codebase, so we did not incorporate it in this PR.

@yzh119 yzh119 changed the title from "perf: refactor fa2 template to unlock half-mma" to "perf: refactor fa2 prefill template" Feb 4, 2025
@yzh119 yzh119 merged commit fc03772 into main Feb 4, 2025
@zhyncs zhyncs deleted the half-mma branch February 4, 2025 09:14
abcdabcd987 added a commit to abcdabcd987/flashinfer that referenced this pull request Feb 4, 2025
yzh119 pushed a commit that referenced this pull request Feb 4, 2025
#776 added CTA_TILE_Q=32 but it produces incorrect results.
yzh119 added a commit that referenced this pull request Feb 5, 2025
We originally kept `group_size` outside of the params mainly because we
observed better performance. With recent refactors such as #748 and #776,
there is no longer a need to decouple `group_size` from the rest of the
parameters, so this PR merges `group_size` back into the parameter class.
1 participant