This PR refactors the FA2-based prefill template, including the
following changes:
1. Using KernelTraits for all constexpr values and data types.
2. Using a SharedStorage class to give shared memory management a clean
interface.
3. Unlock `CTA_TILE_Q=32`.
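
To illustrate the first two items, here is a minimal host-side sketch of how
a traits struct and a shared-storage struct can fit together. It is not the
code from this PR: the member names, the `IsValid` rule, the buffer layout,
and the `SmemBytes` helper are all hypothetical, chosen only to show the
pattern.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical traits struct: bundles tile sizes and data types in one place
// so that kernels take a single template parameter instead of many.
template <uint32_t CTA_TILE_Q_, uint32_t CTA_TILE_KV_, uint32_t HEAD_DIM_,
          typename DTypeQ_, typename DTypeKV_>
struct KernelTraits {
  using DTypeQ = DTypeQ_;
  using DTypeKV = DTypeKV_;
  static constexpr uint32_t CTA_TILE_Q = CTA_TILE_Q_;
  static constexpr uint32_t CTA_TILE_KV = CTA_TILE_KV_;
  static constexpr uint32_t HEAD_DIM = HEAD_DIM_;

  // Assumption for this sketch: a full MMA tile covers 16 query rows, so
  // valid configurations use a multiple of 16 (e.g. 16, 32, 64, 128).
  static constexpr bool IsValid() {
    return CTA_TILE_Q % 16 == 0 && CTA_TILE_KV % 16 == 0 && HEAD_DIM % 16 == 0;
  }
};

// Hypothetical shared-storage struct: its layout is derived entirely from the
// traits, so a kernel can reinterpret its dynamic shared memory as this type
// rather than computing byte offsets by hand.
template <typename KTraits>
struct SharedStorage {
  alignas(16) typename KTraits::DTypeQ q_smem[KTraits::CTA_TILE_Q * KTraits::HEAD_DIM];
  alignas(16) typename KTraits::DTypeKV k_smem[KTraits::CTA_TILE_KV * KTraits::HEAD_DIM];
  alignas(16) typename KTraits::DTypeKV v_smem[KTraits::CTA_TILE_KV * KTraits::HEAD_DIM];
};

// Shared memory a launch would need to request for a given configuration.
template <typename KTraits>
constexpr size_t SmemBytes() {
  return sizeof(SharedStorage<KTraits>);
}

// Example configuration with CTA_TILE_Q = 32; uint16_t stands in for a
// 16-bit floating-point type so the sketch compiles without CUDA headers.
using Traits32 = KernelTraits<32, 64, 128, uint16_t, uint16_t>;
// A CTA_TILE_Q = 8 configuration, rejected by the assumed IsValid rule.
using Traits8 = KernelTraits<8, 64, 128, uint16_t, uint16_t>;
```

Under these assumptions, adding a new tile size only requires a new
`KernelTraits` instantiation, and the shared memory footprint follows from
`sizeof` rather than from a separately maintained formula.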
We also tried `CTA_TILE_Q=8`, the half-mma optimization for GQA decoding
with a low group ratio (<=8). However, the performance improvement was
marginal (<1%) and it complicated the codebase, so we did not
incorporate this feature in this PR.