Releases: flashinfer-ai/flashinfer
v0.2.5
What's Changed
- Fix compilation with FP16_QK_REDUCTION enabled. by @diptorupd in #962
- misc: Use environment variable to control JIT verbose flag by @yzh119 in #981
- Triton `rms_norm` kernels by @nandor in #983
- Allow passing workspace base directory via environment variable by @jsuchome in #973
- [CHORE] Rename `output_emitted_token_num` -> `output_emitted_draft_token_num` by @jon-chuang in #977
- ci: switch to on-demand instances if spot instance is interrupted by @yzh119 in #987
- misc: update devcontainer by @yzh119 in #986
- ci: add torch 2.6+cu126 wheel by @yzh119 in #985
- misc: fix devcontainer conda path by @yzh119 in #989
- perf: prefetch page indices for mla kernel by @yzh119 in #991
- SM-constraint-GEMM by triton persistent kernel by @yyihuang in #982
- 3rdparty: upgrade cutlass to 3.9 by @yzh119 in #997
- perf: add `-DNDEBUG` compilation flag by @yzh119 in #998
- release: bump version to v0.2.5 by @yzh119 in #999
New Contributors
- @jsuchome made their first contribution in #973
- @jon-chuang made their first contribution in #977
- @yyihuang made their first contribution in #982
Full Changelog: v0.2.4...v0.2.5
v0.2.4
What's Changed
- typo: fix pdl terminology by @yzh119 in #933
- Fix "specutate" typo by @markmc in #934
- typo: fix target_probs docs after uniform_samples removal by @markmc in #935
- typo: remove another uniform samples leftover by @markmc in #937
- Fix/precommit issues by @diptorupd in #931
- ci: setup Jenkins by @yzh119 in #874
- bugfix: fix include header name conflict by @yzh119 in #939
- fix: Fix MLA TVM binding for the latest changes by @MasterJH5574 in #940
- feat - support mla kvcache store by @baowendin in #888
- Add POD-Attention to FlashInfer by @AKKamath in #858
- bugfix: fix potential issues of FA3 template loading nans for PageAttention by @yzh119 in #945
- fix - fix bug when an irrelevant seq has NaN data by @baowendin in #942
- misc: add ci-badge, update blog list by @yzh119 in #948
- bugfix: Fix missing PyModuleDef field initializers by @sampan26 in #946
- fix: fix pod-attention compilation time by @yzh119 in #954
- bugfix: bugfix to #949 by @yzh119 in #951
- misc: Temporarily disable POD from AOT wheels by @abcdabcd987 in #956
- ci: improve jenkins by @yzh119 in #943
- Fix compilation on cuda 12.2 by @goliaro in #961
- doc: remove misleading docstring about `non_blocking` by @yzh119 in #966
- perf: reduce torch.library dispatch overhead by @yzh119 in #968
- [TVM] Added tvm binding for sampling kernel by @annanyapr in #958
- perf: Fix python API overhead when CUDAGraph is not enabled by @yzh119 in #969
- Fix POD JIT bugs by @AKKamath in #971
- benchmark: add sampling.renorm benchmarks by @xslingcn in #970
- perf: dual pivot top-p/top-k renorm by @xslingcn in #974
- perf: Use 2WG pipeline design for MLA implementation on Hopper by @yzh119 in #952
- release: bump version to v0.2.4 by @yzh119 in #980
New Contributors
- @markmc made their first contribution in #934
- @diptorupd made their first contribution in #931
- @AKKamath made their first contribution in #858
- @sampan26 made their first contribution in #946
- @goliaro made their first contribution in #961
- @annanyapr made their first contribution in #958
Full Changelog: v0.2.3...v0.2.4
v0.2.3
Breaking Changes
We changed the interface of the sampling APIs (see #912):
- All sampling APIs no longer return a `success` value; this is not compatible with the earlier design.
- Instead of taking a `uniform` tensor, the sampling APIs now accept an optional `torch.Generator` (https://pytorch.org/docs/stable/generated/torch.Generator.html), aligning with the behavior of torch. A hedged migration sketch is shown below.
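For illustration only, here is a minimal sketch of how a call site might migrate, using `top_p_sampling_from_probs` as the example; the exact keyword names (`top_p`, `generator`) are assumptions here, so consult the sampling docs of the release you use:

```python
import torch
import flashinfer

# Normalized probabilities over a 32K vocabulary for a batch of 4 requests.
probs = torch.rand(4, 32000, device="cuda")
probs = probs / probs.sum(dim=-1, keepdim=True)

# Before v0.2.3 (old interface, sketch): callers pre-drew uniform random numbers
# and received a per-request `success` flag alongside the samples.
#   uniform_samples = torch.rand(4, device="cuda")
#   samples, success = flashinfer.sampling.top_p_sampling_from_probs(
#       probs, uniform_samples, top_p=0.9)

# From v0.2.3 on (sketch): no `success` return value; randomness is controlled by
# an optional torch.Generator, matching torch's own sampling APIs.
gen = torch.Generator(device="cuda")
gen.manual_seed(0)
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9, generator=gen)
```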
What's Changed
- release: bump version v0.2.2.post1 by @yzh119 in #902
- Naive Support for Hopper FP8 Prefill Kernel with Per-Head Quantization by @happierpig in #869
- bugfix: Fix no return type error by @yzh119 in #904
- ci: add dockerfile for CI by @yzh119 in #909
- ci: bugfix on release-ci-docker github action by @yzh119 in #910
- feat: flashinfer intra-kernel profiler by @yzh119 in #913
- [Package] Add tvm binding to `flashinfer.data` when packaging by @MasterJH5574 in #917
- refactor: move triton dependency to flashinfer.triton by @yzh119 in #918
- sampling: dual pivot rejection sampling algorithm to improve top-p/top-k sampling efficiency by @yzh119 in #912
- feat: support non-contiguous input/output in normalization functions by @yzh119 in #921
- feat: improve sampling algorithm robustness by @yzh119 in #923
- perf: use max probability instead of 1 as upper bound in top-p/k sampling by @yzh119 in #925
- fix: add install step of profiler's dependency by @zobinHuang in #929
- fix: undefined symbol cudaGetDriverEntryPointByVersion with CUDA >= 12.5 by @zobinHuang in #928
- feat: experimental support of PDL by @yzh119 in #930
- release: bump version to v0.2.3 by @yzh119 in #932
New Contributors
- @happierpig made their first contribution in #869
- @zobinHuang made their first contribution in #929
Full Changelog: v0.2.2.post1...v0.2.3
v0.2.2.post1
What's Changed
- bump version to v0.2.2 by @yzh119 in #891
- perf: fix the performance of second stage of split-k by @yzh119 in #894
- fix: pin_memory use cpu as default device by @KnowingNothing in #895
- perf: tweak register amount for producer/consumer in MLA template by @yzh119 in #896
- perf: fix MLA split-k performance bug by @yzh119 in #898
- perf: use f16 as split-k partial output data type by @yzh119 in #900
- perf: tweak the pipeline design of mla kernel by @yzh119 in #901
Full Changelog: v0.2.2...v0.2.2.post1
v0.2.2
What's Changed
- fix cu121 torch2.6 by @zhyncs in #867
- unittest: add MLA test cases where kv_len is evenly divided by page_size. by @foreverlms in #861
- bugfix: fix the behavior of MLA kernel when kv-length is 0 by @yzh119 in #868
- Merge of previous typo-fix PRs into a single one by @didier-durand in #862
- add lightllm adoption by @zhyncs in #871
- fix generate_dispatch_inc args from parser by @baowendin in #870
- [API] Fix top_k_top_p_sampling_from_logits param typo by @kasohrab in #875
- misc: Remove unused k_smem_offset_w update in MLA kernel by @muoshuosha in #878
- JIT compilation support for TVM by @MasterJH5574 in #880
- [Hotfix] Add flashinfer.jit.attention into packages by @zhouye in #881
- perf: FlashAttention-3 style MLA PageAttention by @yzh119 in #887
- [JIT] Fix MLA header in TVM binding by @MasterJH5574 in #889
- Fixing several typos in doc file kv_layout.rst by @didier-durand in #884
- unittest: add unittests for MLA + cudagraph by @yzh119 in #890
New Contributors
- @baowendin made their first contribution in #870
- @kasohrab made their first contribution in #875
- @zhouye made their first contribution in #881
Full Changelog: v0.2.1.post2...v0.2.2
v0.2.1.post2
What's Changed
- use 3 latest pytorch version by @youkaichao in #835
- docs: update installation by @zhyncs in #839
- Update README.md: fixing a typo for "hierical" by @didier-durand in #836
- Update page.rst: fixing 1 typo by @didier-durand in #841
- Update README.md: fixing 1 typo by @didier-durand in #842
- adds TensorRT-LLM to the list of projects adopting FlashInfer by @yzh119 in #843
- perf: MLA decode kernel implemented by CuTe targeted to SM80 by @tsu-bin in #844
- Update installation.rst: fixing 2 typos by @didier-durand in #840
- fix: Pass backend in BatchPrefillWith*KVCacheWrapper.plan() by @sfc-gh-yewang in #808
- bugfix: Fix inline RoPE in decode kernels by @MasterJH5574 in #847
- misc: Remove duplicate param set in MLA kernel by @MasterJH5574 in #850
- feat: adding `out` and `lse` parameters to `run` functions to allow user-allocated output buffers by @yzh119 in #854
- Unique the symbol of maybe_q_rope_offset_v. by @foreverlms in #855
- typo: update `decode_maybe_q_rope_offset` by @MasterJH5574 in #856
- update ci by @zhyncs in #857
- fix some compiler pre-check. by @foreverlms in #859
- perf: dynamic split-k for MLA by @yzh119 in #863
- Revert "fix: Pass backend in BatchPrefillWith*KVCacheWrapper.plan() (β¦ by @zhyncs in #864
- chore: bump v0.2.1.post2 by @zhyncs in #865
- fix compile by @zhyncs in #866
New Contributors
- @didier-durand made their first contribution in #836
- @sfc-gh-yewang made their first contribution in #808
- @foreverlms made their first contribution in #855
Full Changelog: v0.2.1.post1...v0.2.1.post2
v0.2.1.post1
What's Changed
- doc: Fix the incorrect DeepSeek-V3 paper link by @muoshuosha in #826
- bugfix: fix the signature of `CutlassSegmentGEMMSM90` by @yzh119 in #827
- redo ci: cross python wheel by @youkaichao in #824
- bugfix: Another bugfix for torch.library by @yzh119 in #828
- misc: fix parameters name by @Chen-0210 in #817
- bugfix: update `clear_cache_dir` in JIT by @yzh119 in #829
- update release wheel by @zhyncs in #830
- chore: bump v0.2.1.post1 by @zhyncs in #831
- fix #824 by @zhyncs in #832
- fix release wheel by @zhyncs in #833
- set pip path by @zhyncs in #834
New Contributors
- @muoshuosha made their first contribution in #826
- @Chen-0210 made their first contribution in #817
Full Changelog: v0.2.1...v0.2.1.post1
v0.2.1
What's Changed
- misc: addressing the package renaming issues by @yzh119 in #770
- feat: support deepseek prefill attention shape by @yzh119 in #765
- refactor: change the structure of attention updater by @yzh119 in #772
- hotfix: follow up of #772 by @yzh119 in #773
- bugfix: Ensure Loop Termination by Enforcing IEEE-754 Compliance in Sampling Kernels by @yzh119 in #774
- bugfix: fix the JIT warmup arguments in unittests by @yzh119 in #775
- ci: change whl folder to flashinfer-python by @abcdabcd987 in #779
- perf: refactor fa2 prefill template by @yzh119 in #776
- feat: Separate QK/VO head dim dispatch for sm90 AOT by @abcdabcd987 in #778
- bugfix: fix batch prefill attention kernel unittests by @yzh119 in #781
- misc: remove head dimension 64 from AOT by @yzh119 in #782
- misc: allow head_dim=64 for sm90 AOT by @abcdabcd987 in #783
- bugfix: drop CTA_TILE_Q=32 by @abcdabcd987 in #785
- refactor: make `group_size` a part of params by @yzh119 in #786
- bugfix: MLA decode should multiply sm_scale by math::log2e by @tsu-bin in #787
- fix rope logic in mla decoding by @zhyncs in #793
- Fix arguments of `plan` for split QK/VO head dims by @abmfy in #795
- test: add unittest comparing deepseek prefill fa2 & 3 implementation by @yzh119 in #797
- bugfix: fix aot build not compatible with cmake command by @tsu-bin in #796
- Fix the type annotation of q_dtype and kv_dtype on ragged prefill by @nandor in #798
- feat: support f32 attention output in FA2 template by @yzh119 in #799
- feat: apply sm_scale at logits instead of q in FA2 template by @yzh119 in #801
- bugfix: mla decode failed under cuda graph mode, and update test case by @tsu-bin in #803
- perf: memory efficient deepseek mla fused page-attention kernel by @yzh119 in #804
- bugfix: mla page-attention kernel for different page sizes by @yzh119 in #810
- doc: add documentation to new MLA interface by @yzh119 in #811
- feat: unlocking MLA for A100 by @yzh119 in #812
- feat: cudagraph-compatible MLA API by @yzh119 in #813
- feat: unlock MLA attention for sm89 (L40/L40s/4090) by @yzh119 in #814
- misc: fix sphinx by @abcdabcd987 in #815
- bugfix: fix the behavior of mla plan function when provided with host tensors by @yzh119 in #816
- doc: improve mla related documentation by @yzh119 in #818
- release: bump version to v0.2.1 by @yzh119 in #819
- refactor: change to TORCH_LIBRARY by @youkaichao in #764
- Revert "refactor: change to TORCH_LIBRARY" by @yzh119 in #820
- bugfix: bugfix on sm89 MLA by @yzh119 in #821
- hotfix: bugfix on #812 by @yzh119 in #822
- refactor: change to TORCH_LIBRARY by @abmfy in #823
New Contributors
Full Changelog: v0.2.0.post2...v0.2.1
v0.2.0.post2
What's Changed
- ci: fix the update_whl_index script to recognize version numbers with "post" and add torch2.5 by @yzh119 in #694
- bugfix: casting int array to int32 for rope input arguments by @yzh119 in #697
- bugfix: only use sm90 group gemm when torch cuda >= 12.3 by @yzh119 in #699
- misc: remove release-please workflow by @yzh119 in #705
- Customizable SM90 prefill kernels. by @hyhieu in #704
- hotfix: revert torch.library register by @yzh119 in #709
- Improve compatibility with pytorch 2.5 by @zifeitong in #711
- misc: add bibtex reference by @yzh119 in #712
- sampling: simplify min-p sampling by @yzh119 in #713
- perf: fix the iteration bound of SWA in FA2 prefill template by @yzh119 in #714
- bugfix: fix min-p AOT compilation in #713 by @yzh119 in #717
- Triton implementation of `silu_and_mul` by @nandor in #716
- bugfix: FusedAddRMSNorm kernels might require more than 48KB shared memory when d is large. by @bobboli in #718
- bugfix: Choose sm90 kernels only for Hopper GPUs. by @bobboli in #719
- Finer-grained control over fp16/fp8 builds by @nandor in #722
- Align KV chunk size binary search with actual KV chunk splitting. by @timzsu in #728
- ci: rename python package name to `flashinfer-python` by @yzh119 in #729
- Add a note about int32/int64 datatypes to the `kv_layout` tutorial by @fergusfinn in #737
- fix return type of cuBLAS by @zhyncs in #749
- [Refactor] Unify JIT/Customization/AOT mode by @yzh119 in #748
- Move allocations out of torch ops by @nandor in #740
- [Lint] Fix some linting issues and provide automatic format check script by @LeiWang1999 in #743
- Filter out unsupported head dim for sm90 by @abcdabcd987 in #751
- bugfix: various AOT issues by @abcdabcd987 in #752
- [bugfix] Fix cpp tests/benchmarks by @yzh119 in #753
- fix pin memory device by @youkaichao in #755
- Add dev container for easier development by @ByronHsu in #680
- hotfix: bugfix to #756 by @yzh119 in #757
- Change `apply_rope_with_cos_sin_cache` to accept `cos_sin_cache` by @ByronHsu in #754
- fix: match statement not supported in Python 3.8 by @xslingcn in #759
- bugfix: use actual sm count for num_sm90_ctas by @LLLLKKKK in #762
- bugfix: Fix block-sparse attention API by @yzh119 in #767
- Version bump: v0.2.0.post2 by @yzh119 in #768
New Contributors
- @hyhieu made their first contribution in #704
- @zifeitong made their first contribution in #711
- @bobboli made their first contribution in #718
- @timzsu made their first contribution in #728
- @fergusfinn made their first contribution in #737
- @LeiWang1999 made their first contribution in #743
- @youkaichao made their first contribution in #755
- @LLLLKKKK made their first contribution in #762
Full Changelog: v0.2.0.post1...v0.2.0.post2