Releases: flashinfer-ai/flashinfer

v0.2.5

04 Apr 00:41
592b110

What's Changed

  • Fix compilation with FP16_QK_REDUCTION enabled. by @diptorupd in #962
  • misc: Use environment variable to control JIT verbose flag by @yzh119 in #981
  • Triton rms_norm kernels by @nandor in #983
  • Allow passing workspace base directory via environment variable by @jsuchome in #973
  • [CHORE] Rename output_emitted_token_num -> output_emitted_draft_token_num by @jon-chuang in #977
  • ci: switch to on-demand instances if spot instance is interrupted by @yzh119 in #987
  • misc: update devcontainer by @yzh119 in #986
  • ci: add torch 2.6+cu126 wheel by @yzh119 in #985
  • misc: fix devcontainer conda path by @yzh119 in #989
  • perf: prefetch page indices for mla kernel by @yzh119 in #991
  • SM-constrained GEMM via Triton persistent kernel by @yyihuang in #982
  • 3rdparty: upgrade cutlass to 3.9 by @yzh119 in #997
  • perf: add -DNDEBUG compilation flag by @yzh119 in #998
  • release: bump version to v0.2.5 by @yzh119 in #999

Full Changelog: v0.2.4...v0.2.5

v0.2.4

29 Mar 05:09
bc81a59

What's Changed

Full Changelog: v0.2.3...v0.2.4

v0.2.3

11 Mar 02:22
fdedc43

Breaking Changes

We changed the interface of the sampling APIs (see #912):

  • All sampling APIs no longer return a success value; this is not compatible with the earlier design.
  • Instead of passing a uniform tensor, the sampling interface now accepts an optional torch.Generator (https://pytorch.org/docs/stable/generated/torch.Generator.html), aligning with the behavior of torch. A minimal sketch of the new calling convention follows below.
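
A minimal sketch of the new-style call, assuming the flashinfer.sampling.top_p_sampling_from_probs entry point and its optional generator keyword (the exact post-#912 signature may differ in your installed version):

```python
import torch
import flashinfer

# Sketch of the v0.2.3+ sampling interface (see #912). The keyword arguments
# below are best-effort assumptions; check the flashinfer docs for the exact
# signature of your installed version.
probs = torch.rand(4, 32000, device="cuda")
probs = probs / probs.sum(dim=-1, keepdim=True)  # normalize rows to probability distributions

gen = torch.Generator(device="cuda")
gen.manual_seed(42)  # reproducibility now comes from torch.Generator, not a uniform tensor

# Pre-v0.2.3 (roughly): samples, success = top_p_sampling_from_probs(probs, uniform_samples, 0.9)
# v0.2.3 and later: a single tensor of sampled token ids, no success flag.
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, 0.9, generator=gen)
print(samples.shape)  # torch.Size([4])
```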

What's Changed

  • release: bump version v0.2.2.post1 by @yzh119 in #902
  • Naive Support for Hopper FP8 Prefill Kernel with Per-Head Quantization by @happierpig in #869
  • bugfix: Fix no return type error by @yzh119 in #904
  • ci: add dockerfile for CI by @yzh119 in #909
  • ci: bugfix on release-ci-docker github action by @yzh119 in #910
  • feat: flashinfer intra-kernel profiler by @yzh119 in #913
  • [Package] Add tvm binding to flashinfer.data when packaging by @MasterJH5574 in #917
  • refactor: move triton dependency to flashinfer.triton by @yzh119 in #918
  • sampling: dual pivot rejection sampling algorithm to improve top-p/top-k sampling efficiency by @yzh119 in #912
  • feat: support non-contiguous input/output in normalization functions by @yzh119 in #921
  • feat: improve sampling algorithm robustness by @yzh119 in #923
  • perf: use max probability instead of 1 as upper bound in top-p/k sampling by @yzh119 in #925
  • fix: add install step of profiler's dependency by @zobinHuang in #929
  • fix: undefined symbol cudaGetDriverEntryPointByVersion with CUDA >= 12.5 by @zobinHuang in #928
  • feat: experimental support of PDL by @yzh119 in #930
  • release: bump version to v0.2.3 by @yzh119 in #932

Full Changelog: v0.2.2.post1...v0.2.3

v0.2.2.post1

27 Feb 06:00

What's Changed

  • bump version to v0.2.2 by @yzh119 in #891
  • perf: fix the performance of second stage of split-k by @yzh119 in #894
  • fix: pin_memory use cpu as default device by @KnowingNothing in #895
  • perf: tweak register amount for producer/consumer in MLA template by @yzh119 in #896
  • perf: fix MLA split-k performance bug by @yzh119 in #898
  • perf: use f16 as split-k partial output data type by @yzh119 in #900
  • perf: tweak the pipeline design of mla kernel by @yzh119 in #901

Full Changelog: v0.2.2...v0.2.2.post1

v0.2.2

23 Feb 22:28

What's Changed

Full Changelog: v0.2.1.post2...v0.2.2

v0.2.1.post2

17 Feb 18:05
8127793

What's Changed

Full Changelog: v0.2.1.post1...v0.2.1.post2

v0.2.1.post1

13 Feb 23:13
6805c64

What's Changed

Full Changelog: v0.2.1...v0.2.1.post1

v0.2.1

13 Feb 08:17
dbb1e4e

What's Changed

  • misc: addressing the package renaming issues by @yzh119 in #770
  • feat: support deepseek prefill attention shape by @yzh119 in #765
  • refactor: change the structure of attention updater by @yzh119 in #772
  • hotfix: follow up of #772 by @yzh119 in #773
  • bugfix: Ensure Loop Termination by Enforcing IEEE-754 Compliance in Sampling Kernels by @yzh119 in #774
  • bugfix: fix the JIT warmup arguments in unittests by @yzh119 in #775
  • ci: change whl folder to flashinfer-python by @abcdabcd987 in #779
  • perf: refactor fa2 prefill template by @yzh119 in #776
  • feat: Separate QK/VO head dim dispatch for sm90 AOT by @abcdabcd987 in #778
  • bugfix: fix batch prefill attention kernel unittests by @yzh119 in #781
  • misc: remove head dimension 64 from AOT by @yzh119 in #782
  • misc: allow head_dim=64 for sm90 AOT by @abcdabcd987 in #783
  • bugfix: drop CTA_TILE_Q=32 by @abcdabcd987 in #785
  • refactor: make group_size a part of params by @yzh119 in #786
  • bugfix: MLA decode should multiply sm_scale by math::log2e by @tsu-bin in #787
  • fix rope logic in mla decoding by @zhyncs in #793
  • Fix arguments of plan for split QK/VO head dims by @abmfy in #795
  • test: add unittest comparing deepseek prefill fa2 & 3 implementation by @yzh119 in #797
  • bugfix: fix aot build not compatible with cmake command by @tsu-bin in #796
  • Fix the type annotation of q_dtype and kv_dtype on ragged prefill by @nandor in #798
  • feat: support f32 attention output in FA2 template by @yzh119 in #799
  • feat: apply sm_scale at logits instead of q in FA2 template by @yzh119 in #801
  • bugfix: mla decode failed under cuda graph mode, and update test case by @tsu-bin in #803
  • perf: memory efficient deepseek mla fused page-attention kernel by @yzh119 in #804
  • bugfix: mla page-attention kernel for different page sizes by @yzh119 in #810
  • doc: add documentation to new MLA interface by @yzh119 in #811
  • feat: unlocking MLA for A100 by @yzh119 in #812
  • feat: cudagraph-compatible MLA API by @yzh119 in #813
  • feat: unlock MLA attention for sm89 (L40/L40s/4090) by @yzh119 in #814
  • misc: fix sphinx by @abcdabcd987 in #815
  • bugfix: fix the behavior of mla plan function when provided with host tensors by @yzh119 in #816
  • doc: improve mla related documentation by @yzh119 in #818
  • release: bump version to v0.2.1 by @yzh119 in #819
  • refactor: change to TORCH_LIBRARY by @youkaichao in #764
  • Revert "refactor: change to TORCH_LIBRARY" by @yzh119 in #820
  • bugfix: bugfix on sm89 MLA by @yzh119 in #821
  • hotfix: bugfix on #812 by @yzh119 in #822
  • refactor: change to TORCH_LIBRARY by @abmfy in #823

Full Changelog: v0.2.0.post2...v0.2.1

v0.2.0.post2

31 Jan 19:49
200e954

What's Changed

  • ci: fix the update_whl_index script to recognize version numbers with "post" and add torch2.5 by @yzh119 in #694
  • bugfix: casting int array to int32 for rope input arguments by @yzh119 in #697
  • bugfix: only use sm90 group gemm when torch cuda >= 12.3 by @yzh119 in #699
  • misc: remove release-please workflow by @yzh119 in #705
  • Customizable SM90 prefill kernels. by @hyhieu in #704
  • hotfix: revert torch.library register by @yzh119 in #709
  • Improve compatibility with pytorch 2.5 by @zifeitong in #711
  • misc: add bibtex reference by @yzh119 in #712
  • sampling: simplify min-p sampling by @yzh119 in #713
  • perf: fix the iteration bound of SWA in FA2 prefill template by @yzh119 in #714
  • bugfix: fix min-p AOT compilation in #713 by @yzh119 in #717
  • Triton implementation of silu_and_mul by @nandor in #716
  • bugfix: FusedAddRMSNorm kernels might require more than 48KB shared memory when d is large. by @bobboli in #718
  • bugfix: Choose sm90 kernels only for Hopper GPUs. by @bobboli in #719
  • Finer-grained control over fp16/fp8 builds by @nandor in #722
  • Align KV chunk size binary search with actual KV chunk splitting. by @timzsu in #728
  • ci: rename python package name to flashinfer-python by @yzh119 in #729
  • Add a note about int32/int64 datatypes to the kv_layout tutorial by @fergusfinn in #737
  • fix return type of cuBLAS by @zhyncs in #749
  • [Refactor] Unify JIT/Customization/AOT mode by @yzh119 in #748
  • Move allocations out of torch ops by @nandor in #740
  • [Lint] Fix some linting issues and provide automatic format check script by @LeiWang1999 in #743
  • Filter out unsupported head dim for sm90 by @abcdabcd987 in #751
  • bugfix: various AOT issues by @abcdabcd987 in #752
  • [bugfix] Fix cpp tests/benchmarks by @yzh119 in #753
  • fix pin memory device by @youkaichao in #755
  • Add dev container for easier development by @ByronHsu in #680
  • hotfix: bugfix to #756 by @yzh119 in #757
  • Change apply_rope_with_cos_sin_cache to accept cos_sin_cache by @ByronHsu in #754
  • fix: match statement not supported in Python 3.8 by @xslingcn in #759
  • bugfix: use actual sm count for num_sm90_ctas by @LLLLKKKK in #762
  • bugfix: Fix block-sparse attention API by @yzh119 in #767
  • Version bump: v0.2.0.post2 by @yzh119 in #768

Full Changelog: v0.2.0.post1...v0.2.0.post2

v0.2.0.post1

23 Dec 00:49

0.2.0.post1 (2024-12-22)

Bug Fixes

  • bug fix on determine_attention_backend condition (#688) (bcf7a3e)
  • accelerate plan speed of fa3 template (#690) (db8f04d)