FLOOD, a throughput-oriented inference framework with pipeline parallelism and segmentable cache.
- [2025/03] We release the code of our inference framework FLOOD.
Flood is a high-throughput inference framework designed for offline applications. It employs pipeline parallelism (PP) to avoid the communication costs associated with tensor parallelism (TP), and it incorporates scheduling strategies tailored for offline inference to maximize GPU utilization.
Furthermore, Flood manages the kvcache with segmentable blocks instead of paged blocks, which keeps each request's kvcache more contiguous (a sketch of the idea follows the feature list below).
Additionally, we have developed an attention kernel, termed SegmentAttention, that operates on the segmentable kvcache. Flood currently supports a range of features, including:
- Zero-overhead continuous batching
- Chunked prefill
- Inference of quantized (FP8/INT8) models
- Inference of multi-modal models
- Streaming inference
- PPL (Perplexity) evaluation
- Sampling methods
- Multi-node inference (experimental)
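The segmentable-block idea can be illustrated with a small allocator sketch. This is not Flood's actual implementation; the `PagedAllocator` and `SegmentAllocator` classes below are hypothetical and only meant to show why a segmentable block keeps a request's kvcache contiguous, whereas a paged pool scatters it across fixed-size blocks.

```python
# Hypothetical sketch: paged vs. segmentable kvcache allocation.
# PagedAllocator / SegmentAllocator are illustrative names, not Flood's API.

class PagedAllocator:
    """Paged kvcache: a request receives many fixed-size blocks that may be
    scattered anywhere in the pool, so its cache is generally non-contiguous."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))

    def allocate(self, num_tokens):
        n = -(-num_tokens // self.block_size)               # ceil(num_tokens / block_size)
        return [self.free_blocks.pop() for _ in range(n)]   # scattered block ids


class SegmentAllocator:
    """Segmentable kvcache: a request receives one contiguous segment; free
    space is tracked as (start, length) segments that can be split on demand."""
    def __init__(self, total_tokens):
        self.free_segments = [(0, total_tokens)]

    def allocate(self, num_tokens):
        for i, (start, length) in enumerate(self.free_segments):
            if length >= num_tokens:
                # split the free segment: hand out a contiguous range, keep the rest
                self.free_segments[i] = (start + num_tokens, length - num_tokens)
                return (start, num_tokens)
        raise MemoryError("no contiguous segment large enough")


if __name__ == "__main__":
    print(PagedAllocator(256).allocate(40))    # e.g. [255, 254, 253] -> separate blocks
    print(SegmentAllocator(4096).allocate(40)) # (0, 40) -> one contiguous range
```

A contiguous segment lets an attention kernel read a request's keys and values with simple strided accesses instead of gathering them block by block, which is presumably what SegmentAttention exploits.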
Our framework is undergoing rapid iteration, which may result in some features having bugs. If you encounter any issues, please feel free to report them.
Currently supported models:
- Ling MoE
- Ling
- Llama
- Qwen
- Deepseek v1
Roadmap:
- Integrate our previous work LOOKAHEAD.
- Improve prefill performance with prefix caching.
- Improve performance with CUDA Graph.
- Support more models, including Deepseek R1, etc.
- Implement segment attention with CUTE for better performance, especially with FP8 kvcache.
Performance is measured in tokens per second (token/s) of generated tokens. The vLLM version is 0.6.6.post2; we enable chunked prefill with a chunk size of 2048, and the other parameters are left at their defaults. The model architecture of Ling can be found in the Ling technical report.
| model | dataset | GPU | vLLM (token/s) | flood (token/s) | speedup |
|---|---|---|---|---|---|
| Llama3-8B | shareGPT | 1 * A100 | 3201 | 4529 | 1.41 |
| Ling-Lite | shareGPT | 1 * H20 | 4355 | 5869 | 1.35 |
| Ling-Lite | shareGPT | 1 * A100 | 3576 | 5451 | 1.52 |
| Ling-Plus (FP8) | shareGPT | 8 * H20 | 2742 | 6569 | 2.40 |
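For context, the vLLM baseline above was run with chunked prefill enabled and a chunk size of 2048, with everything else left at defaults. A minimal sketch of such a configuration (the model path and prompt are placeholders, not the exact benchmark script) looks like:

```python
# Sketch of the vLLM 0.6.x baseline settings described above:
# chunked prefill enabled with a 2048-token chunk, other options at their defaults.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/Llama3-8B",       # placeholder model path
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,     # chunk size of 2048
)
outputs = llm.generate(
    ["An example prompt from shareGPT ..."],   # placeholder prompt
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```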
Performance is measured in TFLOPS (tera floating-point operations per second). The attention head number is 64, the kv head number is 8, and the kv head dimension is 128. We use flash_attn_2_cuda.varlen_fwd of flash-attn-2 on A100 and flash_attn_3_cuda.fwd of flash-attn-3 on H20.
More details can be found in benchmark/bench_seg_attn.py.
| Device | BatchSize | Q_len | K_len | flash-attn (TFLOPS) | seg-attn (TFLOPS) | speedup |
|---|---|---|---|---|---|---|
| A100 | 1 | 1024 | 1024 | 99.19 | 107.35 | 1.08 |
| A100 | 128 | 1 | 1024 | 10.65 | 13.56 | 1.27 |
| H20 | 1 | 1024 | 1024 | 90.28 | 96.05 | 1.06 |
| H20 | 128 | 1 | 1024 | 7.16 | 22.63 | 3.16 |
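The TFLOPS figures above can be approximated with the usual attention FLOP count of 4 * batch * query_heads * q_len * k_len * head_dim per forward pass. The sketch below times the decode-style case (BatchSize 128, Q_len 1) through the public flash_attn wrapper rather than the raw flash_attn_2_cuda.varlen_fwd binding; see benchmark/bench_seg_attn.py for the exact measurement.

```python
# Rough TFLOPS measurement for the flash-attn-2 baseline (public wrapper API).
# GQA setting from above: 64 query heads, 8 kv heads, head dimension 128.
import torch
from flash_attn import flash_attn_func   # requires flash-attn >= 2.6.3

B, QH, KH, D = 128, 64, 8, 128
Q_LEN, K_LEN = 1, 1024                    # decode-style row of the table

q = torch.randn(B, Q_LEN, QH, D, dtype=torch.float16, device="cuda")
k = torch.randn(B, K_LEN, KH, D, dtype=torch.float16, device="cuda")
v = torch.randn(B, K_LEN, KH, D, dtype=torch.float16, device="cuda")

for _ in range(10):                       # warmup
    flash_attn_func(q, k, v, causal=True)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    flash_attn_func(q, k, v, causal=True)
end.record()
torch.cuda.synchronize()
seconds = start.elapsed_time(end) / 1e3 / iters

flops = 4 * B * QH * Q_LEN * K_LEN * D    # 2 for Q @ K^T plus 2 for P @ V
print(f"{flops / seconds / 1e12:.2f} TFLOPS")
```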
- Clone this repository and navigate to the flood folder:

```bash
git clone https://github.com/alipay/PainlessInferenceAcceleration.git
cd PainlessInferenceAcceleration/flood
```

- Install the package:

```bash
python setup.py install
```
We mainly develop and benchmark in the environment below; lower versions may also work.
- cuda >= 12.4 (higher is better)
- torch >= 2.5.0 (higher is better)
- triton >= 3.1.0 (higher is better)
- accelerate >= 1.4.0
- transformers >= 4.47.1
- flash-attn >= 2.6.3 is required if using the fa2 kernel
- flash-attn-3 >= 3.0.0 is required if using the fa3 kernel
- vLLM >= 0.6.2 is required if using INT8 quantization
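A quick sanity check of the installed versions (a minimal sketch; it only inspects the Python packages above and does not validate the CUDA toolkit installation itself):

```python
# Minimal check against the version requirements listed above.
import importlib

minimums = {
    "torch": "2.5.0",
    "triton": "3.1.0",
    "accelerate": "1.4.0",
    "transformers": "4.47.1",
}
for name, minimum in minimums.items():
    try:
        module = importlib.import_module(name)
        print(f"{name:12s} {module.__version__}  (need >= {minimum})")
    except ImportError:
        print(f"{name:12s} NOT INSTALLED  (need >= {minimum})")

import torch
print("CUDA runtime:", torch.version.cuda)  # should be >= 12.4

try:
    import flash_attn
    print("flash-attn  ", flash_attn.__version__, " (needed for the fa2 kernel)")
except ImportError:
    print("flash-attn not installed; only required for the fa2 kernel")
```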
A simple example can be found in example/simple_example.py.
To reproduce the reported performance, run benchmark/bench_flood.py.
Flood is inspired by the FlashAttention 2&3, FasterTransformer, vLLM, and FlashInfer projects.
[TBD]
```bibtex
@misc{zhao2025flood,
  title={Flood: A Throughput-oriented Inference Framework for Large Language Models with Pipeline Parallelism and Segmentable Cache},
  author={Yao Zhao and Chen Liang and Jingyu Hu and Zixuan Cheng and Zhen Wang and Longfei Li},
  year={2025}
}
```
For technical questions and feature requests, please use GitHub issues or discussions.