FLOOD, a throughput-oriented inference framework with pipeline parallelism and segmentable cache.
- [2025/03] We release the code of our inference framework FLOOD.
Flood is a high-throughput inference framework designed for offline applications. It employs pipeline parallelism (PP) to avoid the communication costs associated with tensor parallelism (TP), and it incorporates scheduling strategies tailored for offline inference to maximize GPU utilization.
Furthermore, Flood manages the kvcache with segmentable blocks instead of paged blocks, which keeps each request's kvcache more contiguous (a sketch of the idea follows the feature list below).
Additionally, we have developed an attention kernel, termed SegmentAttention, that operates on the segmentable kvcache. Flood currently supports a range of features, including:
- Zero-overhead continuous batching
- Chunked prefill
- Inference of quantized (FP8/INT8) models
- Inference of multi-modal models
- Streaming inference
- PPL (Perplexity) evaluation
- Sampling methods
- Multi-node inference (experimental)
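The segmentable-block idea can be illustrated with a small allocator sketch. This is not Flood's actual implementation; the `PagedAllocator` and `SegmentAllocator` classes below are hypothetical and only meant to show why a segmentable block keeps a request's kvcache contiguous, whereas a paged pool scatters it across fixed-size blocks.

```python
# Hypothetical sketch: paged vs. segmentable kvcache allocation.
# PagedAllocator / SegmentAllocator are illustrative names, not Flood's API.

class PagedAllocator:
    """Paged kvcache: a request receives many fixed-size blocks that may be
    scattered anywhere in the pool, so its cache is generally non-contiguous."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))

    def allocate(self, num_tokens):
        n = -(-num_tokens // self.block_size)               # ceil(num_tokens / block_size)
        return [self.free_blocks.pop() for _ in range(n)]   # scattered block ids


class SegmentAllocator:
    """Segmentable kvcache: a request receives one contiguous segment; free
    space is tracked as (start, length) segments that can be split on demand."""
    def __init__(self, total_tokens):
        self.free_segments = [(0, total_tokens)]

    def allocate(self, num_tokens):
        for i, (start, length) in enumerate(self.free_segments):
            if length >= num_tokens:
                # split the free segment: hand out a contiguous range, keep the rest
                self.free_segments[i] = (start + num_tokens, length - num_tokens)
                return (start, num_tokens)
        raise MemoryError("no contiguous segment large enough")


if __name__ == "__main__":
    print(PagedAllocator(256).allocate(40))    # e.g. [255, 254, 253] -> separate blocks
    print(SegmentAllocator(4096).allocate(40)) # (0, 40) -> one contiguous range
```

A contiguous segment lets an attention kernel read a request's keys and values with simple strided accesses instead of gathering them block by block, which is presumably what SegmentAttention exploits.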
Our framework is undergoing rapid iteration, which may result in some features having bugs. If you encounter any issues, please feel free to report them.
Currently supported models:
- Ling MoE
- Ling
- Llama
- Qwen
- Deepseek v1
Roadmap:
- Integrate our previous work LOOKAHEAD.
- Improve prefill performance with prefix caching.
- Improve performance with CUDA Graph.
- Support more models, including Deepseek R1, etc.
- Implement segment attention with CUTE for better performance, especially with FP8 kvcache.
Performance is measured in tokens per second (token/s) of generated tokens. The vLLM version is 0.6.6.post2; we enable chunked prefill with a chunk size of 2048, and the other parameters are left at their defaults. The model architecture of Ling can be found in the Ling technical report.
| model | dataset | GPU | vLLM (token/s) | flood (token/s) | speedup |
|---|---|---|---|---|---|
| Llama3-8B | shareGPT | 1 * A100 | 3201 | 4529 | 1.41 |
| Ling-Lite | shareGPT | 1 * H20 | 4355 | 5869 | 1.35 |
| Ling-Lite | shareGPT | 1 * A100 | 3576 | 5451 | 1.52 |
| Ling-Plus (FP8) | shareGPT | 8 * H20 | 2742 | 6569 | 2.40 |
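For context, the vLLM baseline above was run with chunked prefill enabled and a chunk size of 2048, with everything else left at defaults. A minimal sketch of such a configuration (the model path and prompt are placeholders, not the exact benchmark script) looks like:

```python
# Sketch of the vLLM 0.6.x baseline settings described above:
# chunked prefill enabled with a 2048-token chunk, other options at their defaults.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/Llama3-8B",       # placeholder model path
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,     # chunk size of 2048
)
outputs = llm.generate(
    ["An example prompt from shareGPT ..."],   # placeholder prompt
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```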
Performance is measured in TFLOPS (tera floating-point operations per second). The attention head number is 64, the kv head number is 8, and the kv head dimension is 128. We use flash_attn_2_cuda.varlen_fwd of flash-attn-2 on A100 and flash_attn_3_cuda.fwd of flash-attn-3 on H20.
More details can be found in benchmark/bench_seg_attn.py.
| Device | BatchSize | Q_len | K_len | flash-attn (TFLOPS) | seg-attn (TFLOPS) | speedup |
|---|---|---|---|---|---|---|
| A100 | 1 | 1024 | 1024 | 99.19 | 107.35 | 1.08 |
| A100 | 128 | 1 | 1024 | 10.65 | 13.56 | 1.27 |
| H20 | 1 | 1024 | 1024 | 90.28 | 96.05 | 1.06 |
| H20 | 128 | 1 | 1024 | 7.16 | 22.63 | 3.16 |
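The TFLOPS figures above can be approximated with the usual attention FLOP count of 4 * batch * query_heads * q_len * k_len * head_dim per forward pass. The sketch below times the decode-style case (BatchSize 128, Q_len 1) through the public flash_attn wrapper rather than the raw flash_attn_2_cuda.varlen_fwd binding; see benchmark/bench_seg_attn.py for the exact measurement.

```python
# Rough TFLOPS measurement for the flash-attn-2 baseline (public wrapper API).
# GQA setting from above: 64 query heads, 8 kv heads, head dimension 128.
import torch
from flash_attn import flash_attn_func   # requires flash-attn >= 2.6.3

B, QH, KH, D = 128, 64, 8, 128
Q_LEN, K_LEN = 1, 1024                    # decode-style row of the table

q = torch.randn(B, Q_LEN, QH, D, dtype=torch.float16, device="cuda")
k = torch.randn(B, K_LEN, KH, D, dtype=torch.float16, device="cuda")
v = torch.randn(B, K_LEN, KH, D, dtype=torch.float16, device="cuda")

for _ in range(10):                       # warmup
    flash_attn_func(q, k, v, causal=True)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    flash_attn_func(q, k, v, causal=True)
end.record()
torch.cuda.synchronize()
seconds = start.elapsed_time(end) / 1e3 / iters

flops = 4 * B * QH * Q_LEN * K_LEN * D    # 2 for Q @ K^T plus 2 for P @ V
print(f"{flops / seconds / 1e12:.2f} TFLOPS")
```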
- Clone this repository and navigate to the flood folder:

```bash
git clone https://github.com/alipay/PainlessInferenceAcceleration.git
cd PainlessInferenceAcceleration/flood
```

- Install the package:

```bash
python setup.py install
```
We mainly develop and benchmark in the environment below; lower versions may also work.
- cuda >= 12.4 (higher is better)
- torch >= 2.5.0 (higher is better)
- triton >= 3.1.0 (higher is better)
- accelerate >= 1.4.0
- transformers >= 4.47.1
- flash-attn >= 2.6.3 is required if using the fa2 kernel
- flash-attn-3 >= 3.0.0 is required if using the fa3 kernel
- vLLM >= 0.6.2 is required if using INT8 quantization
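A quick sanity check of the installed versions (a minimal sketch; it only inspects the Python packages above and does not validate the CUDA toolkit installation itself):

```python
# Minimal check against the version requirements listed above.
import importlib

minimums = {
    "torch": "2.5.0",
    "triton": "3.1.0",
    "accelerate": "1.4.0",
    "transformers": "4.47.1",
}
for name, minimum in minimums.items():
    try:
        module = importlib.import_module(name)
        print(f"{name:12s} {module.__version__}  (need >= {minimum})")
    except ImportError:
        print(f"{name:12s} NOT INSTALLED  (need >= {minimum})")

import torch
print("CUDA runtime:", torch.version.cuda)  # should be >= 12.4

try:
    import flash_attn
    print("flash-attn  ", flash_attn.__version__, " (needed for the fa2 kernel)")
except ImportError:
    print("flash-attn not installed; only required for the fa2 kernel")
```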
A simple example can be found in example/simple_example.py.
To reproduce the reported performance, run benchmark/bench_flood.py.
Flood is inspired by the FlashAttention 2&3, FasterTransformer, vLLM, and FlashInfer projects.
[TBD]
```bibtex
@misc{zhao2025flood,
  title={Flood: A Throughput-oriented Inference Framework for Large Language Models with Pipeline Parallelism and Segmentable Cache},
  author={Yao Zhao and Chen Liang and Jingyu Hu and Zixuan Cheng and Zhen Wang and Longfei Li},
  year={2025}
}
```
For technical questions and feature requests, please use GitHub issues or discussions.