Commit e58e377

elfiegg and mgoin authored and committed
[Core] Default to using per_token quantization for fp8 when cutlass is supported. (vllm-project#8651)
Signed-off-by: mgoin <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: mgoin <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
1 parent bd43c25 commit e58e377

1 file changed, +2 -1 lines changed
vllm/model_executor/layers/quantization/fp8.py

Lines changed: 2 additions & 1 deletion
@@ -355,7 +355,8 @@ def apply(self,
             input_scale=layer.input_scale,
             bias=bias,
             cutlass_fp8_supported=self.cutlass_fp8_supported,
-            use_per_token_if_dynamic=False)
+            # Default to using per_token quantization if cutlass is supported
+            use_per_token_if_dynamic=self.cutlass_fp8_supported)
 
 
 class Fp8MoEMethod(FusedMoEMethodBase):
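
To illustrate what the change toggles, here is a minimal sketch of the difference between a single per-tensor scale and per-token (per-row) scales when activation scales are computed dynamically. The helper names and the sample activations are hypothetical and are not vLLM APIs; the sketch only computes scales and does not reproduce the actual FP8 rounding or the CUTLASS kernel.

```python
# Illustrative sketch only: these helpers are hypothetical, not vLLM functions.
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def per_tensor_scale(x):
    """One dynamic scale shared by every token (row) of the activation."""
    amax = max(abs(v) for row in x for v in row)
    return amax / FP8_E4M3_MAX


def per_token_scales(x):
    """One dynamic scale per token (row), taken from that row's own max."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in x]


# Two tokens with very different magnitudes (made-up values).
activations = [
    [0.5, -0.25, 0.125],      # small-magnitude token
    [448.0, -224.0, 112.0],   # large-magnitude token
]

tensor_scale = per_tensor_scale(activations)   # dominated by the large token
token_scales = per_token_scales(activations)   # each token gets its own scale

print("per-tensor scale:", tensor_scale)
print("per-token scales:", token_scales)
```

With a single per-tensor scale, the small-magnitude token is quantized with the step size dictated by the large token. With per-token scales, each row uses the full FP8 range on its own. A real implementation would also need to handle all-zero rows, which this sketch does not.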
