
RoPE should be applied with float32 #863


Closed
imoneoi opened this issue Aug 25, 2023 · 4 comments
Labels: feature request

Comments

@imoneoi
Contributor

imoneoi commented Aug 25, 2023

It seems that the RoPE (sin, cos) cache should be stored and applied in fp32 and then cast back to fp16/bf16, rather than being downcast up front as in:

cache = cache.to(torch_dtype)

Reference implementation from Llama 2 / Code Llama:

from typing import Tuple

import torch


def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    # Frequencies and angles are computed entirely in float32; the cache is complex64.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device, dtype=torch.float32)  # type: ignore
    freqs = torch.outer(t, freqs)  # type: ignore
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64
    return freqs_cis


def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Upcast q/k to float32, rotate in complex space, then cast back to the input dtype.
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)
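
For context, the reshape_for_broadcast helper referenced above is not included in the snippet; a minimal version consistent with the shapes used there (freqs_cis of shape (seqlen, head_dim // 2), xq_ of shape (batch, seqlen, n_heads, head_dim // 2)) would be:

def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Expand freqs_cis to (1, seqlen, 1, head_dim // 2) so it broadcasts over
    # the batch and head dimensions of the complex-valued x.
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i in (1, x.ndim - 1) else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)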
@WoosukKwon
Collaborator

Hi @imoneoi, thanks for pointing this out and submitting a PR to fix it. To my understanding, the data type used in RoPE differs between Meta's original LLaMA implementation (which you attached here) and HF's. If I understand correctly, in HF Transformers the cos and sin embeddings are converted to the data type of the input q and k tensors, which keep their original data type. I've checked that our current kernel implementation passes the unit test (https://github.com/vllm-project/vllm/blob/main/tests/kernels/test_pos_encoding.py), while your implementation in #870 does not.

In #870, could we make it optional to cast the intermediate tensors to float32? Since I believe most people expect vLLM to be compatible with HF Transformers, I'd like to keep the current behavior as the default. However, it would be nice to have float32 casting as an option for advanced users who care about matching the original implementation.
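
To make the two behaviors concrete, here is a rough Python-level sketch (not vLLM's kernel code; the apply_rope name and use_float32 flag are hypothetical, and it uses the rotate_half layout rather than the interleaved complex layout above):

import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin, use_float32: bool = False):
    # cos/sin: float32 cache, broadcastable to q/k.
    if use_float32:
        # Meta-style: rotate in float32, cast back to the input dtype at the end.
        q32, k32 = q.float(), k.float()
        return (
            ((q32 * cos) + (rotate_half(q32) * sin)).type_as(q),
            ((k32 * cos) + (rotate_half(k32) * sin)).type_as(k),
        )
    # HF-style: downcast cos/sin to the q/k dtype and rotate in that precision.
    cos, sin = cos.to(q.dtype), sin.to(q.dtype)
    return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)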

@imoneoi
Contributor Author

imoneoi commented Aug 25, 2023

Yes, I understand your concern. I'm mostly thinking about the latest Code Llama. When theta is large, the accuracy problem may be exacerbated, so we may need full-precision RoPE.

I can implement the cast option and add an extra kernel for full-precision RoPE. By the way, this PR is editable and I welcome your changes.
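
As a rough, standalone illustration of the precision concern (not vLLM code; 1e6 is the RoPE base used by Code Llama), one can measure the rounding error introduced by downcasting the cos/sin cache:

import torch

def cos_sin(theta: float, dim: int, max_pos: int, dtype: torch.dtype):
    # Angles are always computed in float32; only the final cache is downcast.
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.outer(torch.arange(max_pos, dtype=torch.float32), inv_freq)
    return torch.cos(angles).to(dtype), torch.sin(angles).to(dtype)

cos32, sin32 = cos_sin(1_000_000.0, 128, 16384, torch.float32)
cos16, sin16 = cos_sin(1_000_000.0, 128, 16384, torch.bfloat16)
print("max |cos error|:", (cos32 - cos16.float()).abs().max().item())
print("max |sin error|:", (sin32 - sin16.float()).abs().max().item())

Whether an error of this size matters downstream is model-dependent, which is what the optional float32 path is meant to address.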

@imoneoi
Contributor Author

imoneoi commented Aug 25, 2023

For the unit test, I wonder if it's due to some differences in type casting? Is Tensor.to the same as static_cast<>?

Looks like only one item in the whole array has a small rounding error.
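
One way such a single-element mismatch can arise purely from the order of casting (a standalone snippet, not the vLLM kernel or its test): rounding the operands to half precision before the multiply occasionally lands on a different half-precision value than multiplying in float32 and rounding once at the end.

import torch

torch.manual_seed(0)
x = torch.randn(1 << 16, dtype=torch.float32)
c = torch.cos(torch.randn(1 << 16, dtype=torch.float32))

a = x.half() * c.half()   # round twice: operands first, then the product
b = (x * c).half()        # round once: only the float32 product

diff = (a.float() - b.float()).abs()
print("mismatching elements:", (diff > 0).sum().item(), "max diff:", diff.max().item())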

@hmellor
Member

hmellor commented Mar 8, 2024

Closing as this now appears to be resolved.
