FP32 RoPE kernel #1061
Conversation
This PR is essential for Code-Llama to generate correct code. Without FP32 RoPE, Code-Llama can't generate the correct indentation, leading to very low HumanEval scores.
Example (vLLM):

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
   for i in range(len(numbers) - 1):  # wrong indentation in this line
        if abs(numbers[i] - numbers[i + 1]) <= threshold:
            return True
    return False
```
Example (vLLM + FP32 RoPE):

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    for i in range(len(numbers) - 1):
        if abs(numbers[i] - numbers[i + 1]) <= threshold:
            return True
    return False
```
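For orientation, here is a minimal PyTorch sketch of what "FP32 RoPE" means in this context: the rotary embedding is applied in FP32 and the result is cast back to the model dtype. It uses the common rotate-half formulation and illustrative names; it is not the actual CUDA kernel changed in this PR.

```python
import torch


def apply_rope_fp32(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Apply rotary position embedding in FP32, then cast back to x's dtype.

    x: query or key tensor of shape [..., head_dim] (head_dim must be even)
    cos, sin: angle caches broadcastable to x, each covering the full head_dim
    """
    orig_dtype = x.dtype
    x32, cos32, sin32 = x.float(), cos.float(), sin.float()
    x1, x2 = x32.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)   # rotate-half
    out = x32 * cos32 + rotated * sin32      # all arithmetic in FP32
    return out.to(orig_dtype)
```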
Hi @imoneoi, thanks for bringing this up. We discussed this issue in #998 (comment), where we observed that using FP32 only for initializing RoPE is enough to preserve accuracy. Your evaluation result seems inconsistent with that. Could you provide a script for evaluation?
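For reference, a sketch of the FP32-initialization-only approach mentioned above (from the #998 discussion): the cos/sin cache is built in FP32 and only then cast down to the model dtype, while the per-token rotation itself still runs in FP16. Function and argument names here are illustrative, not vLLM's actual code.

```python
import torch


def build_rope_cache(head_dim: int, max_pos: int, base: float = 10000.0,
                     dtype: torch.dtype = torch.float16) -> torch.Tensor:
    # Compute inverse frequencies and angles in FP32 so large positions do not
    # lose precision, then cast the finished cos/sin cache to the model dtype.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    t = torch.arange(max_pos, dtype=torch.float32)
    freqs = torch.outer(t, inv_freq)                        # [max_pos, head_dim // 2]
    cache = torch.cat((freqs.cos(), freqs.sin()), dim=-1)   # [max_pos, head_dim]
    return cache.to(dtype)
```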
Hi @WoosukKwon, my results were tested with the Code-Llama weights and the vLLM server:

```bash
python -m vllm.entrypoints.openai.api_server --model imone/CodeLlama_13B_with_EOT_token --host 127.0.0.1 --port 5000 --max-num-batched-tokens 16384 --worker-use-ray --engine-use-ray
```

EvalPlus:

```bash
python -m codegen.generate --model codegen-16b --bs 1 --temperature 0 --greedy --n_samples 1 --root ./data/codellama_13b_fp32_rope
docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples ./data/codellama_13b_fp32_rope/humaneval/codegen-16b_temp_0.0/
```
@imoneoi Got it. Could you
@WoosukKwon Do you know the difference in rounding methods between PyTorch
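For context on rounding modes: PyTorch's FP32-to-FP16 cast follows IEEE round-to-nearest-even, which is visible with a value that lies exactly between two representable FP16 numbers:

```python
import torch

# Above 2048, FP16 can only represent even integers (ulp = 2).
# 2049.0 lies exactly halfway between 2048.0 and 2050.0 and is rounded
# to the neighbor with the even mantissa (2048.0) under round-to-nearest-even.
x = torch.tensor(2049.0, dtype=torch.float32)
print(x.half().item())  # 2048.0
```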
We also tested CodeLlama-Instruct 13B; FP32 RoPE gives about a 1% improvement on HumanEval+. Before PR:
After PR:
While I believe this should not be the case, I also experienced the weird precision error when applying this PR...
This PR is stuck because we found a weird error in the test. More specifically, in the branch
@imoneoi @WoosukKwon I tested the same model: codellama-13b-instruct. Before (latest main branch) and after the PR, the results are the same. Base
I have spent some time trying out different rounding modes in the kernel, and none of them makes all the tests pass. My assumption is that something in the reference implementation is most likely causing it. Here are the max and min differences between the kernel and the reference for the first test case (
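For readers following the debugging: differences like the ones quoted above are typically computed by comparing the kernel output against the reference in FP32. A sketch with hypothetical placeholder tensors (out_kernel and out_ref are stand-ins, not the actual test tensors):

```python
import torch

# Hypothetical stand-ins for the kernel output and the reference output.
out_kernel = torch.randn(16, 64).half()
out_ref = (out_kernel.float() + 1e-3 * torch.randn(16, 64)).half()

# Upcast to FP32 so the comparison itself does not introduce extra rounding.
diff = out_kernel.float() - out_ref.float()
print("max diff:", diff.max().item())
print("min diff:", diff.min().item())
print("max abs diff:", diff.abs().max().item())
```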
@Yard1 Yeah, this one is really weird. I've checked both the reference implementation and our kernel multiple times to find any potential source of the precision error, but totally failed.
What are your EvalPlus settings? The results seem much higher.
@Yard1 I also cannot figure out the precision issue. Is it because of the non-associativity of floating-point computations?
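On the non-associativity point: floating-point addition is indeed not associative, so two mathematically equivalent summation orders (for example, a different reduction order inside a kernel versus the reference) can give different results. A small FP32 illustration, with values chosen only to make the effect visible:

```python
import torch

a = torch.tensor(1e8, dtype=torch.float32)
b = torch.tensor(-1e8, dtype=torch.float32)
c = torch.tensor(1.0, dtype=torch.float32)

print(((a + b) + c).item())  # 1.0
print((a + (b + c)).item())  # 0.0: the 1.0 is lost when added to -1e8 first
```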
@imoneoi I changed the system prompt and added some post-processing.