RoPE scaling support? #464
Hi @nivibilla, thanks for letting us know about the integration! We are interested in RoPE scaling and will investigate it.
Thank you! Looking forward to it.
@WoosukKwon Same request here; see also #479. NTK scaling could perform much better than PI (position interpolation). It would let vLLM run inference beyond 16k without users having to fine-tune their models.
@lucasjinreal Yes, that's right. And I've seen your issue. Can you make a PR? I'd be happy to test it out!
@nivibilla I have pasted the main modifications in that issue. Please adapt them to your vLLM base (my base has messy code, so a PR would be cumbersome). Feel free to ask me any questions if you get a chance to try it.
@lucasjinreal Sure, no problem. I will try it out.
Hey, any update on this?
RoPE scaling seems to have already been added in #555. I'm not sure how or whether to proceed here.
Implemented by #555 |
@WoosukKwon @youkaichao @nivibilla How does this extend the context length? Can you elaborate? If I need to extend llama3.1-8b from 8k to 128k, can I do that, and how? Is it done only with these args?
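As a rough illustration (not an answer from the maintainers), here is a minimal sketch of what those arguments can look like with vLLM's offline `LLM` API, assuming a recent release. Llama 3.1 already ships with llama3-style RoPE scaling in its HF config, so in principle only `max_model_len` needs to be raised; the `rope_scaling` override shown in the comment is for models without built-in scaling, and its exact dict keys vary between versions, so treat them as placeholders.

```python
# Sketch only: argument names and rope_scaling keys depend on the installed
# vLLM and transformers versions.
from vllm import LLM, SamplingParams

# Llama 3.1 already carries "llama3" RoPE scaling in its HF config, so the
# main knob is the context window (assuming enough GPU memory for the KV cache).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=131072,  # 128k tokens
)

# For a base model without built-in scaling, an override along these lines
# has been supported ("rope_type" may be "type" on older releases):
# llm = LLM(
#     model="meta-llama/Meta-Llama-3-8B",
#     max_model_len=32768,
#     rope_scaling={"rope_type": "dynamic", "factor": 4.0},
# )

out = llm.generate(["A very long document ..."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```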
HF merged RoPE scaling into their library. This allows the context length to be increased by 4x without retraining.
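For reference, a minimal sketch of the Hugging Face usage being referred to, assuming a transformers release that includes the `rope_scaling` config field for Llama-style models; the model name and scaling factor are illustrative only.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example checkpoint, any Llama-style model

# Linear (position-interpolation style) scaling by 4x: a 4k-context model can
# then attend over ~16k positions without retraining, at some quality cost.
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {"type": "linear", "factor": 4.0}

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```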