RoPE scaling support? #464

Closed

nivibilla opened this issue Jul 14, 2023 · 11 comments
Labels
feature request New feature or request

Comments

@nivibilla

HF merged RoPE scaling into their library. This allows increasing the context length by 4x without retraining.
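
For reference, the Hugging Face side is configured through a `rope_scaling` dict on the model config. A minimal sketch, assuming the keys used around transformers 4.31 (they have changed in later versions, and the model name is only an example):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Sketch only: at the time of this issue, transformers accepted a rope_scaling
# dict with "type" ("linear" or "dynamic") and a float "factor".
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.rope_scaling = {"type": "linear", "factor": 4.0}  # roughly 4x longer context

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    config=config,
)
```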

@WoosukKwon WoosukKwon added the feature request New feature or request label Jul 14, 2023
@WoosukKwon
Collaborator

Hi @nivibilla, thanks for letting us know about the integration! We are interested in RoPE scaling and will investigate it.

@nivibilla
Author

Thank you! Looking forward to it.

@lucasjinreal

@WoosukKwon Same request here, another issue: #479

NTK scaling could perform much better than PI (position interpolation). It would let vLLM run inference beyond 16k context without users having to fine-tune their models.
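
For context, the core of (static) NTK-aware scaling is just a rescaled RoPE base rather than interpolated positions. A minimal sketch, with illustrative names that are not taken from vLLM's code:

```python
import torch

def ntk_scaled_inv_freq(dim: int, base: float = 10000.0, factor: float = 4.0) -> torch.Tensor:
    """Inverse RoPE frequencies with an NTK-aware rescaled base.

    Instead of interpolating positions (as PI does), NTK-aware scaling enlarges
    the RoPE base so high-frequency components are preserved while low-frequency
    components are stretched to cover the longer context.
    """
    base = base * factor ** (dim / (dim - 2))
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
```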

@nivibilla
Author

@lucasjinreal Yes, that's right. And I've seen your issue. Can you make a PR? I'd be happy to test it out!

@lucasjinreal

@nivibilla I have pasted the main modification in that issue. Please adapt it to your vLLM base (my base has messy code, so a PR would be cumbersome). Feel free to ask me any questions if you get a chance to try it.

@nivibilla
Author

@lucasjinreal Sure, no problem. I will try it out.

@kir152

kir152 commented Sep 21, 2023

Hey, any update on this?

@viktor-ferenczi
Contributor

viktor-ferenczi commented Sep 26, 2023

I will give it a try, because I have to dig into positional encoding anyway for my planned work on #1161.

For my reference, this former PR shows where the RoPE code is: #1004

@viktor-ferenczi
Contributor

RoPE scaling seems to have already been added in #555.

I'm not sure how/whether to proceed here.

@WoosukKwon
Collaborator

Implemented by #555

pi314ever pushed a commit to pi314ever/vllm that referenced this issue Nov 20, 2024
The current implementation of the optimized top-p/top-k calculation for the scalar case handles duplicates that fall outside the k-th border. Unfortunately, analyzing duplicates requires a synchronization with the CPU, which makes multi-step scheduling useless together with top-p/top-k.

This PR adds an option to skip duplicate handling via `VLLM_HANDLE_TOPK_DUPLICATES` (default `True`). When duplicate handling is disabled through this variable, we avoid the synchronization with the CPU. The PR also removes the synchronization that was done earlier in the Sampler, by saving the scalar values of `top_k` and `top_p`. This should give a performance gain for all benchmarks with these sampling parameters, especially together with multi-step scheduling.

While disabling duplicate handling may cause small accuracy differences, the best solution would be to handle duplicates without CPU synchronization. However, that is not a trivial problem, so I will try to provide such a solution later.
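
Assuming the flag above is read as an ordinary environment variable, usage would look roughly like this (the accepted string values are an assumption and depend on that fork's parsing):

```python
import os

# Assumed usage: disable duplicate handling to avoid the CPU synchronization
# described above; the exact accepted values depend on the fork.
os.environ["VLLM_HANDLE_TOPK_DUPLICATES"] = "false"
```
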
@hahmad2008

hahmad2008 commented Nov 26, 2024

@WoosukKwon @youkaichao @nivibilla How does this extend the context length? Can you elaborate? If I need to extend llama3.1-8b from 8k to 128k, can I do that, and how? Only by using the --rope-scaling and --rope-theta args? How should they be configured?
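
Those flags map onto engine arguments of the same name. A minimal sketch of what a launch could look like, assuming the Python `LLM` entry point mirrors `--rope-scaling`/`--rope-theta`; the dict keys and the factor below are purely illustrative and vary by vLLM version and model, so this is not a verified recipe for reaching 128k:

```python
from vllm import LLM

# Sketch only: the keys accepted by rope_scaling ("type" vs "rope_type",
# "linear"/"dynamic"/"yarn", ...) differ across versions, and the factor
# here is illustrative rather than a tuned value.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=131072,
    rope_scaling={"rope_type": "dynamic", "factor": 16.0},
    rope_theta=500000.0,
)
```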

scarecr0w12 pushed a commit to scarecr0w12/vllm that referenced this issue May 27, 2025
* Fixing the shape to use in padding calculation

* Assertion on the int8 quantized MoE

* Properly testing for padding