Exponential Backoff Mechanism for RateLimit Issues in /Chat #500
Conversation
Thanks @vrajroutu ! We were just discussing this today. Did you do loadtesting with this in place? I'm wondering if the backoff ends up increasing the requests overall, putting users in competition with each other, or if you find it's working well to alleviate "burst" situations. Also I assume you're maxing your deployments to 240K for this situation.
Firstly, enabling streaming in the environment has helped to reduce rate limit issues, but even with 100+ users on the platform simultaneously, I still observed a few rate limit problems. In my exploration of solutions, I came across various articles, including this one: AvoidRateLimits. This change won't impact the existing functionality, but it will be beneficial during high load on the environment and heavy request volume to the OAI model. A similar change was successfully implemented in prepdocs to address rate limit issues when indexing documents. I plan to conduct load testing and will update the results accordingly. This change is especially beneficial for users utilizing GPT-4, as the TPM is limited to 90k per subscription. By implementing the suggested approach, we can better manage the rate limits and ensure a smoother user experience, even during peak usage periods.
Ah, I didn't realize GPT-4 had a lower TPM limit. Yeah, the backoff technique works well in prepdocs.py, where there's a single caller to the API - my concern with using it for the per-user API calls is that it may increase load overall, if there truly are more users than it can handle. I could see this approach being able to smooth over spikes of activity, but not being the solution to a long period of high load. Other ideas: request more quota, put up a message asking users to reload the page (thus reducing their history), or just remove the history entirely from messages. Let us know how it goes in production!
Absolutely, I agree with your points. Increasing the capacity of GPT-4 would certainly be a long-term solution, but since quota increases are currently paused by the MS team, we need to find ways to optimize the current setup. Enabling streaming has already shown positive results in reducing rate limit issues, and incorporating the backoff technique will further help in smoothing out spikes in activity. I understand that users might value keeping the chat history and continuing their conversations seamlessly. We can explore different options to manage the rate limit, like gradually requesting more quota and monitoring the usage patterns. If the backoff strategy combined with streaming can alleviate the majority of the rate limit issues and provide a good user experience, that would be a positive step forward. I will keep an eye on the system's performance.
This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed.
I think we may still want to merge this, I just want to do more loadtesting first.
This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed.
@vrajroutu Are you still using tenacity in this situation, now that the OpenAI SDK has built-in retries and a customizable max_retries parameter? It seems less necessary these days.
Hi @pamelafox, I don't think it's necessary anymore. Users can now utilize API management or an app gateway to scale Azure OpenAI to multiple instances, and it works quite effectively.
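For reference, here is a minimal sketch of the built-in retry behavior mentioned above, using the 1.x OpenAI Python SDK against Azure OpenAI; the endpoint, key, API version, and deployment name are placeholders:

```python
from openai import AzureOpenAI

# The 1.x SDK retries failed requests (including 429 rate limit responses)
# itself, with exponential backoff, up to max_retries times.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-key>",                                       # placeholder
    api_version="2024-02-01",                                   # placeholder
    max_retries=5,  # default is 2
)

response = client.chat.completions.create(
    model="chat",  # Azure OpenAI deployment name (placeholder)
    messages=[{"role": "user", "content": "Hello"}],
)
```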
Purpose
Does this introduce a breaking change?
No - existing functionality is not affected.
Pull Request Type
What kind of change does this Pull Request introduce?
When releasing this app in production with 100+ users, there is a possibility of encountering rate limit issues when using the chat feature. To mitigate this, we can implement tenacity exponential backoff. This approach automatically retries chat requests when a rate limit is encountered and returns the appropriate response once the rate limit has been lifted. By incorporating tenacity exponential backoff, the app will handle rate limit scenarios more gracefully and provide a smoother user experience during periods of high traffic or usage.
Reference: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb
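As a rough sketch of this approach (not necessarily the exact code in this PR), a tenacity-based wrapper around the chat completion call could look like the following, assuming the pre-1.0 openai Python SDK; the function name, retry window, and deployment name are illustrative:

```python
import openai
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

# Assumes openai.api_type / api_base / api_key are already configured
# for Azure OpenAI elsewhere in the app.

# Retry only on rate limit errors, waiting 1-60 seconds with exponential
# backoff plus jitter, and giving up after 6 attempts.
@retry(
    retry=retry_if_exception_type(openai.error.RateLimitError),
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(6),
)
def chat_completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)

# Hypothetical usage with an Azure OpenAI deployment named "chat":
response = chat_completion_with_backoff(
    engine="chat",
    messages=[{"role": "user", "content": "Hello"}],
)
```

The jittered exponential wait (wait_random_exponential) helps avoid synchronized retry storms when many users hit the rate limit at the same time.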
How to Test
azd deploy
What to Check
Verify that the following are valid
Other Information