Feature Req : Handle the Ratelimits in chatreadretrive #496
Comments
Linking your PR here: #500 In our load tests, we were able to increase TPM to the max (120/240 depending on the model) and then did not run into rate limits with the simulated users (50). Developers should first increase TPM as much as possible, and then consider implementing backoff, but keep in mind that backoff is most useful for smoothing over spikes, not for sustained excess TPM. In that case, developers need to increase TPM further or load balance (as you've noted in another issue).
@pamelafox I'm unsure how the load test was done and how many tokens each request consumed. If we take 3,000 tokens per chat request with 40 simultaneous users, a deployment with a 120K TPM limit will start to hit rate limit errors. At the 240K TPM max it can handle 80 users; beyond that, custom load handling/retry needs to be implemented. Is my assumption correct?
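For reference, here is a rough back-of-the-envelope version of that assumption; the 3,000-token figure and the one-request-per-user-per-minute pacing are assumptions, not measured values:

```python
# Hypothetical sizing check: how many concurrent users fit under a TPM quota,
# assuming each simulated user sends roughly one chat request per minute.
TOKENS_PER_REQUEST = 3000        # assumed average prompt + completion tokens
REQUESTS_PER_USER_PER_MIN = 1    # assumed pacing per simulated user

for tpm_limit in (120_000, 240_000):  # 120K / 240K tokens-per-minute quotas
    tokens_per_user = TOKENS_PER_REQUEST * REQUESTS_PER_USER_PER_MIN
    max_users = tpm_limit // tokens_per_user
    print(f"{tpm_limit:,} TPM ≈ {max_users} concurrent users before throttling")
```

With those assumptions the numbers come out to roughly 40 users at 120K TPM and 80 at 240K TPM, matching the figures above.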
@vikramhn Also consider adding an Application Gateway and additional OpenAI instances. Reference article: https://www.raffertyuy.com/raztype/azure-openai-load-balancing/
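For anyone who wants to try that idea before standing up a gateway, a minimal client-side sketch of the same load-spreading approach could look like the following; the endpoint URLs and keys are placeholders, and in the article's approach this routing happens in Application Gateway / API Management rather than in application code:

```python
import itertools

# Hypothetical pool of Azure OpenAI instances; replace with real endpoints/keys.
ENDPOINTS = [
    {"base_url": "https://aoai-eastus.openai.azure.com", "key": "<key-1>"},
    {"base_url": "https://aoai-westus.openai.azure.com", "key": "<key-2>"},
]
_round_robin = itertools.cycle(ENDPOINTS)

def next_endpoint() -> dict:
    """Pick the next Azure OpenAI instance, spreading TPM across deployments."""
    return next(_round_robin)
```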
@vrajroutu Thank you for the reference article. Very cool solution for the time being. I hope it can be abstracted and incorporated into an enterprise-grade Azure OpenAI premium product offering in the future.
@vikramhn For my test, each request took about 1,000 tokens, so it could handle a bit more. I think 3,000 is also a reasonable assumption, however, since requests get longer as users ask more questions, and some questions may have longer answers. You can see my load test in the locustfile.py in the root of this repo: https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/locustfile.py I have passed on feedback to the Azure OpenAI teams about how difficult it can be to work with the current TPM limits and rate limits.
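For readers who have not used Locust before, a minimal user class along these lines shows the general shape of such a load test; this is not the repo's actual locustfile.py, and the endpoint path and payload are assumptions for illustration only:

```python
from locust import HttpUser, task, between

class ChatUser(HttpUser):
    # Simulated think time between questions; the real locustfile.py
    # may use different pacing and payloads.
    wait_time = between(5, 20)

    @task
    def ask(self):
        # The "/chat" route and request body shape are hypothetical.
        self.client.post("/chat", json={
            "history": [{"user": "What does my health plan cover?"}],
        })
```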
Thanks @pamelafox for taking the feedback to the product team. Nice work on the load test, and thanks for the link.
This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this issue will be closed.
This issue is for a: (mark with an x)

Minimal steps to reproduce
Users often see rate limit issues; we should have the ability to add exponential backoff, as described by OpenAI:
https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb
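A minimal sketch of what that could look like in this repo's Python backend, using tenacity for jittered exponential backoff in the spirit of the cookbook notebook; the error class name assumes the 0.x openai Python SDK, so adjust it for newer SDK versions:

```python
import openai
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

# Retry only on rate limit errors, with jittered exponential backoff
# (1s to 60s), giving up after 6 attempts.
@retry(
    retry=retry_if_exception_type(openai.error.RateLimitError),
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(6),
)
def chat_completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)
```

As noted above, backoff like this mainly smooths over short spikes; sustained traffic beyond the TPM quota still requires more quota or load balancing.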
Any log messages given by the failure
Expected/desired behavior
OS and Version?
azd version?
Versions
Mention any other details that might be useful