Feature Req : Handle the Ratelimits in chatreadretrive #496
Comments
Linking your PR here: #500 In our load tests, we were able to increase TPM to the max (120/240 depending on the model) and then did not run into rate limits with the simulated users (50). Developers should first increase TPM as much as possible, and then consider implementing backoff, but keep in mind that backoff is most useful for smoothing over spikes, not for sustained excess TPM. In that case, developers need to increase TPM further or load balance (as you've noted in another issue).
@pamelafox I'm unsure how the load test was done and how many tokens each request consumed. If we take 3,000 tokens per chat request with 40 simultaneous users, a deployment with a 120K TPM limit will start to hit rate limit errors. At the 240K TPM max it can handle 80 users; beyond that, custom load handling/retry needs to be implemented. Is my assumption correct?
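For reference, here is a rough back-of-the-envelope version of that assumption; the 3,000-token figure and the one-request-per-user-per-minute pacing are assumptions, not measured values:

```python
# Hypothetical sizing check: how many concurrent users fit under a TPM quota,
# assuming each simulated user sends roughly one chat request per minute.
TOKENS_PER_REQUEST = 3000        # assumed average prompt + completion tokens
REQUESTS_PER_USER_PER_MIN = 1    # assumed pacing per simulated user

for tpm_limit in (120_000, 240_000):  # 120K / 240K tokens-per-minute quotas
    tokens_per_user = TOKENS_PER_REQUEST * REQUESTS_PER_USER_PER_MIN
    max_users = tpm_limit // tokens_per_user
    print(f"{tpm_limit:,} TPM ≈ {max_users} concurrent users before throttling")
```

With those assumptions the numbers come out to roughly 40 users at 120K TPM and 80 at 240K TPM, matching the figures above.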
@vikramhn Also consider adding an Application Gateway and additional OpenAI instances. Reference article: https://www.raffertyuy.com/raztype/azure-openai-load-balancing/
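For anyone who wants to try that idea before standing up a gateway, a minimal client-side sketch of the same load-spreading approach could look like the following; the endpoint URLs and keys are placeholders, and in the article's approach this routing happens in Application Gateway / API Management rather than in application code:

```python
import itertools

# Hypothetical pool of Azure OpenAI instances; replace with real endpoints/keys.
ENDPOINTS = [
    {"base_url": "https://aoai-eastus.openai.azure.com", "key": "<key-1>"},
    {"base_url": "https://aoai-westus.openai.azure.com", "key": "<key-2>"},
]
_round_robin = itertools.cycle(ENDPOINTS)

def next_endpoint() -> dict:
    """Pick the next Azure OpenAI instance, spreading TPM across deployments."""
    return next(_round_robin)
```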
@vrajroutu Thank you for the reference article. Very cool solution for the time being. I hope it can be abstracted and incorporated into an enterprise-grade Azure OpenAI premium product offering in the future.
@vikramhn For my test, each request took about 1,000 tokens, so it could handle a bit more. I think 3,000 is also a reasonable assumption, however, since requests get longer as users ask more questions, and some questions may have longer answers. You can see my load test in the locustfile.py in the root of this repo: https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/locustfile.py I have passed on feedback to the Azure OpenAI teams about how difficult it can be to work with the current TPM limits and rate limits.
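For readers who have not used Locust before, a minimal user class along these lines shows the general shape of such a load test; this is not the repo's actual locustfile.py, and the endpoint path and payload are assumptions for illustration only:

```python
from locust import HttpUser, task, between

class ChatUser(HttpUser):
    # Simulated think time between questions; the real locustfile.py
    # may use different pacing and payloads.
    wait_time = between(5, 20)

    @task
    def ask(self):
        # The "/chat" route and request body shape are hypothetical.
        self.client.post("/chat", json={
            "history": [{"user": "What does my health plan cover?"}],
        })
```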
Thanks @pamelafox for taking the feedback to the product team. Nice work on the load test, and thanks for the link.
This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this issue will be closed.
This issue is for a: (mark with an x)

Minimal steps to reproduce
Users often see rate limit issues; we should have the ability to add exponential backoff, as described by OpenAI:
https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb
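A minimal sketch of what that could look like in this repo's Python backend, using tenacity for jittered exponential backoff in the spirit of the cookbook notebook; the error class name assumes the 0.x openai Python SDK, so adjust it for newer SDK versions:

```python
import openai
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

# Retry only on rate limit errors, with jittered exponential backoff
# (1s to 60s), giving up after 6 attempts.
@retry(
    retry=retry_if_exception_type(openai.error.RateLimitError),
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(6),
)
def chat_completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)
```

As noted above, backoff like this mainly smooths over short spikes; sustained traffic beyond the TPM quota still requires more quota or load balancing.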
Any log messages given by the failure
Expected/desired behavior
OS and Version?
azd version?
Versions
Mention any other details that might be useful