
add middleware for request prioritization #33951

Open · wants to merge 5 commits into base: main

Conversation

@bohde (Contributor) commented Mar 20, 2025

This adds a middleware for overload protection, intended to help protect against malicious scrapers. It does this via [codel](https://github.com/bohde/codel), which performs the following:

  1. Limit the number of in-flight requests to some user-defined max
  2. When in-flight requests have reached their max, begin queuing requests, with logged-in requests having priority over logged-out requests
  3. Once a request has been queued for too long, it has a percentage chance to be rejected based upon how overloaded the entire system is.

When a server experiences more traffic than it can handle, this has the effect of keeping latency low for logged-in users, while rejecting just enough requests from logged-out users to keep the service from being overloaded.
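
As a rough illustration of the queue-then-shed behavior, here is a minimal sketch assuming a plain `net/http` middleware. This is not the code in this PR and not the bohde/codel API: codel adapts its drop decision to the measured queue delay rather than using a fixed wait, and it prioritizes inside the queue itself, so the shorter wait for anonymous requests below is only a crude stand-in for that prioritization.

```go
package qos

import (
	"net/http"
	"time"
)

// QoSMiddleware caps in-flight requests, queues the overflow, and rejects
// requests that wait too long. Illustrative sketch only.
type QoSMiddleware struct {
	slots      chan struct{}            // capacity = max in-flight requests
	maxWait    time.Duration            // placeholder for codel's adaptive behavior
	isLoggedIn func(*http.Request) bool // assumption: caller supplies the session check
}

func New(maxInFlight int, maxWait time.Duration, isLoggedIn func(*http.Request) bool) *QoSMiddleware {
	return &QoSMiddleware{
		slots:      make(chan struct{}, maxInFlight),
		maxWait:    maxWait,
		isLoggedIn: isLoggedIn,
	}
}

func (m *QoSMiddleware) Handler(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		wait := m.maxWait
		if !m.isLoggedIn(r) {
			// Crude stand-in for lower priority: anonymous requests give up sooner.
			wait = m.maxWait / 2
		}
		timer := time.NewTimer(wait)
		defer timer.Stop()

		select {
		case m.slots <- struct{}{}: // acquired an in-flight slot
			defer func() { <-m.slots }()
			next.ServeHTTP(w, r)
		case <-timer.C: // queued too long: shed this request
			http.Error(w, "server is overloaded, please retry later", http.StatusServiceUnavailable)
		case <-r.Context().Done(): // client gave up while waiting
		}
	})
}
```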

Below are benchmarks showing a system at 100% capacity and 200% capacity in a few different configurations. The 200% capacity is shown to highlight an extreme case. I used [hey](https://github.com/rakyll/hey) to simulate the bot traffic:

```
hey -z 1m -c 10 "http://localhost:3000/rowan/demo/issues?state=open&type=all&labels=&milestone=0&project=0&assignee=0&poster=0&q=fix"
```

The concurrency of 10 was chosen from experiments where my local server began to experience higher latency.

Results

| Method | Concurrency | p95 latency | Successful RPS | Requests Dropped |
|--------|-------------|-------------|----------------|------------------|
| QoS Off | 10 | 0.2960s | 44 rps | 0% |
| QoS Off | 20 | 0.5667s | 44 rps | 0% |
| QoS On | 20 | 0.4409s | 48 rps | 10% |
| QoS On 50% Logged In* | 10 | 0.3891s | 33 rps | 7% |
| QoS On 50% Logged Out* | 10 | 2.0388s | 13 rps | 6% |

Logged-in users were given the additional parameter `-H "Cookie: i_like_gitea=<session>"`.

Tests marked with `*` were run at the same time, representing a workload with mixed logged-in and logged-out users. Results are separated to show prioritization: logged-in users experience roughly a 100ms latency increase under load, compared to the 1.6-second increase logged-out users see.

@GiteaBot added the lgtm/need 2 label (This PR needs two approvals by maintainers to be considered for merging) Mar 20, 2025
@pull-request-size bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files) Mar 20, 2025
@github-actions bot added the modifies/go (Pull requests that update Go code), modifies/dependencies, and docs-update-needed (The document needs to be updated synchronously) labels Mar 20, 2025
@bohde force-pushed the rb/request-qos branch 3 times, most recently from f3096c1 to 6c499a9 on March 20, 2025 18:14
@wxiaoguang (Contributor):

Results ....

TBH, according to your test result, I do not see it is really useful .......

@a1012112796 (Member) left a comment:

I think as a first step, just distinguishing between logged in and not is enough. And it looks like it can be added for the API router also.

@bohde (Contributor, Author) commented Mar 21, 2025

TBH, according to your test result, I do not see it is really useful .......

We've been running this patch for a while, and while it takes a bit of tuning to get the right settings for a given server, it has alleviated a lot of outage concerns from waves of scrapers that don't respect robots.txt.

I think as a first step, just distinguishing between logged in and not is enough. And it looks like it can be added for the API router also.

This was how the initial version we ran worked, but with that approach users who are attempting to log in can still be treated as lower-priority traffic. Since the major source of traffic for public code hosting is scrapers trying to download content, deprioritizing those routes mitigates that concern.

@wxiaoguang (Contributor):

TBH, according to your test result, I do not see it is really useful .......

We've been running this patch for a while, and while it takes a bit of tuning to get the right settings for a given server, it has alleviated a lot of outage concerns from waves of scrapers that don't respect robots.txt.

So the original intention is to fight with crawlers/spiders/robots?

@bohde (Contributor, Author) commented Mar 21, 2025

TBH, according to your test result, I do not see it is really useful .......

We've been running this patch for a while, and while it takes a bit of tuning to get the right settings for a given server, it has alleviated a lot of outage concerns from waves of scrapers that don't respect robots.txt.

So the original intention is to fight with crawlers/spiders/robots?

Yes, I mentioned this in my description

This adds a middleware for overload protection, that is intended to help protect against malicious scrapers

@wxiaoguang (Contributor) commented Mar 21, 2025

Yes, I mentioned this in my description

Sorry, missed that part. The "perform the following" part and "result" part attracted my attention .....

If the intention is to "protect against malicious scrapers", I think the `parts := []string{` code could be improved; a function similar to `ParseGiteaSiteURL` could help with the route path handling. And I think we need some tests to cover the new middleware's behavior.


Or, as https://github.com/go-gitea/gitea/pull/33951/files#r2006668195 said, mark the routes with a priority.

@wxiaoguang (Contributor):

I think I have some ideas about how to handle these "route paths" clearly. Could I push some commits into your branch?

@bohde (Contributor, Author) commented Mar 21, 2025

I think I have some ideas about how to handle these "route paths" clearly. Could I push some commits into your branch?

Could you send a PR to my branch, or sketch it out in another commit? This would prevent merge conflicts on my branch

@wxiaoguang (Contributor) commented Mar 21, 2025

sketch it out in another commit

Sorry, don't quite understand what it exactly means ....

Could you send a PR to my branch

We can use the chi router's "RoutePattern" and make the code testable.

@wxiaoguang added the type/feature label (Completely new functionality. Can only be merged if feature freeze is not active.) Mar 21, 2025
@wxiaoguang added this to the 1.24.0 milestone Mar 21, 2025
@bohde (Contributor, Author) commented Apr 7, 2025

update: Close since there is no interest after 2 weeks

Just getting back to this now as I was dealing with other priorities. I could incorporate the above changes, but the near-complete rewrite of my PR makes it difficult to isolate the changes you are asking for, and makes it incompatible with the production implementation we are running. This makes it difficult to test these changes against the live traffic we are seeing to ensure they still perform well.

@bohde (Contributor, Author) commented Apr 7, 2025

I applied the review feedback and synthesized it into something that is easily backportable to 1.23 by avoiding APIs added in main. @wxiaoguang's PR did miss some important endpoints we sometimes see heavy traffic on, since it didn't check every possible delimited portion of the subpath against the string as in my original implementation. However, because the RoutePattern can be pulled directly in this middleware using @wxiaoguang's method, this can all be simplified to the following priority scheme:

| Logged In | Repo Context | Priority |
|-----------|--------------|----------|
| Yes | - | High |
| No | Yes | Low |
| No | No | Default |

This allows key flows such as login and explore to still be prioritized well, with only the repo-specific endpoints likely to be throttled. In my experience, this catches the vast majority of what scrapers are targeting.
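
To make the scheme concrete, a hypothetical sketch of the priority decision is below. The `Priority` type, the `loggedIn` flag, and the `/{username}/{reponame}/` pattern prefix are illustrative assumptions about naming, not the exact code in this PR; the chi call is the RoutePattern helper mentioned earlier in the thread.

```go
package qos

import (
	"net/http"
	"strings"

	"github.com/go-chi/chi/v5"
)

type Priority int

const (
	LowPriority Priority = iota
	DefaultPriority
	HighPriority
)

// requestPriority maps a request onto the table above: logged-in users are
// high priority, anonymous requests to repo routes are low priority, and
// everything else (login, explore, ...) keeps the default priority.
func requestPriority(r *http.Request, loggedIn bool) Priority {
	if loggedIn {
		return HighPriority
	}
	rctx := chi.RouteContext(r.Context())
	if rctx == nil {
		// Not routed through chi; fall back to the default priority.
		return DefaultPriority
	}
	// The matched route pattern, e.g. "/{username}/{reponame}/issues".
	if strings.HasPrefix(rctx.RoutePattern(), "/{username}/{reponame}/") {
		return LowPriority
	}
	return DefaultPriority
}
```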

@bohde requested review from wxiaoguang and a1012112796 April 7, 2025 23:18
@bohde requested a review from a team April 9, 2025 13:38
@wxiaoguang (Contributor) commented Apr 9, 2025

A question in my mind (not blocker):

Should Gitea report a 503 error to real anonymous end users when the instance is being crawled heavily (i.e. the service is over capacity)?

Disclaimer: I don't run a public instance, so maybe public instance site admins (like gitea.com) could help.

@lunny (Member) commented Apr 9, 2025

Alternatively, a configuration item may be introduced to return a 503 status code or redirect users to a custom URL.

@bohde (Contributor, Author) commented Apr 9, 2025

Alternatively, a configuration item may be introduced to return a 503 status code or redirect users to a custom URL.

The fundamental problem here is that if we send a redirect instead of a 503, we now generate more load when the user follows the redirect, instead of shedding load. This can cause feedback loops if this newly added load also generates more redirects.

@bohde (Contributor, Author) commented Apr 9, 2025

Just to validate this, I prototyped the redirect method locally and ran benchmarks using hey, which follows redirects by default. The command I ran was:

```
hey -z 1m -c 20 "http://localhost:3000/rowan/demo/issues?state=open&type=all&labels=&milestone=0&project=0&assignee=0&poster=0&q=fix"
```

The results are below, but the redirect method shows worse performance on p95 latency than just not having QoS turned on, because of the added load redirects introduce. It does however show higher RPS (and mean latency), because rendering a login page is less intensive than the issue search.

| Method | Concurrency | p95 latency | Successful RPS | Requests Dropped |
|--------|-------------|-------------|----------------|------------------|
| QoS Off | 20 | 0.4912s | 50 rps | 0% |
| QoS On (503) | 20 | 0.4058s | 63 rps | 8% |
| QoS On (Redirect) | 20 | 0.8329s | 60 rps | 0% |

@silverwind (Member) commented Apr 9, 2025

I recommend using status code 429 (Too Many Requests) instead of 503 (Service Unavailable) for any rate-limited responses. The intent is more clear with that code and it blames the client, not the server.

For bonus points, one could also send the Retry-After header to indicate when further requests may be accepted.

@bohde (Contributor, Author) commented Apr 9, 2025

I recommend using status code 429 (Too Many Requests) instead of 503 (Service Unavailable) for any rate-limited responses. The intent is more clear with that code and it blames the client, not the server.

This isn't quite the right semantics. 429 Too Many Requests means that "the user has sent too many requests in a given amount of time", which is not necessarily the case here. For example, a user's very first request to the server can be rejected by this middleware. This is not an issue with the client, but with the server, in that it has determined it cannot service the request.

For bonus points, one could also send the Retry-After header to indicate when further requests may be accepted.

This would also apply to 503 Service Unavailable, but any value here is not likely to have a positive effect. In the case of load shedding like this, that field is useful when the clients are cooperative with the server, willingly backing off for that time period. However, crawlers are not cooperative (e.g. they do not respect robots.txt), so it is unlikely they would respect this field.

Hypothetically, an operator of a public instance could implement this field and see how it affects crawlers, but the result would vary from crawler to crawler.
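
For reference, a hypothetical sketch of a 503 rejection that optionally sets Retry-After is below; the `rejectOverloaded` helper and its `retryAfter` parameter are illustrative assumptions, not code from this PR.

```go
package qos

import (
	"net/http"
	"strconv"
	"time"
)

// rejectOverloaded sends a 503 and, optionally, a Retry-After delay in
// seconds (RFC 9110). Cooperative clients may honor the header; abusive
// crawlers will likely ignore it.
func rejectOverloaded(w http.ResponseWriter, retryAfter time.Duration) {
	if retryAfter > 0 {
		w.Header().Set("Retry-After", strconv.Itoa(int(retryAfter.Seconds())))
	}
	http.Error(w, "server is overloaded, try again later", http.StatusServiceUnavailable)
}
```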

@GiteaBot added the lgtm/need 1 label (This PR needs approval from one additional maintainer to be merged) and removed the lgtm/need 2 label Apr 9, 2025
@wxiaoguang (Contributor):

Alternatively, a configuration item may be introduced to return a 503 status code or redirect users to a custom URL.

The fundamental problem here is that if we send a redirect instead of a 503, we now generate more load when the user follows the redirect, instead of shedding load. This can cause feedback loops if this newly added load also generates more redirects.

That's not true according to the feedback on the existing "REQUIRE_SIGNIN_VIEW=expensive" config option. All users said that there is no more CPU load.

@wxiaoguang (Contributor) commented Apr 9, 2025

The results are below, but the redirect method shows worse performance on p95 latency than just not having QoS turned on, because of the added load redirects introduce. It does however show higher RPS (and mean latency), because rendering a login page is less intensive than the issue search.

This test result doesn't reflect real user scenarios, so it can't be used to prove that "redirection can't be used".

There are just two requests for "QoS On (Redirect)" in your test, so you see the p95 latency become higher, but the "/user/login" page is quite fast and won't be affected by crawlers. And when the crawlers see the 503 or the user login page, they would just stop that traversal because there are no more valid targets for them.

In the real world, when your server load is high, more than 99% of the CPU is spent rendering the "expensive" pages; responding with 503 or with "/user/login" makes almost no difference.

@bohde (Contributor, Author) commented Apr 10, 2025

That's not true according to the feedback on the existing "REQUIRE_SIGNIN_VIEW=expensive" config option. All users said that there is no more CPU load.

The context for this statement is quite different from mine. "REQUIRE_SIGNIN_VIEW=expensive" makes all logged-out requests to expensive pages redirect to a login page. This changes the work necessary to render one of these pages into checking whether the user is logged in and then serving the redirect. As long as this work is less than the work it would otherwise take to render the page directly, this is a net win for every request. For example, if a server is serving 10 RPS to an endpoint, an expensive page takes 500ms of CPU time to render, and the login page takes 100ms to render, this approach saves 10 * (500ms - 100ms) of work every second, or 4 CPU cores of work in savings. This is potentially a huge win for users, and directly aligns with the feedback from those users.

This context does not apply to how this algorithm works though. This algorithm specifically tries to serve as much traffic as possible, while maintaining latency targets. To use the numbers above again, if we have 10 RPS to a 500ms endpoint, but only 2 cores, this algorithm would converge on 4 RPS of goodput, while rejecting 6 RPS.

Adding the redirect changes that math. Still using those same numbers, the 6 RPS that were formerly rejected now each generate another 100ms of work, at the default priority, which is higher than the low priority of the request that was initially turned into a redirect. That is another 600ms of prioritized CPU time per second, leaving only 1.4 CPU-seconds per second to handle expensive requests, or slightly less than 3 RPS of goodput. This causes a feedback loop as well, where the reduced goodput then causes more redirects, but it will eventually converge to a lower amount of goodput on the server than if you served a 503.

I do feel that this is turning into bikeshedding. We've been running this algorithm in production for a couple of months now, and I can confirm that it significantly helps deal with crawlers that do not respect our robots.txt, while still allowing crawlers that do to use our site. Suggestions like redirecting to login are not something I've tested in a production environment, so I can't recommend that approach. There are certainly other steps that can be done to address crawlers, and some of them such as "REQUIRE_SIGNIN_VIEW=expensive" are complementary approaches.

@wxiaoguang (Contributor):

I do feel that this is turning into bikeshedding.

I do not see any bikeshedding; I just suggested making the design complete and flexible. I didn't say "no" to your solution.

We've been running this algorithm in production for a couple of months now, and I can confirm that it significantly helps deal with crawlers

I agree, but it never answers the question above: "why should a real anonymous end user see a 503 error when the site is being heavily crawled?" My suggestion is to let the site admin decide: you could still use 503 if you like, and others could use redirection if they like.

@wxiaoguang (Contributor):

To be clear: if you have no objection, we can merge your solution as-is, and I will complete it and make it configurable so that site admins can choose the behavior they need.

@bohde (Contributor, Author) commented Apr 10, 2025

To be clear: if you have no objection, we can merge your solution as-is, and I will complete it and make it configurable so that site admins can choose the behavior they need.

I think this solution is complete as is, and adding a redirect will have complex and unintended consequences to the algorithm that I detailed in #33951 (comment). In previous production systems I've worked on, including those that used the library this is based on, I've seen slight tweaks to behavior such as the proposed redirect behavior cause feedback loops, which in turn cause worse outages than if the solution hadn't been in place. I always prefer to simplify the behavior in order to make it easy to reason about, and only adjust based upon production experience.

@wxiaoguang (Contributor) commented Apr 10, 2025

I do not know why you object to letting the site admin choose the behavior they want.

But actually I am sure:

  1. You could still use the 503 behavior (blank page) by my proposal.
  2. You haven't tried the user-redirection behavior by my proposal in Gitea. I am pretty sure the result should be the same as your 503 behavior (CPU load). You can try it in your production to see if I was wrong. If I am wrong, I am happy to accept the conclusion and learn why.
    • this user login page could also respond with 503 so that crawlers could still use the site, the same as with a blank 503 page.
  3. You haven't tried the Proof-of-Work or rate-limit-token behavior by my proposal. I am pretty sure it will provide the best user experience among these behaviors.
    • the response code could still be 503, so that crawlers could still use the site, while real anonymous users won't be affected too much.

@bohde (Contributor, Author) commented Apr 10, 2025

You haven't tried the user-redirection behavior by my proposal in Gitea. I am pretty sure the result should be the same as your 503 behavior (CPU load). You can try it in your production to see if I was wrong. If I am wrong, I am happy to accept the conclusion and learn why.

I haven't seen your proposed code, but I have prototyped a version of it by serving a redirect to the login page and testing the behavior in local benchmarks in #33951 (comment), and it is worse than the 503 method, because it now needs to also handle the login page request. These results are in line with my intuition of how it works in #33951 (comment). Ultimately, I can only ever see this introducing a regression in both performance and latency in a production environment, because it strictly adds work that the server needs to do.

If I understand you correctly, you think the added redirect will provide a nicer user experience, while the additional load of the login page will not meaningfully cause a regression. Because this algorithm only ever rejects traffic when the server is already overloaded, any load added at this point, even if it is small, makes the overload worse and causes more traffic to be rejected. This is a similar, but slightly different, failure mode to a retry storm.

You haven't tried the Proof-of-Work or rate-limit-token behavior by my proposal. I am pretty sure it will provide the best user experience among these behaviors.

I'd certainly look at these, but I don't think the goals of those are necessarily relevant to this PR. My goal in this PR is to keep friction low for anonymous users and crawlers that respect our robots.txt, which is why it only drops traffic if it needs to.

@wxiaoguang (Contributor):

I haven't seen your proposed code,

If you have no objection, I will propose one.

but I have prototyped a version of it by serving a redirect to the login page and testing the behavior in local benchmarks in #33951 (comment), and it is worse than the 503 method, because it now needs to also handle the login page request.

I think I have explained above: the test is just a test; production results speak.

I'd certainly look at these, but I don't think the goals of those are necessarily relevant to this PR. My goal in this PR is to keep friction low for anonymous users and crawlers that respect our robots.txt, which is why it only drops traffic if it needs to.

That's what I suggested above: I will complete it and make it configurable so that site admins can choose the behavior they need.


I think I could propose a "Proof-of-Work or rate-limit-token" solution and you could try it in your production to see whether it is better.

@bohde (Contributor, Author) commented Apr 10, 2025

If you have no objection, I will propose one.

If you're fine merging this one, I would look at a follow up.

I think I could propose a "Proof-of-Work or rate-limit-token" solution and you could try it in your production to see whether it is better.

I would however need to see the proposed code before I could determine if I could test this in a production environment. Either way, I'm happy to take a look.

@wxiaoguang (Contributor):

If you have no objection, I will propose one.

If you're fine merging this one, I would look at a follow up.

Just like I said: I have no objection to this. Public-instance site admin maintainers could help, maybe: @techknowlogick @lunny

@wxiaoguang (Contributor):

I would however need to see the proposed code before I could determine if I could test this in a production environment. Either way, I'm happy to take a look.

OK, I will propose one (maybe replacing this one while achieving the same or better result).

@bohde (Contributor, Author) commented Apr 10, 2025

I would however need to see the proposed code before I could determine if I could test this in a production environment. Either way, I'm happy to take a look.

OK, I will propose one (maybe replacing this one while achieving the same or better result).

I would need to see it, of course, but I don't think the goals described in #33966 (comment) are the same as this one's. I think they may be complementary, but in our case we have public repos that we want both indexed by search engines and accessible to users who are not logged in, and we want to keep friction as low as possible for those cases.

I would prefer to merge this as-is, and address any others in a follow-up.

@wxiaoguang (Contributor) commented Apr 10, 2025

-> Rate-limit anonymous accesses to expensive pages when the server is overloaded #34167

I think this one covers all the cases (PS: benchmarks and local tests don't reflect real crawler behavior, so I think production results speak, but I don't have a public instance).

@bohde (Contributor, Author) commented Apr 10, 2025

-> Rate-limit anonymous accesses to expensive pages when the server is overloaded #34167

I think this one covers all the cases.

The algorithm there is clearly a derivative of the one I am proposing here, but it rejects more requests in practice since it does not enqueue requests. It also takes part of this PR, and introduces it in a new one, while stripping my authorship. This is evident in details such as the default config setting, which is copy pasted from this PR, but really only makes sense in the context of this algorithm.

I've done a lot of legwork on this PR, including testing, benchmarking and refining it on production crawler traffic. I'm always open to feedback and ideas on how to improve an implementation, but have found this process very frustrating, from the initial dismissive response, the ongoing bikeshedding, and now the rewrite while stripping my authorship.
