OOM caused by numerous crawls #33520

Open
H0llyW00dzZ opened this issue Feb 6, 2025 · 22 comments
Labels
issue/needs-feedback For bugs, we need more details. For features, the feature must be described in more detail

Comments

@H0llyW00dzZ

H0llyW00dzZ commented Feb 6, 2025

Description

In the latest versions, 1.23.2 and 1.23.3, memory leaks occur. (Update: see below; it is not a memory leak and not a regression.)

These OOMs are caused by numerous crawlers, such as those operated by Facebook Inc. (Meta), Amazon (AWS), and other entities that fetch data excessively for AI training.

My Gitea self-hosted configuration (a rough app.ini sketch follows the list):

  • Sessions using files
  • Cache using Redis with a TTL of 5 hours, and the last commit cache is 10K
  • No SSH
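
For reference, a minimal app.ini sketch of the setup described above. The section and key names are standard Gitea options, but the Redis address is a placeholder, and the assumption is that "10K" refers to the last-commit cache's COMMITS_COUNT threshold:

  [session]
  PROVIDER = file

  [cache]
  ADAPTER = redis
  ; placeholder Redis address
  HOST = redis://redis.gitea.svc:6379/0
  ITEM_TTL = 5h

  [cache.last_commit]
  ; assuming "10K" refers to this threshold
  COMMITS_COUNT = 10000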

Screenshots

Image
Image

The logs show how these companies use crawlers for their AI training.

Image

Essentially, when there are many fetch requests, memory consumption spikes and the pod crashes from excessive memory use (it gets OOM-killed by Kubernetes).

@wxiaoguang
Contributor

Could you download a diagnosis report from "admin panel -> monitor -> trace" when the memory goes high?

The report contains a heap dump (no sensitive data) and could help locate the problem.

@H0llyW00dzZ
Author

> Could you download a diagnosis report from "admin panel -> monitor -> trace" when the memory goes high?
> The report contains a heap dump (no sensitive data) and could help locate the problem.

Here is the system notice:

Image

This is the system status; it is inconsistent, as I mentioned earlier in #33311.

Image

@wxiaoguang
Contributor

Could you download a diagnosis report from "admin panel -> monitor -> trace" when the memory goes high?

The report contains a heap dump (no sensitive data) and could help locate the problem.

@wxiaoguang
Contributor

If the memory usage is not related to the Gitea process, then maybe you need to figure out which process consumes that memory: a git process, or some other command?

@H0llyW00dzZ
Author

> Could you download a diagnosis report from "admin panel -> monitor -> trace" when the memory goes high?
> The report contains a heap dump (no sensitive data) and could help locate the problem.

I can't capture the memory usage via the trace admin panel when it spikes, because every time memory consumption goes high (e.g., 7 GiB), the pod is OOM-killed by Kubernetes.
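
A possible workaround for capturing a profile before the OOM kill (a sketch, not something done in this thread): enable Gitea's built-in pprof endpoint with ENABLE_PPROF = true in the [server] section, then snapshot the heap from outside the pod in a loop so the last snapshot survives the kill. The pod name, the pprof listen address (127.0.0.1:6060), and the availability of wget inside the container are assumptions:

  # snapshot the heap every 60 seconds from outside the pod
  while true; do
    kubectl exec -n gitea gitea-5cb7dff998-xwb5r -- \
      wget -qO- http://127.0.0.1:6060/debug/pprof/heap > heap-$(date +%s).pprof
    sleep 60
  done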

@wxiaoguang
Contributor

> Could you download a diagnosis report from "admin panel -> monitor -> trace" when the memory goes high?
> The report contains a heap dump (no sensitive data) and could help locate the problem.

> I can't capture the memory usage via the trace admin panel when it spikes, because every time memory consumption goes high (e.g., 7 GiB), the pod is OOM-killed by Kubernetes.

Is it clear which process consumes that much memory? The Gitea web server process itself, or other processes like "ssh", "git", or "gitea serve/hook"?
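
One quick way to answer this from outside (a sketch; assumes the busybox tools shipped in the stock gitea container image) is to watch per-process memory inside the pod while it climbs:

  # batch-mode top inside the pod: shows whether gitea, git, or ssh processes hold the memory
  kubectl exec -n gitea gitea-5cb7dff998-xwb5r -- top -b -n 1 | head -n 20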

@wxiaoguang
Contributor

wxiaoguang commented Feb 6, 2025

> The logs show how these companies use crawlers for their AI training.
> Essentially, when there are many fetch requests, memory consumption spikes and the pod crashes from excessive memory use (it gets OOM-killed by Kubernetes).

If the OOM is caused by crawlers, then it isn't a regression: each request consumes memory, and large repos/files consume more, so a lot of requests do consume a lot of memory and can lead to OOM. Maybe you could try to stop the crawlers and/or require sign-in for your instance.
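
For the "stop the crawls" option, a minimal robots.txt is the usual first step, although many AI crawlers ignore it; the user-agent names below are examples, and the serving path (custom/public/robots.txt on recent Gitea versions) is an assumption:

  User-agent: meta-externalagent
  User-agent: facebookexternalhit
  User-agent: Amazonbot
  Disallow: /

  User-agent: *
  Disallow: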

So I think we need to make the problem clearer.

@wxiaoguang changed the title from "Memory Leaks in Versions 1.23.2 and 1.23.3" to "OOM caused by numerous crawls" on Feb 6, 2025
@H0llyW00dzZ
Author

H0llyW00dzZ commented Feb 6, 2025

> Could you download a diagnosis report from "admin panel -> monitor -> trace" when the memory goes high?
> The report contains a heap dump (no sensitive data) and could help locate the problem.

> I can't capture the memory usage via the trace admin panel when it spikes, because every time memory consumption goes high (e.g., 7 GiB), the pod is OOM-killed by Kubernetes.

> Is it clear which process consumes that much memory? The Gitea web server process itself, or other processes like "ssh", "git", or "gitea serve/hook"?

Most likely, it's from Git because the stack trace shows this:

Image

Image

Image

When there are many requests, such as GET requests from crawlers viewing repositories, memory consumption goes up and the pod crashes after being OOM-killed by Kubernetes.

@H0llyW00dzZ
Author

Also, right now I've rolled back to version 1.23.1 and reduced the last-commit cache from 10K to 5K in the app.ini configuration. Let's see if it still crashes.

@wxiaoguang
Contributor

TBH, I do not see a related change between 1.23.1 and 1.23.3:

v1.23.1...v1.23.3

@H0llyW00dzZ
Author

> TBH, I do not see a related change between 1.23.1 and 1.23.3:
> v1.23.1...v1.23.3

Well, it worked fine for me previously, with an uptime of over a month without crashing from high memory consumption.

And now, after rolling back, it still crashes.

h0llyw00dzz@ubuntu-pro:~$ kubectl get pods -n gitea
NAME                     READY   STATUS    RESTARTS      AGE
gitea-5cb7dff998-xwb5r   1/1     Running   1 (40s ago)   10m

h0llyw00dzz@ubuntu-pro:~$ kubectl describe pods -n gitea
Containers:
  gitea:
    Container ID:   containerd://866d173132606a07e7937e7dfb430533cf1e5a8ad515044e496486416f6a485c
    Image:          gitea/gitea:1.23.1
    Image ID:       docker.io/gitea/gitea@sha256:c3be67d5c31694f8c27e5f3ab87630cceadf05abb795ab0ed70ba14b5edfc29c
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 06 Feb 2025 18:30:03 +0700
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 06 Feb 2025 18:20:10 +0700
      Finished:     Thu, 06 Feb 2025 18:30:01 +0700

@wxiaoguang
Contributor

Well, as I said above: it can't be a regression, and it can't be related to the new version.

There are just more crawlers now. If you do not have enough resources to handle the crawls, maybe you need to block them.

@H0llyW00dzZ
Author

> Well, as I said above: it can't be a regression, and it can't be related to the new version.
> There are just more crawlers now. If you do not have enough resources to handle the crawls, maybe you need to block them.

For now, I've enabled REQUIRE_SIGNIN_VIEW to block the crawlers used by companies like Facebook (Meta) and Amazon (AWS) for training their AI. It seems they are likely abusing the crawlers for AI purposes.

Blocking these crawlers by IP is ineffective because their IPs frequently change.
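
For reference, this switch lives in the [service] section of app.ini:

  [service]
  REQUIRE_SIGNIN_VIEW = true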

@H0llyW00dzZ
Author

@wxiaoguang

The problem was solved by blocking the ASNs likely used for abusive AI training (e.g., Facebook Inc. (Meta), Amazon (AWS)). Now only crawlers from Google, used for indexing in its search engine, are allowed through Kubernetes Ingress-Nginx. However, I believe it would be beneficial to expand the admin panel with features to block crawlers by IP, User-Agent, and ASN. This would help prevent the high memory consumption that can cause crashes.
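
A sketch of what such blocking can look like at the Ingress-Nginx layer, rejecting requests by User-Agent via the server-snippet annotation. The user-agent list, host, and service name are examples, snippet annotations must be enabled in the controller, and true ASN blocking is usually done one layer up (firewall or CDN), since plain nginx only sees client IPs:

  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: gitea
    namespace: gitea
    annotations:
      nginx.ingress.kubernetes.io/server-snippet: |
        # example AI-crawler User-Agent block list
        if ($http_user_agent ~* "(meta-externalagent|facebookexternalhit|Amazonbot|GPTBot|ClaudeBot)") {
          return 403;
        }
  spec:
    ingressClassName: nginx
    rules:
      - host: git.example.com
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: gitea-http
                  port:
                    number: 3000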

@H0llyW00dzZ
Author

Proof that blocking the bad crawlers used by Facebook Inc. (Meta) and Amazon (AWS) for AI training has effectively solved the memory usage issue; the instance was previously being crawled excessively for profit.

Image

Note

Memory usage has returned to normal, even with legitimate crawlers such as Google Search and other SEO-related bots still allowed, unlike the abusive AI-training crawlers from large companies such as Facebook Inc. (Meta) and Amazon (AWS).

@H0llyW00dzZ
Author

H0llyW00dzZ commented Feb 15, 2025

@wxiaoguang I've resolved this problem by increasing the Redis cache pool size to 500 and switching the session storage from files to Redis, using the same pool size of 500. This results in a total pool size of 1000.

The Stats:

Redis:
Image

Pods:
Image
Image
Image

However, this is only a temporary workaround; without Redis, memory usage still grows to excessive levels.
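
A sketch of the corresponding app.ini changes; the pool_size and idle_timeout parameters follow Gitea's documented Redis connection-string format, and the Redis address is a placeholder (older releases configure the session provider with a key=value string instead of a URI):

  [cache]
  ADAPTER = redis
  HOST = redis://redis.gitea.svc:6379/0?pool_size=500&idle_timeout=180s

  [session]
  PROVIDER = redis
  PROVIDER_CONFIG = redis://redis.gitea.svc:6379/0?pool_size=500&idle_timeout=180s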

@wxiaoguang
Contributor

In 1.23.7, we have this:

Add a config option to block "expensive" pages for anonymous users (#34024) (#34071)

@wxiaoguang added the issue/needs-feedback label on Apr 9, 2025
@H0llyW00dzZ
Author

H0llyW00dzZ commented Apr 9, 2025

> In 1.23.7, we have this:
> Add a config option to block "expensive" pages for anonymous users (#34024) (#34071)

@wxiaoguang, I've been trying that configuration option, but it seems similar to REQUIRE_SIGNIN_VIEW = true, which may not be ideal for open-source repositories. I think it would be more effective to implement a rate limiter based on IP addresses, user agents, or both for the areas that consume a lot of memory (e.g., example.com/repo/commit/sha1commit). This could reduce resource usage, especially since many AI crawlers use the same IPs and user agents when crawling a site.
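
Until something like that exists in Gitea itself, a rough approximation at the Ingress-Nginx layer is per-client rate limiting via the documented limit annotations, added to the same Ingress object sketched earlier (the numbers are placeholders):

  metadata:
    annotations:
      # requests per second allowed per client IP
      nginx.ingress.kubernetes.io/limit-rps: "5"
      # maximum concurrent connections per client IP
      nginx.ingress.kubernetes.io/limit-connections: "10"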

@wxiaoguang removed the issue/needs-feedback label on Apr 9, 2025
@wxiaoguang
Contributor

> which may not be ideal for open-source repositories.

For "open source public site", my proposal is https://github.com/go-gitea/gitea/pull/33951#discussion_r2032324964

I don't run a public site, so I can't comment much on this problem.

@H0llyW00dzZ
Author

> which may not be ideal for open-source repositories.
>
> For "open source public site", my proposal is https://github.com/go-gitea/gitea/pull/33951#discussion_r2032324964
>
> I don't run a public site, so I can't comment much on this problem.

@wxiaoguang, I run a public site primarily for mirroring repositories. Also, the implementation of #33951 could indeed help reduce resource usage; it's quite similar to a rate limiter, which would help manage resource consumption effectively.

@wxiaoguang
Contributor

#33951 has been merged, does it work for your case?

@wxiaoguang added the issue/needs-feedback label on Apr 20, 2025
@H0llyW00dzZ
Author

> #33951 has been merged, does it work for your case?

@wxiaoguang I haven't tried it yet. My git site is running Gitea 1.23.7, not the nightly build, as I prefer long-term stability since it runs on k8s.
