fix: osv data source memory consumption #4956

Open: 21 commits to merge into main

fil1n (Contributor) commented Mar 19, 2025

Fixes #4710.

Currently, the get_ecosystem_incremental method uses gsutil to fetch information about new files in an ecosystem. The method runs concurrently for every item in the ecosystem list (around 426 items) while other data sources are also updating, and the combined memory use triggered the OOM killer. This PR therefore replaces gsutil with the google-cloud-storage library, which allows iterating over the data without fetching it all at once.
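A minimal sketch of the streaming idea described above, not the code in this PR. It relies on `Client.list_blobs` from google-cloud-storage returning a lazy, paginated iterator; the helper names `iter_new_blob_names` and `make_anonymous_client`, and the `since` cutoff, are hypothetical and only illustrate the pattern.

```python
from datetime import datetime
from typing import Iterator, Optional


def iter_new_blob_names(
    client,
    bucket_name: str,
    prefix: str,
    since: Optional[datetime] = None,
) -> Iterator[str]:
    """Yield names of blobs under `prefix` that were updated after `since`.

    `client` is expected to behave like google.cloud.storage.Client.
    Names are yielded one at a time, so the full listing is never
    materialised as a single list in memory.
    """
    # list_blobs fetches result pages on demand while we iterate.
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if since is None or (blob.updated is not None and blob.updated > since):
            yield blob.name


def make_anonymous_client():
    """Build an unauthenticated client for a public bucket.

    The import is deferred so this sketch can be loaded even where
    google-cloud-storage is not installed.
    """
    from google.cloud import storage

    return storage.Client.create_anonymous_client()
```

A caller would loop over `iter_new_blob_names(make_anonymous_client(), bucket, "PyPI/", since=last_update)` and handle each name as it arrives. The bucket name, the per-ecosystem prefix layout, and the timezone-aware `last_update` value are assumptions to be checked against the real data source.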

fil1n (Contributor Author) commented Mar 23, 2025

@terriko I believe all major memory fixes for the OSV data source are now implemented in this PR. cve-bin-tool now peaks at around 3 GB during database updates with all data sources enabled. Additionally, I think we could improve memory usage in the other data sources that read content from disk by using generators rather than keeping the content of all files in RAM. Those sources would not need a full rewrite the way OSV did, so small changes could potentially reduce memory consumption even further.
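To illustrate the generator suggestion, here is a small sketch that assumes a data source keeps one JSON document per file on disk. The function names `iter_json_records` and `count_with_field` are hypothetical and do not come from cve-bin-tool.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_json_records(directory: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON document per file under `directory`.

    Each file is read and parsed only when the consumer asks for the
    next item, instead of loading every file into a list up front.
    """
    for path in sorted(directory.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def count_with_field(directory: Path, field: str) -> int:
    """Example consumer: count records that contain `field`.

    Only the record currently being inspected has to stay in memory.
    """
    return sum(1 for record in iter_json_records(directory) if field in record)
```

With this shape, peak memory is bounded by the largest single file rather than by the total size of the directory, as long as the consumer does not collect the records into a list itself.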

terriko (Contributor) commented Mar 24, 2025

This is very exciting, thank you for working on it. Getting rid of gsutil entirely would be amazing; it has a really large effect on our dependencies and causes me an ongoing hassle as a result. Of course, I have no idea whether the new library is just as bad.

I've enabled tests to run, but quick heads up that because we're changing dependencies, I need to run licenses through legal before this can be merged, so it's likely to languish for a while until I get around to doing that paperwork.

terriko added the dependencies label (Pull requests that update a dependency file) Mar 24, 2025
Labels
dependencies Pull requests that update a dependency file
Development

Successfully merging this pull request may close these issues.

bug: out of memory during OSV database load
2 participants