fix: osv data source memory consumption #4956
This reverts commit 28eb7ac.
@terriko I believe all the major memory fixes for the OSV data source are now implemented in this PR. With all data sources enabled, cve-bin-tool now consumes around 3 GB at peak during database updates. Additionally, I think we could improve memory performance in the other data sources that read content from disk by using generators rather than keeping the contents of all files in RAM. There's no need to fully rewrite sources like OSV, so small changes could potentially reduce memory consumption even further.
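To illustrate the generator idea mentioned above, here is a minimal sketch. It is not taken from the cve-bin-tool code base: the function names `iter_json_records` and `count_records_with_id` and the one-JSON-file-per-record layout are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_json_records(directory: Path) -> Iterator[dict]:
    """Yield one parsed JSON file at a time instead of building a list.

    Hypothetical helper: assumes each *.json file holds a single JSON object.
    """
    for path in sorted(directory.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def count_records_with_id(directory: Path) -> int:
    """Consume the generator lazily; only one parsed file is alive at a time."""
    return sum(1 for record in iter_json_records(directory) if "id" in record)
```

Because the caller pulls records one by one, peak memory is bounded by the largest single file rather than by the total size of the directory.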
This is very exciting, thank you for working on it. Getting rid of gsutil entirely would be amazing; it has a really large effect on our dependencies and causes me an ongoing hassle as a result. Of course, I have no idea if the new library is just as bad. I've enabled tests to run, but a quick heads-up: because we're changing dependencies, I need to run the licenses through legal before this can be merged, so it's likely to languish for a while until I get around to doing that paperwork.
Fixes #4710.
Currently, the get_ecosystem_incremental method uses gsutil to fetch information about new files in an ecosystem. However, this method is executed concurrently for each item in the ecosystem list (around 426 items) while the other data sources are also updating, and the combined memory use triggered the OOM killer. This PR therefore replaces gsutil with the google-cloud-storage library, which allows iterating over the listing without fetching it in full.
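A rough sketch of what iterating with google-cloud-storage can look like is shown below. This is not the code from this PR: `iter_updated_blob_names` and `make_anonymous_client` are hypothetical names, and the `<ecosystem>/` prefix layout and the anonymous client are assumptions. The only library calls used are `Client.create_anonymous_client()` and `Client.list_blobs()`, whose result is a lazy, paginated iterator.

```python
from datetime import datetime, timezone
from typing import Iterator


def iter_updated_blob_names(
    client, bucket_name: str, ecosystem: str, since: datetime
) -> Iterator[str]:
    """Yield names of blobs under '<ecosystem>/' updated after `since`.

    `client` is expected to behave like google.cloud.storage.Client.
    list_blobs() pages through results lazily, so the full listing is
    never held in memory at once.
    """
    for blob in client.list_blobs(bucket_name, prefix=f"{ecosystem}/"):
        # blob.updated is a timezone-aware datetime (or None if unknown).
        if blob.updated is not None and blob.updated > since:
            yield blob.name


def make_anonymous_client():
    """Assumption: the bucket is public, so no credentials are needed."""
    from google.cloud import storage  # imported lazily; optional dependency

    return storage.Client.create_anonymous_client()
```

A caller would pass a timezone-aware cutoff such as `datetime(2024, 1, 1, tzinfo=timezone.utc)` and process each yielded name as it arrives, rather than collecting the whole listing first.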