Implement ArchiveVersionDownloads background job #8596


Merged
merged 1 commit into rust-lang:main on May 14, 2024

Conversation

Turbo87
Member

@Turbo87 Turbo87 commented May 6, 2024

This PR implements a solution for #3479. The new background job can be used to export all version download data to S3 and then remove it from the database. This should allow us to shrink the database considerably, which hopefully might have a positive effect on our database performance.

@Turbo87 Turbo87 added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-backend ⚙️ labels May 6, 2024
@Turbo87 Turbo87 requested a review from a team May 6, 2024 17:38
@Turbo87 Turbo87 force-pushed the archive-version-downloads branch from eb511c5 to 1fe523a Compare May 6, 2024 17:43

codecov bot commented May 6, 2024

Codecov Report

Attention: Patch coverage is 84.11215%, with 51 lines in your changes missing coverage. Please review.

Project coverage is 88.43%. Comparing base (180c7b2) to head (3e730b9).

Files Patch % Lines
src/worker/jobs/archive_version_downloads.rs 85.89% 44 Missing ⚠️
src/admin/enqueue_job.rs 0.00% 5 Missing ⚠️
src/bin/background-worker.rs 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8596      +/-   ##
==========================================
- Coverage   88.50%   88.43%   -0.07%     
==========================================
  Files         274      275       +1     
  Lines       26938    27257     +319     
==========================================
+ Hits        23841    24105     +264     
- Misses       3097     3152      +55     


@eth3lbert
Contributor

Do you think generating a series of dates from date_start to current_date - 90 and then looping through them to export as CSV could be a possible solution? This approach came to mind as it's relatively simple. It would only require keeping track of date_start and wouldn't necessitate splitting the CSV file.

@Turbo87
Member Author

Turbo87 commented May 7, 2024

Do you think generating a series of dates from date_start to current_date - 90 and then looping through them to export as CSV could be a possible solution?

It is a possible solution, but a) figuring out that date_start value and b) exporting per date are quite slow queries, and it seemed faster to just export a single file and perform the post-processing on the Rust side instead. Admittedly, the post-processing step isn't particularly fast either, but from the limited amount of testing I've performed, it seemed like the one-export-query-plus-post-processing approach would be faster than the date_start query plus N export queries approach.
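The single-export-plus-split approach described above could be sketched roughly as follows. This is a hypothetical, standard-library-only illustration, not the PR's actual implementation (which uses the csv crate and writes per-date files to disk); the `date,version_id,downloads` column layout is an assumption.

```rust
use std::collections::BTreeMap;

/// Group the rows of a single exported CSV (assumed layout:
/// date,version_id,downloads) by their date column, so each group
/// can later be written out as its own per-day file.
fn split_by_date(csv: &str) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for line in csv.lines().skip(1) {
        // skip the header row; key each remaining row by its date column
        if let Some((date, _rest)) = line.split_once(',') {
            groups
                .entry(date.to_string())
                .or_default()
                .push(line.to_string());
        }
    }
    groups
}

fn main() {
    let csv = "date,version_id,downloads\n\
               2024-01-01,1,10\n\
               2024-01-02,1,7\n\
               2024-01-01,2,3\n";
    let groups = split_by_date(csv);
    assert_eq!(groups.len(), 2);
    assert_eq!(groups["2024-01-01"].len(), 2);
    println!("{} dates", groups.len());
}
```

The point of the design is that the database is queried exactly once; the grouping work moves into the worker process, where it is a linear scan over the exported rows.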

@Turbo87 Turbo87 force-pushed the archive-version-downloads branch from 1fe523a to bdd4acf Compare May 7, 2024 13:38
@eth3lbert
Contributor

Thanks for breaking it down!

it is a possible solution, but a) figuring out that date_start value

The simplest solution would be to start from the date the crates.io service went live if there is no date_start recorded. The date_start should then be updated and recorded every time the job is processed.

b) exporting per date are quite slow queries, and it seemed faster to just export a single file and perform the post-processing on the Rust side instead. Admittedly, the post-processing step isn't particularly fast either, but from the limited amount of testing I've performed, it seemed like the one-export-query-plus-post-processing approach would be faster than the date_start query plus N export queries approach.

I can't definitively conclude the cause of the slowness without knowing the query's execution plan. However, I suspect it's because the query isn't utilizing the index.

I'm unsure whether using the following query structure would be beneficial:

where version_id in (select id from versions) and date = 'some_date'

If performance remains poor when filtering with a single date (although this should improve once unneeded records are archived), then the proposed CSV-splitting approach makes perfect sense to me.
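The per-date alternative being suggested might look roughly like this in SQL. This is a hypothetical sketch; the table and column names (`version_downloads`, `version_id`, `date`, `downloads`) are assumed from context, and the series start date is a placeholder.

```sql
-- One export query per date, using the proposed filter shape:
SELECT version_id, downloads
FROM version_downloads
WHERE version_id IN (SELECT id FROM versions)
  AND date = '2024-01-01';

-- The candidate dates could come from a generated series,
-- from date_start up to 90 days before today:
SELECT d::date
FROM generate_series('2017-01-01'::date, CURRENT_DATE - 90, '1 day') AS d;
```

Whether the single-date filter is fast depends on an index covering the `date` column, which is exactly the open question about the execution plan raised above.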

Contributor

@LawnGnome LawnGnome left a comment


OK, so as implemented, this LGTM.

Did you test shelling out to psql against running the query and writing the CSV files directly from within the background worker? To me that feels like the simpler approach — and would give us some potential benefits around being able to order and/or chunk the dates that we're extracting, which would limit how many open files we'd need — which makes me suspect that there's a reason for \copy that I'm not seeing.

Approving to unblock this, but I'm curious about the above. (And have probably just missed some context while I was in the land of the 🦘!)

let parent_path = path.parent().ok_or_else(|| anyhow!("Invalid path"))?;

let mut reader = csv::Reader::from_path(path)?;
let mut writers: BTreeMap<Vec<u8>, _> = BTreeMap::new();
Contributor


Looks like standard Heroku dynos have a 10k open file descriptor limit.

This shouldn't be an issue for many years, but having one file descriptor open per historical day plus whatever other work is happening in the background worker is a potential footgun.

Member Author


It would be an issue if the historical days stayed in the database, but once we have exported the historical data to S3 we will no longer have this problem, because at most we would be exporting a couple of days, and most likely only a single day.

@Turbo87
Copy link
Member Author

Turbo87 commented May 14, 2024

Did you test shelling out to psql against running the query and writing the CSV files directly from within the background worker?

Yeah, I tried a native COPY command via diesel::sql_query(), but that complained about not being able to access the file, and the copy functionality in diesel itself is not yet released, AFAIK. Since we were already using something roughly similar for the database dump, I figured this should work well enough.

One potential alternative would be COPY ... TO STDOUT and then handling the stdout stream of the child process, but I'm not convinced that would be less complex than the current solution 😅
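For reference, the COPY ... TO STDOUT alternative mentioned above would mean streaming a child process's stdout from the worker. A minimal sketch of that pattern is below; it is hypothetical, and uses `echo` as a stand-in for the actual `psql` invocation so the example can run without a database.

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

/// Spawn a child process with a piped stdout and collect its output
/// line by line. A real implementation would invoke `psql` with a
/// `\copy (...) TO STDOUT CSV` statement; `echo` stands in here.
fn stream_stdout(cmd: &str, args: &[&str]) -> std::io::Result<Vec<String>> {
    let mut child = Command::new(cmd)
        .args(args)
        .stdout(Stdio::piped())
        .spawn()?;
    let stdout = child.stdout.take().expect("stdout was piped");
    // Reading incrementally keeps memory bounded even for large exports.
    let lines: Vec<String> = BufReader::new(stdout)
        .lines()
        .collect::<Result<_, _>>()?;
    child.wait()?;
    Ok(lines)
}

fn main() -> std::io::Result<()> {
    // Stand-in for: psql $DATABASE_URL -c "\copy (...) TO STDOUT CSV"
    let lines = stream_stdout("echo", &["2024-01-01,1,10"])?;
    assert_eq!(lines, vec!["2024-01-01,1,10"]);
    println!("read {} lines", lines.len());
    Ok(())
}
```

The trade-off discussed in the thread is real: this avoids the temporary file that `\copy ... TO <file>` needs, but adds child-process and stream-handling code in exchange.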

@Turbo87 Turbo87 force-pushed the archive-version-downloads branch from bdd4acf to 9bfdca3 Compare May 14, 2024 12:37
@Turbo87 Turbo87 force-pushed the archive-version-downloads branch from 9bfdca3 to 3e730b9 Compare May 14, 2024 12:39
@Turbo87 Turbo87 enabled auto-merge (squash) May 14, 2024 12:41
@Turbo87 Turbo87 merged commit 78e39c5 into rust-lang:main May 14, 2024
9 checks passed
@Turbo87 Turbo87 deleted the archive-version-downloads branch May 14, 2024 12:49