Implement ArchiveVersionDownloads background job #8596


Merged
merged 1 commit into rust-lang:main on May 14, 2024

Conversation

Turbo87
Member

@Turbo87 Turbo87 commented May 6, 2024

This PR implements a solution for #3479. The new background job can be used to export all version download data to S3 and then remove it from the database. This should allow us to shrink the database considerably, which hopefully might have a positive effect on our database performance.

@Turbo87 Turbo87 added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-backend ⚙️ labels May 6, 2024
@Turbo87 Turbo87 requested a review from a team May 6, 2024 17:38
@Turbo87 Turbo87 force-pushed the archive-version-downloads branch from eb511c5 to 1fe523a Compare May 6, 2024 17:43

codecov bot commented May 6, 2024

Codecov Report

Attention: Patch coverage is 84.11215%, with 51 lines in your changes missing coverage. Please review.

Project coverage is 88.43%. Comparing base (180c7b2) to head (3e730b9).

Files Patch % Lines
src/worker/jobs/archive_version_downloads.rs 85.89% 44 Missing ⚠️
src/admin/enqueue_job.rs 0.00% 5 Missing ⚠️
src/bin/background-worker.rs 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8596      +/-   ##
==========================================
- Coverage   88.50%   88.43%   -0.07%     
==========================================
  Files         274      275       +1     
  Lines       26938    27257     +319     
==========================================
+ Hits        23841    24105     +264     
- Misses       3097     3152      +55     


@eth3lbert
Contributor

Do you think generating a series of dates from date_start to current_date - 90 and then looping through them to export as CSV could be a possible solution? This approach came to mind as it's relatively simple. It would only require keeping track of date_start and wouldn't necessitate splitting the CSV file.

@Turbo87
Member Author

Turbo87 commented May 7, 2024

Do you think generating a series of dates from date_start to current_date - 90 and then looping through them to export as CSV could be a possible solution?

It is a possible solution, but a) figuring out that date_start value and b) exporting per date are quite slow queries, and it seemed faster to just export a single file and perform the post-processing on the Rust side instead. Admittedly, the post-processing step isn't particularly fast either, but from the limited amount of testing I've performed, it seemed like the one-export-query-plus-post-processing approach would be faster than the date_start query plus N export queries approach.
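The single-export-plus-split approach described above could be sketched roughly as follows. This is a hypothetical, standard-library-only illustration, not the PR's actual implementation (which uses the csv crate and writes per-date files to disk); the `date,version_id,downloads` column layout is an assumption.

```rust
use std::collections::BTreeMap;

/// Group the rows of a single exported CSV (assumed layout:
/// date,version_id,downloads) by their date column, so each group
/// can later be written out as its own per-day file.
fn split_by_date(csv: &str) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for line in csv.lines().skip(1) {
        // skip the header row; key each remaining row by its date column
        if let Some((date, _rest)) = line.split_once(',') {
            groups
                .entry(date.to_string())
                .or_default()
                .push(line.to_string());
        }
    }
    groups
}

fn main() {
    let csv = "date,version_id,downloads\n\
               2024-01-01,1,10\n\
               2024-01-02,1,7\n\
               2024-01-01,2,3\n";
    let groups = split_by_date(csv);
    assert_eq!(groups.len(), 2);
    assert_eq!(groups["2024-01-01"].len(), 2);
    println!("{} dates", groups.len());
}
```

The point of the design is that the database is queried exactly once; the grouping work moves into the worker process, where it is a linear scan over the exported rows.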

@Turbo87 Turbo87 force-pushed the archive-version-downloads branch from 1fe523a to bdd4acf Compare May 7, 2024 13:38
@eth3lbert
Contributor

Thanks for breaking it down!

it is a possible solution, but a) figuring out that date_start value

The simplest solution would be to start from the date the crates.io service went live if there is no date_start recorded. The date_start should then be updated and recorded every time the job is processed.

b) exporting per date are quite slow queries, and it seemed faster to just export a single file and perform the post-processing on the Rust side instead. Admittedly, the post-processing step isn't particularly fast either, but from the limited amount of testing I've performed, it seemed like the one-export-query-plus-post-processing approach would be faster than the date_start query plus N export queries approach.

I can't definitively conclude the cause of the slowness without knowing the query's execution plan. However, I suspect it's because the query isn't utilizing the index.

I'm unsure whether using the following query structure would be beneficial:

where version_id in (select id from versions) and date = 'some_date'

If performance remains poor when filtering with a single date (although this should improve once unneeded records are archived), then the proposed CSV-splitting approach makes perfect sense to me.
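The per-date alternative being suggested might look roughly like this in SQL. This is a hypothetical sketch; the table and column names (`version_downloads`, `version_id`, `date`, `downloads`) are assumed from context, and the series start date is a placeholder.

```sql
-- One export query per date, using the proposed filter shape:
SELECT version_id, downloads
FROM version_downloads
WHERE version_id IN (SELECT id FROM versions)
  AND date = '2024-01-01';

-- The candidate dates could come from a generated series,
-- from date_start up to 90 days before today:
SELECT d::date
FROM generate_series('2017-01-01'::date, CURRENT_DATE - 90, '1 day') AS d;
```

Whether the single-date filter is fast depends on an index covering the `date` column, which is exactly the open question about the execution plan raised above.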

Contributor

@LawnGnome LawnGnome left a comment


OK, so as implemented, this LGTM.

Did you test shelling out to psql against running the query and writing the CSV files directly from within the background worker? To me that feels like the simpler approach — and would give us some potential benefits around being able to order and/or chunk the dates that we're extracting, which would limit how many open files we'd need — which makes me suspect that there's a reason for \copy that I'm not seeing.

Approving to unblock this, but I'm curious about the above. (And have probably just missed some context while I was in the land of the 🦘!)

let parent_path = path.parent().ok_or_else(|| anyhow!("Invalid path"))?;

let mut reader = csv::Reader::from_path(path)?;
let mut writers: BTreeMap<Vec<u8>, _> = BTreeMap::new();
Contributor


Looks like standard Heroku dynos have a 10k open file descriptor limit.

This shouldn't be an issue for many years, but having one file descriptor open per historical day plus whatever other work is happening in the background worker is a potential footgun.

Member Author


It would be an issue if the historical days stayed in the database, but once we have exported the historical data to S3 we will no longer have this problem, because at most we would be exporting a couple of days, and most likely only a single day.

@Turbo87
Copy link
Member Author

Turbo87 commented May 14, 2024

Did you test shelling out to psql against running the query and writing the CSV files directly from within the background worker?

Yeah, I tried a native COPY command via diesel::sql_query(), but that complained about not being able to access the file, and the copy functionality in diesel itself is not yet released, AFAIK. Since we were already using something roughly similar for the database dump, I figured this should work well enough.

One potential alternative would be COPY ... TO STDOUT and then handling the stdout stream of the child process, but I'm not convinced that would be less complex than the current solution 😅
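For reference, the COPY ... TO STDOUT alternative mentioned above would mean streaming a child process's stdout from the worker. A minimal sketch of that pattern is below; it is hypothetical, and uses `echo` as a stand-in for the actual `psql` invocation so the example can run without a database.

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

/// Spawn a child process with a piped stdout and collect its output
/// line by line. A real implementation would invoke `psql` with a
/// `\copy (...) TO STDOUT CSV` statement; `echo` stands in here.
fn stream_stdout(cmd: &str, args: &[&str]) -> std::io::Result<Vec<String>> {
    let mut child = Command::new(cmd)
        .args(args)
        .stdout(Stdio::piped())
        .spawn()?;
    let stdout = child.stdout.take().expect("stdout was piped");
    // Reading incrementally keeps memory bounded even for large exports.
    let lines: Vec<String> = BufReader::new(stdout)
        .lines()
        .collect::<Result<_, _>>()?;
    child.wait()?;
    Ok(lines)
}

fn main() -> std::io::Result<()> {
    // Stand-in for: psql $DATABASE_URL -c "\copy (...) TO STDOUT CSV"
    let lines = stream_stdout("echo", &["2024-01-01,1,10"])?;
    assert_eq!(lines, vec!["2024-01-01,1,10"]);
    println!("read {} lines", lines.len());
    Ok(())
}
```

The trade-off discussed in the thread is real: this avoids the temporary file that `\copy ... TO <file>` needs, but adds child-process and stream-handling code in exchange.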

@Turbo87 Turbo87 force-pushed the archive-version-downloads branch from bdd4acf to 9bfdca3 Compare May 14, 2024 12:37
@Turbo87 Turbo87 force-pushed the archive-version-downloads branch from 9bfdca3 to 3e730b9 Compare May 14, 2024 12:39
@Turbo87 Turbo87 enabled auto-merge (squash) May 14, 2024 12:41
@Turbo87 Turbo87 merged commit 78e39c5 into rust-lang:main May 14, 2024
9 checks passed
@Turbo87 Turbo87 deleted the archive-version-downloads branch May 14, 2024 12:49