Implement ArchiveVersionDownloads background job #8596
Conversation
Force-pushed from eb511c5 to 1fe523a.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

              main      #8596     +/-
Coverage      88.50%    88.43%    -0.07%
Files         274       275       +1
Lines         26938     27257     +319
Hits          23841     24105     +264
Misses        3097      3152      +55

☔ View full report in Codecov by Sentry.
Do you think generating a series of dates from
It is a possible solution, but a) figuring out that
Force-pushed from 1fe523a to bdd4acf.
Thanks for breaking it down!
The simplest solution would be to start from the date the crates.io service was activated if there is no
I can't definitively conclude the cause of the slowness without knowing the query's execution plan, but I suspect the query isn't using the index. I'm unsure whether the following query structure would help:

where version_id in (select id from versions) and date = 'some_date'

If performance remains poor even when filtering on a single date (although this should improve once the unneeded records are archived), then the proposed approach of splitting the CSV file makes perfect sense to me.
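To make the suggested shape concrete, here is a minimal sketch of how such a per-date query could be issued from the worker. It assumes diesel 2.x with the `chrono` feature; the struct, function name, and index assumption are mine, not the PR's code.

```rust
// Minimal sketch (not the PR's code) of running the suggested per-date query
// through diesel's `sql_query`. Assumes diesel 2.x with the `chrono` feature;
// the struct and function names are made up for illustration.
use chrono::NaiveDate;
use diesel::prelude::*;
use diesel::sql_query;
use diesel::sql_types::{Date, Integer};

#[derive(QueryableByName)]
struct DownloadRow {
    #[diesel(sql_type = Integer)]
    version_id: i32,
    #[diesel(sql_type = Integer)]
    downloads: i32,
}

fn downloads_for_day(conn: &mut PgConnection, day: NaiveDate) -> QueryResult<Vec<DownloadRow>> {
    // Filtering on a single `date` value keeps the query selective, so an
    // index covering `version_downloads (date)` (if one exists) can be used.
    sql_query(
        "SELECT version_id, downloads FROM version_downloads \
         WHERE version_id IN (SELECT id FROM versions) AND date = $1",
    )
    .bind::<Date, _>(day)
    .load(conn)
}
```

If the plan still shows a sequential scan for a single date, that would point at a missing or unusable index rather than at the query shape itself.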
OK, so as implemented, this LGTM.
Did you test shelling out to psql against running the query and writing the CSV files directly from within the background worker? To me that feels like the simpler approach — and would give us some potential benefits around being able to order and/or chunk the dates that we're extracting, which would limit how many open files we'd need — which makes me suspect that there's a reason for \copy that I'm not seeing.
Approving to unblock this, but I'm curious about the above. (And have probably just missed some context while I was in the land of the 🦘!)
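For reference, this is roughly what the "shell out to psql" option looks like from Rust. The column list, file naming, and error handling below are illustrative assumptions rather than the PR's actual code, and `date` is assumed to be a trusted, pre-formatted value.

```rust
// Illustrative sketch of the "shell out to psql" approach; the column list,
// file naming, and error handling are assumptions, not the PR's actual code.
use std::process::Command;

fn export_day_via_psql(database_url: &str, date: &str, out_path: &str) -> anyhow::Result<()> {
    // `\copy` runs the COPY on the client side, so psql itself writes the CSV
    // file instead of the Postgres server process. `date` is assumed to be a
    // trusted, pre-formatted value; no escaping is done here.
    let copy_cmd = format!(
        "\\copy (SELECT version_id, downloads, date FROM version_downloads \
         WHERE date = '{date}') TO '{out_path}' WITH CSV HEADER"
    );

    let status = Command::new("psql")
        .arg(database_url)
        .arg("-c")
        .arg(copy_cmd)
        .status()?;

    anyhow::ensure!(status.success(), "psql exited with {status}");
    Ok(())
}
```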
let parent_path = path.parent().ok_or_else(|| anyhow!("Invalid path"))?;

let mut reader = csv::Reader::from_path(path)?;
let mut writers: BTreeMap<Vec<u8>, _> = BTreeMap::new();
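For context, here is a hedged sketch of the per-date splitting this snippet is part of: one `csv::Writer` per distinct date, keyed by the raw date bytes in a `BTreeMap`. The date column index (2) and the output file naming are assumptions, not the PR's actual implementation.

```rust
// Hedged sketch of per-date splitting: one `csv::Writer` per distinct date,
// keyed by the raw date bytes. Column index 2 and the file naming scheme are
// assumptions for illustration.
use std::collections::BTreeMap;
use std::fs::File;
use std::path::Path;

use anyhow::anyhow;

fn split_by_date(path: &Path) -> anyhow::Result<()> {
    let parent_path = path.parent().ok_or_else(|| anyhow!("Invalid path"))?;

    let mut reader = csv::Reader::from_path(path)?;
    let mut writers: BTreeMap<Vec<u8>, csv::Writer<File>> = BTreeMap::new();

    let headers = reader.byte_headers()?.clone();
    for record in reader.byte_records() {
        let record = record?;
        let date = record.get(2).ok_or_else(|| anyhow!("Missing date column"))?;

        // Each distinct date gets its own output file; the writer (and its
        // file descriptor) stays open until the whole export is done.
        if !writers.contains_key(date) {
            let file_name = format!("{}.csv", String::from_utf8_lossy(date));
            let mut writer = csv::Writer::from_path(parent_path.join(file_name))?;
            writer.write_byte_record(&headers)?;
            writers.insert(date.to_vec(), writer);
        }

        writers.get_mut(date).unwrap().write_byte_record(&record)?;
    }

    for (_, mut writer) in writers {
        writer.flush()?;
    }

    Ok(())
}
```

Since the number of open writers grows with the number of distinct dates in the export, this is where the file descriptor concern below comes from.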
Looks like standard Heroku dynos have a 10k open file descriptor limit.
This shouldn't be an issue for many years, but having one file descriptor open per historical day plus whatever other work is happening in the background worker is a potential footgun.
It would be an issue if the historical days stayed in the database, but once we have exported the historical data to S3 we will no longer have this problem, because at most we would be exporting a couple of days, and most likely only a single day.
yeah, I tried a native one potential alternative would be
Force-pushed from bdd4acf to 9bfdca3.
Force-pushed from 9bfdca3 to 3e730b9.
This PR implements a solution for #3479. The new background job can be used to export all version download data to S3 and then remove it from the database. This should allow us to shrink the database considerably, which will hopefully have a positive effect on database performance.
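As a rough illustration of the "remove it from the database" step, here is a minimal sketch assuming diesel and a simple date cutoff; the real job's SQL, cutoff, and transaction handling may differ.

```rust
// Rough illustration of the "remove it from the database" step only, assuming
// diesel; the real job's SQL, date cutoff, and transaction handling may differ.
use diesel::prelude::*;
use diesel::sql_query;

fn delete_archived_rows(conn: &mut PgConnection) -> QueryResult<usize> {
    // By this point the rows for past dates have already been exported to S3,
    // so they can be dropped from the hot table.
    sql_query("DELETE FROM version_downloads WHERE date < CURRENT_DATE").execute(conn)
}
```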