Skip to content

Archive old entries in version_downloads table #3479

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jtgeibel opened this issue Apr 1, 2021 · 12 comments
Closed

Archive old entries in version_downloads table #3479

jtgeibel opened this issue Apr 1, 2021 · 12 comments
Labels
A-backend ⚙️ C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear

Comments

@jtgeibel
Copy link
Member

jtgeibel commented Apr 1, 2021

We should find a place to archive daily download counts, and drop old entries from version_downloads. We only ever query for the last 90 days of recent downloads. We could upload a CSV of the previous day's downloads to S3 as part of a daily background job.

Currently, the version_downloads table consumes 4241 MB and its primary key index consumes 1825 MB. Reducing the size of this table should greatly reduce cache pressure on our database sever (with 4GB of RAM) and will make the size of our experimental database dumps much more practical.

@jtgeibel jtgeibel added the C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works label Apr 1, 2021
@Turbo87 Turbo87 added C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear A-backend ⚙️ and removed C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works labels Apr 4, 2021
bors added a commit that referenced this issue May 14, 2021
…, r=pietroalbini

Include only the last 90 days of downloads in our database dumps

In #3479 we plan to drop old entries and archive them in some other way, so old entries will eventually disappear from dumps anyway. This should make use of the database dumps much more practical for daily use. I think it would be reasonable to even limit this to the past week of data.

r? `@pietroalbini`
cc `@kornelski,` #2078
@Turbo87
Copy link
Member

Turbo87 commented Sep 17, 2021

Summary from the team meeting today:

  • we already only export the last 90 days in the database dump
  • we need a way to convert older entries in the database to CSV files (or similar)
  • we need a public place to store these exported files
  • afterwards we can remove the old data from the database

@nikhilpatel0
Copy link

Any update on this, Specifically how to access older data?

@dtolnay
Copy link
Member

dtolnay commented Aug 10, 2022

I can provide a complete version_downloads table dating back to 2014-11-11 to whoever sends me a preferred way to share a large file. Currently the csv is 120,554,803 rows, 2.3GB, gzipped is 361MB.

@nikhilpatel0
Copy link

Thanks for your help. If you can upload it to any file sharing service, that would be really helpful. Gdrive/Dropbox/Onedrive or any service you prefer.

@dtolnay
Copy link
Member

dtolnay commented Aug 10, 2022

https://send.vis.ee/download/6030078658da7a07/#QzIAS1VImWg0p5WfEAi9Dw

$ zcat version_downloads.csv.gz | (head; tail)
date,version_id,downloads
2014-11-11,6,7
2014-11-11,9,1
2014-11-11,10,1
2014-11-11,12,1
2014-11-11,13,1
2014-11-11,15,1
2014-11-11,16,1
2014-11-11,17,1
2014-11-11,20,1
2022-08-10,599691,6
2022-08-10,599692,6
2022-08-10,599693,6
2022-08-10,599694,4
2022-08-10,599695,5
2022-08-10,599696,5
2022-08-10,599697,5
2022-08-10,599698,4
2022-08-10,599699,4
2022-08-10,599700,4

Data from the last day is obviously partial because the day is not over yet.

@nikhilpatel0
Copy link

@dtolnay Thank you so much!!

@tlee0818
Copy link

@dtolnay

Hi! My name is Tak-Ho Lee, and I am conducting research on open-source sustainability at the School of Computer Science at CMU, under Dr. Christian Kaestner. Carol Nichols directed me here.

I want to gather project data, including the repository link, download counts, etc. As the issue mentions, the DB dump only has the past 90 days, so I was wondering if I could receive the CSV you're hosting (the link has expired it says).

@wschuell
Copy link

Hi @tlee0818 , we just published a dataset for research purposes at Nature Scientific Data that does include downloads (with parsed repo URLs, commits + much more) until september:
https://www.nature.com/articles/s41597-022-01819-z
Metadata here
And full data here
Feel free to reach out for more info!

@tlee0818
Copy link

tlee0818 commented Dec 8, 2022

Hi @wschuell, I was exploring the sample dataset but had trouble finding where monthly download counts exist. Could I have pointers to find it?

Thanks!

@wschuell
Copy link

wschuell commented Dec 9, 2022

@tlee0818 I'm replying here now but the discussion should probably continue elsewhere to avoid spamming this issue; you can create an issue on this repo or you can easily find my academic email on csh.ac.at .

Assuming you downloaded the dataset from figshare:

The raw data (by version and day -- corresponding to the data in the original dumps that we could complete thanks to C. Nichols) is in the package_version_downloads table of the SQLite DB, or in the corresponding CSV (careful, the id column does not match the official dumps). Refer to the README for where to find the files.
To get the monthly downloads globally, you can follow this notebook with time_window='month' at the beginning, or by crate you can run this query on the SQLite DB:

SELECT sum(pvd.downloads) AS dl_cnt,date(downloaded_at,'start of month') AS month,p.name FROM package_version_downloads pvd
INNER JOIN package_versions pv
ON pv.id=pvd.package_version
INNER JOIN packages p
ON p.id=pv.package_id
GROUP BY p.name,p.id,month

jdno added a commit to jdno/rust-simpleinfra that referenced this issue May 2, 2024
The crates.io team is considering adding a long-term archive for
download statistics so that historic data can be removed from the
database. The proposal in rust-lang/crates.io#3479 is to put daily CSV
files into an S3 bucket.

A new bucket has been created for this purpose. Other than the existing
buckets, this one is not publicly accessible. The bucket has been
created in the new `crates-io-staging` account so that the team can more
easily access it.
@Turbo87 Turbo87 moved this to For next meeting in crates.io team meetings May 3, 2024
jdno added a commit to jdno/rust-simpleinfra that referenced this issue May 8, 2024
The crates.io team is considering adding a long-term archive for
download statistics so that historic data can be removed from the
database. The proposal in rust-lang/crates.io#3479 is to put daily CSV
files into an S3 bucket.

A new bucket has been created for this purpose. Other than the existing
buckets, this one is not publicly accessible. The bucket has been
created in the new `crates-io-staging` account so that the team can more
easily access it.
@Turbo87
Copy link
Member

Turbo87 commented May 16, 2024

  • Implement ArchiveVersionDownloads background job #8596 has implemented a background worker job which archives old (90+ days) data from the version_downloads table to S3
  • yesterday we archived everything from 2014-11-11 to 2024-02-14 to S3
  • the S3 bucket is not publicly reachable (yet)

@Turbo87
Copy link
Member

Turbo87 commented Aug 24, 2024

the S3 bucket is not publicly reachable (yet)

this has been fixed last week and the archive is now publicly available at https://static.crates.io/archive/version-downloads/ :)

@Turbo87 Turbo87 closed this as completed Aug 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-backend ⚙️ C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear
Projects
None yet
Development

No branches or pull requests

6 participants