Archive old entries in `version_downloads` table #3479
Comments
…, r=pietroalbini
Include only the last 90 days of downloads in our database dumps
In #3479 we plan to drop old entries and archive them in some other way, so old entries will eventually disappear from dumps anyway. This should make use of the database dumps much more practical for daily use. I think it would be reasonable to even limit this to the past week of data.
r? `@pietroalbini` cc `@kornelski`, #2078
Summary from the team meeting today:
|
Any update on this? Specifically, how can older data be accessed? |
I can provide a complete version_downloads table dating back to 2014-11-11 to whoever sends me a preferred way to share a large file. Currently the CSV is 120,554,803 rows and 2.3 GB; gzipped, it is 361 MB. |
Thanks for your help. If you can upload it to any file sharing service, that would be really helpful. Gdrive/Dropbox/Onedrive or any service you prefer. |
https://send.vis.ee/download/6030078658da7a07/#QzIAS1VImWg0p5WfEAi9Dw
$ zcat version_downloads.csv.gz | (head; tail)
date,version_id,downloads
2014-11-11,6,7
2014-11-11,9,1
2014-11-11,10,1
2014-11-11,12,1
2014-11-11,13,1
2014-11-11,15,1
2014-11-11,16,1
2014-11-11,17,1
2014-11-11,20,1
2022-08-10,599691,6
2022-08-10,599692,6
2022-08-10,599693,6
2022-08-10,599694,4
2022-08-10,599695,5
2022-08-10,599696,5
2022-08-10,599697,5
2022-08-10,599698,4
2022-08-10,599699,4
2022-08-10,599700,4
Data from the last day is obviously partial because the day is not over yet. |
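For anyone working with a file shaped like the excerpt above, here is a minimal sketch of rolling the daily rows up into monthly totals. It assumes only the `date,version_id,downloads` header shown in the `zcat` output; the function names `monthly_downloads` and `monthly_downloads_from_gzip` are made up for illustration and are not part of any crates.io tooling.

```python
import csv
import gzip
import io
from collections import Counter


def monthly_downloads(lines):
    """Sum downloads per calendar month from rows of date,version_id,downloads."""
    totals = Counter()
    for row in csv.DictReader(lines):
        month = row["date"][:7]  # "2014-11-11" -> "2014-11"
        totals[month] += int(row["downloads"])
    return dict(totals)


def monthly_downloads_from_gzip(path):
    """Stream the gzipped CSV row by row instead of loading it all into memory."""
    with gzip.open(path, mode="rt", newline="") as handle:
        return monthly_downloads(handle)


# Small in-memory sample using rows copied from the excerpt above.
sample = io.StringIO(
    "date,version_id,downloads\n"
    "2014-11-11,6,7\n"
    "2014-11-11,9,1\n"
    "2022-08-10,599699,4\n"
    "2022-08-10,599700,4\n"
)
print(monthly_downloads(sample))
```

On the real file, `monthly_downloads_from_gzip("version_downloads.csv.gz")` would apply the same aggregation. Keep in mind that the final day in the file is partial, so the last month's total undercounts.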
@dtolnay Thank you so much!! |
Hi! My name is Tak-Ho Lee, and I am conducting research on open-source sustainability at the School of Computer Science at CMU, under Dr. Christian Kaestner. Carol Nichols directed me here. I want to gather project data, including the repository link, download counts, etc. As the issue mentions, the DB dump only has the past 90 days, so I was wondering if I could receive the CSV you're hosting (the link says it has expired). |
Hi @tlee0818, we just published a dataset for research purposes at Nature Scientific Data that does include downloads (with parsed repo URLs, commits, and much more) until September: |
Hi @wschuell, I was exploring the sample dataset but had trouble finding where monthly download counts exist. Could I have pointers to find it? Thanks! |
@tlee0818 I'm replying here for now, but the discussion should probably continue elsewhere to avoid spamming this issue; you can create an issue on this repo or you can easily find my academic email on csh.ac.at. Assuming you downloaded the dataset from figshare: the raw data (by version and day, corresponding to the data in the original dumps that we could complete thanks to C. Nichols) is in the package_version_downloads table of the SQLite DB, or in the corresponding CSV (careful, the id column does not match the official dumps). Refer to the README for where to find the files.
|
The crates.io team is considering adding a long-term archive for download statistics so that historic data can be removed from the database. The proposal in rust-lang/crates.io#3479 is to put daily CSV files into an S3 bucket. A new bucket has been created for this purpose. Unlike the existing buckets, this one is not publicly accessible. The bucket has been created in the new `crates-io-staging` account so that the team can more easily access it.
|
This was fixed last week, and the archive is now publicly available at https://static.crates.io/archive/version-downloads/ :) |
We should find a place to archive daily download counts, and drop old entries from `version_downloads`. We only ever query for the last 90 days of recent downloads. We could upload a CSV of the previous day's downloads to S3 as part of a daily background job.
Currently, the `version_downloads` table consumes 4241 MB and its primary key index consumes 1825 MB, about 6 GB combined. Reducing the size of this table should greatly reduce cache pressure on our database server (which has 4 GB of RAM) and will make the size of our experimental database dumps much more practical.
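The daily job described above could look roughly like the following sketch. It is an illustration under stated assumptions, not the crates.io implementation: it uses an in-memory SQLite database as a stand-in for the production database, assumes a `version_downloads` table with `version_id`, `date`, and `downloads` columns, and leaves the S3 upload out. The function names `export_day_csv` and `drop_old_entries` are hypothetical.

```python
import csv
import io
import sqlite3
from datetime import date, timedelta


def export_day_csv(conn, day):
    """Return one day's rows as CSV text with a date,version_id,downloads header."""
    rows = conn.execute(
        "SELECT date, version_id, downloads FROM version_downloads "
        "WHERE date = ? ORDER BY version_id",
        (day.isoformat(),),
    ).fetchall()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "version_id", "downloads"])
    writer.writerows(rows)
    return buffer.getvalue()


def drop_old_entries(conn, today, keep_days=90):
    """Delete rows older than the retention window; return how many were removed."""
    cutoff = (today - timedelta(days=keep_days)).isoformat()
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "DELETE FROM version_downloads WHERE date < ?", (cutoff,)
        )
    return cursor.rowcount


# Demo against an in-memory SQLite database (a stand-in, not the production setup).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE version_downloads ("
    "version_id INTEGER, date TEXT, downloads INTEGER, "
    "PRIMARY KEY (version_id, date))"
)
today = date(2022, 8, 10)
yesterday = today - timedelta(days=1)
conn.executemany(
    "INSERT INTO version_downloads VALUES (?, ?, ?)",
    [
        (6, "2014-11-11", 7),
        (599699, yesterday.isoformat(), 4),
        (599700, yesterday.isoformat(), 5),
    ],
)

archive = export_day_csv(conn, yesterday)  # the text a job would upload
removed = drop_old_entries(conn, today)
print(archive)
print("rows removed:", removed)
```

Dates are stored as ISO 8601 strings here, so a plain string comparison orders them correctly. A real job would also need to confirm that every day it is about to drop has already been archived; the demo skips that check.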