-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Taxi fare dataset is almost 50MB #206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hmmm. This means an interactive rebase of Would have been nice had this been caught in PR. Maybe whatever bot we're using to validate/build releases can flag a PR if it contains huge files like this, going forward. |
Hi Justin! Yup, I know how to do it more or less, but here's what I'm imagining. One way or another this will involve repointing master to a completely different commit id. Once that happens, people that are attempting to write PRs against master will have an interesting experience when they attempt to merge master. (I think rebases might be fine.) Which is why I'd love to get to the point where there's some check. I'd love to give the people that have the power to approve the PR some "help" so they get a hint that not all is as it should be, since I guess the line change count being north of 2 million didn't do it. Maybe a big red flag somewhere? |
@OliaG @aditidugar These datasets were checked-in by you. Can you please clarify why you need 1 million rows for training a sample and another 1 million rows for testing it? |
@TomFinley that PR was right before the release, they should not have been merged with that size, however now they are there and we will find a way to clean them up. |
Yep, @shauheen covered it - we can certainly trim this down if necessary, it wasn't something that we consciously considered before the initial check in. |
Should we create another repository like "dotnet/machinelearning/datasets" and store this files there? |
As @TomFinley said, removing the file from the repo won't have an impact on size unless we rebase the offending commit out. Which is doable but it requires all developers to effectively force reset their local histories to match master, which is what Tom described as a nightmare. It's doable, but it won't be a low-impact change for the team. I'm fairly good at Git and I'm happy to help here, but it will require coordination across all developers who have forks/clones, including with the internal VSTS mirror. |
I can help here too, if we decide to move forward with this. The internal VSTS mirror would need some work, but it wouldn't take more than 10/15 minutes. (Note: We've done it before ;)) |
I think that was the plan we came up with on #198 (comment). I guess the general approaches are:
|
So since no one want to rebase master and we figure out ways to provide test files into repo (through nuget or download during build) I'm closing this issue. |
These two files is almost 50mb altogether
https://github.com/dotnet/machinelearning/blob/master/test/data/taxi-fare-test.csv
https://github.com/dotnet/machinelearning/blob/master/test/data/taxi-fare-train.csv
#170 allows to download files from external sources. Can we move these files to separate repository and clean history?
The text was updated successfully, but these errors were encountered: