Taxi fare dataset is almost 50MB #206


Closed
Ivanidzo4ka opened this issue May 22, 2018 · 11 comments
Labels: enhancement (New feature or request)

Comments

@Ivanidzo4ka
Contributor

These two files are almost 50 MB altogether:
https://github.com/dotnet/machinelearning/blob/master/test/data/taxi-fare-test.csv
https://github.com/dotnet/machinelearning/blob/master/test/data/taxi-fare-train.csv

#170 allows downloading files from external sources. Can we move these files to a separate repository and clean up the history?

@shauheen shauheen added the enhancement New feature or request label May 22, 2018
@shauheen shauheen added this to the 0518 milestone May 22, 2018
@TomFinley
Contributor

Hmmm. This means an interactive rebase of master. What a nightmare. If we have to we have to I guess.

It would have been nice had this been caught in the PR. Maybe whatever bot we're using to validate/build releases could flag a PR going forward if it contains huge files like this.
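A gate like that could be a small script in the PR validation build. A minimal sketch, assuming a POSIX shell and a 1 MB limit (no such bot exists today; in CI you would feed it the output of `git diff --name-only origin/master...HEAD`):

```shell
# Hypothetical size gate sketching the check described above; the 1 MB
# limit and the function name are assumptions, not an existing bot feature.
check_size() {
  # check_size FILE LIMIT_BYTES -> nonzero exit (and a message) if FILE is too big
  size=$(wc -c < "$1")
  if [ "$size" -gt "$2" ]; then
    echo "FAIL: $1 is $size bytes (limit $2)"
    return 1
  fi
}
```

In a CI job this would run once per file added in the PR and fail the build on the first oversized file, giving reviewers the "big red flag" before merge rather than after.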

@justinormont
Contributor

@TomFinley
Contributor

TomFinley commented May 23, 2018

Hi Justin! Yup, I know how to do it, more or less, but here's what I'm imagining. One way or another this will involve repointing master to a completely different commit id. Once that happens, people who are attempting to write PRs against master will have an interesting experience when they attempt to merge master. (I think rebases might be fine.)

Which is why I'd love to get to the point where there's some check. I'd love to give the people that have the power to approve the PR some "help" so they get a hint that not all is as it should be, since I guess the line change count being north of 2 million didn't do it. Maybe a big red flag somewhere?

@codemzs
Member

codemzs commented May 23, 2018

@OliaG @aditidugar These datasets were checked in by you. Can you please clarify why you need 1 million rows for training a sample and another 1 million rows for testing it?

CC: @asthana86 @terrajobst

@shauheen
Contributor

@TomFinley that PR landed right before the release; the files should not have been merged at that size. However, now they are there, and we will find a way to clean them up.

@aditidugar-zz

Yep, @shauheen covered it; we can certainly trim this down if necessary. It wasn't something we consciously considered before the initial check-in.

@Ivanidzo4ka
Contributor Author

Should we create another repository like "dotnet/machinelearning/datasets" and store these files there?
I have code to download files and put them into the repo, but I need a place to keep these files, since many of them are either behind an authorization page or slightly modified to make them more readable.

@terrajobst

As @TomFinley said, removing the file from the repo won't have an impact on size unless we rebase the offending commit out. That is doable, but it requires all developers to effectively force-reset their local histories to match master, which is what Tom described as a nightmare; it won't be a low-impact change for the team. I'm fairly good at Git and I'm happy to help here, but it will require coordination across all developers who have forks/clones, including the internal VSTS mirror.
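For concreteness, here is a scratch-repo demo of the kind of history rewrite being discussed: commit a large file, then rewrite every commit to drop it, which changes all commit ids from that point on. This is a sketch of the mechanics using `git filter-branch`, not the actual cleanup plan; a real cleanup today would more likely use git-filter-repo.

```shell
# Demo in a throwaway repo; file names are illustrative.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email demo@example.com
git config user.name demo
# Simulate the mistake: commit a ~1 MB data file, then some code.
head -c 1000000 /dev/zero > taxi-fare-train.csv
git add taxi-fare-train.csv && git commit -qm "add dataset"
echo code > app.txt && git add app.txt && git commit -qm "add code"
# Rewrite every commit to remove the file from the index. All commit ids
# after the offending commit change, which is why clones must be reset.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --index-filter 'git rm -q --cached --ignore-unmatch taxi-fare-train.csv' HEAD
```

After the rewrite the file no longer appears anywhere in the history, but anyone with a pre-rewrite clone now holds divergent commit ids and must hard-reset onto the new master.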

@eerhardt
Member

I can help here too, if we decide to move forward with this.

The internal VSTS mirror would need some work, but it wouldn't take more than 10 to 15 minutes. (Note: We've done it before ;))

@eerhardt
Member

> Should we create another repository like "dotnet/machinelearning/datasets" and store these files there?

I think that was the plan we came up with on #198 (comment). I guess the general approaches are:

  • Small data set that we can redistribute: place in dotnet/machinelearning.
  • Large data set that we can redistribute: place in a separate repo. It can be downloaded at build time from a GitHub URL using the mechanism in "switch housing dataset to wine" #170. If needed for a sample app, make a NuGet package from that repo so it can be restored into the sample app's project.
  • Data set we can't redistribute: use the mechanism in "switch housing dataset to wine" #170.
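The download-at-build-time options above could look roughly like this. A minimal sketch in Python, where the function name, checksum step, and cache path are illustrative assumptions rather than the actual #170 mechanism:

```python
# Hypothetical build-time fetch: download a dataset once, cache it locally,
# and optionally verify a SHA-256 checksum. Not the real #170 implementation.
import hashlib
import os
import urllib.request


def ensure_dataset(url, dest, sha256=None):
    """Download url to dest unless it already exists; optionally verify it."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    if sha256 is not None:
        with open(dest, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest != sha256:
            raise ValueError(f"checksum mismatch for {dest}: got {digest}")
    return dest
```

A sample app's build step could call something like `ensure_dataset` before tests run, so the large CSVs never enter Git history at all.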

@shauheen shauheen removed this from the 0518 milestone May 30, 2018
@Ivanidzo4ka
Contributor Author

Since no one wants to rebase master and we've figured out ways to provide test files to the repo (through NuGet or download during build), I'm closing this issue.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022

8 participants