Taxi fare dataset is almost 50MB #206


Closed
Ivanidzo4ka opened this issue May 22, 2018 · 11 comments
Labels: enhancement (New feature or request)

Comments

@Ivanidzo4ka
Contributor

These two files are almost 50 MB altogether:
https://github.com/dotnet/machinelearning/blob/master/test/data/taxi-fare-test.csv
https://github.com/dotnet/machinelearning/blob/master/test/data/taxi-fare-train.csv

#170 allows downloading files from external sources. Can we move these files to a separate repository and clean up the history?

@shauheen shauheen added the enhancement New feature or request label May 22, 2018
@shauheen shauheen added this to the 0518 milestone May 22, 2018
@TomFinley
Contributor

Hmmm. This means an interactive rebase of master. What a nightmare. If we have to we have to I guess.

It would have been nice had this been caught in the PR. Maybe whatever bot we're using to validate/build releases could flag a PR going forward if it contains huge files like this.
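A gate like that could be a small script in the PR validation build. A minimal sketch, assuming a POSIX shell and a 1 MB limit (no such bot exists today; in CI you would feed it the output of `git diff --name-only origin/master...HEAD`):

```shell
# Hypothetical size gate sketching the check described above; the 1 MB
# limit and the function name are assumptions, not an existing bot feature.
check_size() {
  # check_size FILE LIMIT_BYTES -> nonzero exit (and a message) if FILE is too big
  size=$(wc -c < "$1")
  if [ "$size" -gt "$2" ]; then
    echo "FAIL: $1 is $size bytes (limit $2)"
    return 1
  fi
}
```

In a CI job this would run once per file added in the PR and fail the build on the first oversized file, giving reviewers the "big red flag" before merge rather than after.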

@justinormont
Contributor

@TomFinley
Contributor

TomFinley commented May 23, 2018

Hi Justin! Yup, I know how to do it, more or less, but here's what I'm imagining. One way or another this will involve repointing master to a completely different commit id. Once that happens, people who are attempting to write PRs against master will have an interesting experience when they attempt to merge master. (I think rebases might be fine.)

Which is why I'd love to get to the point where there's some check. I'd love to give the people that have the power to approve the PR some "help" so they get a hint that not all is as it should be, since I guess the line change count being north of 2 million didn't do it. Maybe a big red flag somewhere?

@codemzs
Member

codemzs commented May 23, 2018

@OliaG @aditidugar These datasets were checked in by you. Can you please clarify why you need 1 million rows for training a sample and another 1 million rows for testing it?

CC: @asthana86 @terrajobst

@shauheen
Contributor

@TomFinley that PR landed right before the release; the files should not have been merged at that size. However, now they are there, and we will find a way to clean them up.

@aditidugar-zz

Yep, @shauheen covered it; we can certainly trim this down if necessary. It wasn't something we consciously considered before the initial check-in.

@Ivanidzo4ka
Contributor Author

Should we create another repository like "dotnet/machinelearning/datasets" and store these files there?
I have code to download files and put them into the repo, but I need a place to keep these files, since many of them are either behind an authorization page or slightly modified to make them more readable.

@terrajobst

As @TomFinley said, removing the file from the repo won't have an impact on size unless we rebase the offending commit out. That is doable, but it requires all developers to effectively force-reset their local histories to match master, which is what Tom described as a nightmare; it won't be a low-impact change for the team. I'm fairly good at Git and I'm happy to help here, but it will require coordination across all developers who have forks/clones, including the internal VSTS mirror.
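For concreteness, here is a scratch-repo demo of the kind of history rewrite being discussed: commit a large file, then rewrite every commit to drop it, which changes all commit ids from that point on. This is a sketch of the mechanics using `git filter-branch`, not the actual cleanup plan; a real cleanup today would more likely use git-filter-repo.

```shell
# Demo in a throwaway repo; file names are illustrative.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email demo@example.com
git config user.name demo
# Simulate the mistake: commit a ~1 MB data file, then some code.
head -c 1000000 /dev/zero > taxi-fare-train.csv
git add taxi-fare-train.csv && git commit -qm "add dataset"
echo code > app.txt && git add app.txt && git commit -qm "add code"
# Rewrite every commit to remove the file from the index. All commit ids
# after the offending commit change, which is why clones must be reset.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --index-filter 'git rm -q --cached --ignore-unmatch taxi-fare-train.csv' HEAD
```

After the rewrite the file no longer appears anywhere in the history, but anyone with a pre-rewrite clone now holds divergent commit ids and must hard-reset onto the new master.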

@eerhardt
Member

I can help here too, if we decide to move forward with this.

The internal VSTS mirror would need some work, but it wouldn't take more than 10 to 15 minutes. (Note: We've done it before ;))

@eerhardt
Member

> Should we create another repository like "dotnet/machinelearning/datasets" and store these files there?

I think that was the plan we came up with on #198 (comment). I guess the general approaches are:

  • Small data set that we can redistribute: place in dotnet/machinelearning.
  • Large data set that we can redistribute: place in a separate repo. It can be downloaded at build time from a GitHub URL using the mechanism in "switch housing dataset to wine" #170. If needed for a sample app, make a NuGet package from that repo so it can be restored into the sample app's project.
  • Data set we can't redistribute: use the mechanism in "switch housing dataset to wine" #170.
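The download-at-build-time options above could look roughly like this. A minimal sketch in Python, where the function name, checksum step, and cache path are illustrative assumptions rather than the actual #170 mechanism:

```python
# Hypothetical build-time fetch: download a dataset once, cache it locally,
# and optionally verify a SHA-256 checksum. Not the real #170 implementation.
import hashlib
import os
import urllib.request


def ensure_dataset(url, dest, sha256=None):
    """Download url to dest unless it already exists; optionally verify it."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    if sha256 is not None:
        with open(dest, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest != sha256:
            raise ValueError(f"checksum mismatch for {dest}: got {digest}")
    return dest
```

A sample app's build step could call something like `ensure_dataset` before tests run, so the large CSVs never enter Git history at all.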

@shauheen shauheen removed this from the 0518 milestone May 30, 2018
@Ivanidzo4ka
Contributor Author

Since no one wants to rebase master and we've figured out ways to provide test files to the repo (through NuGet or download during build), I'm closing this issue.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022

8 participants