-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Added samples: GitHubLabeler and GettingStarted #198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
OliaG
commented
May 22, 2018
- Iris sample
- Sentiment Analysis sample
- TaxiFare sample
- GitHub issues classification sample
- Iris sample - Sentiment Analysis sample - TaxiFare sample -GitHub issues classification sample
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Would recommend Adding a few comments around training and evaluation pipelines for the iris, taxi and github samples as well similar to the sentiment analysis one.
|
||
//add a FastTreeBinaryClassifier, the decision tree learner for this project, and | ||
//three hyperparameters to be used for tuning decision tree performance | ||
pipeline.Add(new FastTreeBinaryClassifier() {NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend Bigrams+Trichargrams w/ AveragedPerceptronBinaryClassifier{iter=10} for text. AveragedPerceptron generally wins on text vs. FastTree in terms of accuracy & speed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I will use them for advance scenario samples. This one is a "Hello world" type of sample where I'm trying to make the pipeline as simple as possible. And in the next samples I'll demonstrate how the results can be improved with different transforms.
|
||
// TextFeaturizer is a transform that will be used to featurize an input column. | ||
// This is used to format and clean the data. | ||
pipeline.Add(new TextFeaturizer("Features", "SentimentText")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be a good opportunity to show off some of the hyperparameters of the TextFeaturizer.
pipeline.Add(new TextFeaturizer("Features", "SentimentText")
{
WordFeatureExtractor = new NGramNgramExtractor() { NgramLength = 2, AllLengths = true },
CharFeatureExtractor = new NGramNgramExtractor() { NgramLength = 3, AllLengths = false }
});
Others of course are available:
pipeline.Add(new TextFeaturizer("Features", "SentimentText") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense in other samples we are adding or going to add. For getting started sample the easier the better.
Console.WriteLine($" AccuracyMacro = {metrics.AccuracyMacro:0.####}, a value between 0 and 1, the closer to 1, the better"); | ||
Console.WriteLine($" AccuracyMicro = {metrics.AccuracyMicro:0.####}, a value between 0 and 1, the closer to 1, the better"); | ||
Console.WriteLine($" LogLoss = {metrics.LogLoss:0.####}, the closer to 0, the better"); | ||
Console.WriteLine($" LogLoss for class 1 = {metrics.PerClassLogLoss[0]:0.####}, the closer to 0, the better"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to get the actual name for class 1
? When a user runs on their own dataset, they will likely be interested in their actual class names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not review the samples in detail. However, I suggest some overall changes:
- It would be good to have XML Docs on the major classes and methods in each project.
- Where will the long form description of each sample go? My vote would be for a
README.md
in each of the project folders.
<?xml version="1.0" encoding="utf-8" ?> | ||
<configuration> | ||
<appSettings> | ||
<!--TODO: Please enter your own credentials here. --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a warning to this such that people know that they should not commit this file to a public repo, once changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How far would adding this to the .gitignore go for helping folks to not commit this back to a repo? I've done regex checking in githooks before (eg: /(password|access_token) ?=/i
), but I don't think githooks can be added to a repo & auto-run by other users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is a great solution for this. In general, the recommendation is to avoid using credentials from code and instead use certificates, user-based authentication, or a service like Azure KeyVault. Unfortunately this doesn't work for samples very well...
Samples/GitHubLabeler/.gitignore
Outdated
@@ -0,0 +1,288 @@ | |||
## Ignore Visual Studio temporary files, build results, and | |||
## files generated by popular Visual Studio add-ons. | |||
## |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think checking in this .gitignore file was intentional, was it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. We should only use the one from the root.
{ | ||
private static string AppPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]); | ||
private static string TrainDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "taxi-fare-train.csv"); | ||
private static string TestDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "taxi-fare-test.csv"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd like these samples to work in multiple places, I'd think, including cross platform. Do these \
style paths work in Linux?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, they don't work on Linux.
Instead, we should use Path.Combine("..", "..", "..", "..", "datasets")
.
private static string AppPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]); | ||
private static string TrainDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "imdb_labelled.txt"); | ||
private static string TestDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "yelp_labelled.txt"); | ||
private static string ModelPath => Path.Combine(AppPath, "Models", "SentimentModel.zip"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Models", "SentimentModel.zip"); [](start = 65, length = 32)
I've noticed you have I think unintentionally also put the model files (the .zips) in your PR. I think that was unintentional, and is just an artifafct of the run. Either way we have the mechanism to create this artifact here, and we generally try to avoid checking in binary artifacts if it can be helped. Could you remove them?
{ | ||
private static async Task Main(string[] args) | ||
{ | ||
if (args.Length != 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to simplify this sample a bit, so that people don't have to learn some syntax of a simple command line program? As far as I see the sample should just be a straight train of a model, followed by a label, right? Or are we showing how the artifact is preserved across trainings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the intention was to demonstrate how it is used in real life where you train it just once and inference many times. Completely agree that for the tutorial samples we want them to be simple train-evaluate-predict. This one is an end-to-end app that you would use for labeling issues. I will probably move it to separate folder for "apps infused with ML" examples.
Samples/GitHubLabeler/README.md
Outdated
## Labeling | ||
When the model is trained, it can be used for predicting new issue's label. To do so, run the application with `"label"` key: | ||
``` | ||
C:\GitHubLabeler\GitHubLabeler\bin\Debug\netcoreapp2.0>dotnet GitHubLabeler.dll label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C:\GitHubLabeler\GitHubLabeler\bin\Debug\netcoreapp2.0> [](start = 0, length = 55)
Is it possible to make these samples a bit less Windows specific? That is, perhaps the command to enter should be just on a line somewhere, and you elsewhere say, that it will be placed in such-and-such a directory.
Samples/GitHubLabeler/README.md
Outdated
This is a simple prototype application to demonstrate how to use [ML.NET](https://www.nuget.org/packages/Microsoft.ML/) APIs. The main focus is on creating, training, and using ML (Machine Learning) model that is implemented in Predictor.cs class. | ||
|
||
## Overview | ||
GitHubLabeler is a .NET Core console application that runs from command-line interface (CLI) and allows to: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allows to [](start = 97, length = 9)
Missing pronoun of some sort... allows you to, allows one to, other.
using Microsoft.ML.Trainers; | ||
using Microsoft.ML.Transforms; | ||
using Microsoft.ML; | ||
using System.Threading.Tasks; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether we like Microsoft
before System
or System
before Microsoft
, but we should probably pick one and make sure it's consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most developers prefer System
go first and have all others alphabetically sorted afterwards.
|
||
private static void Evaluate(PredictionModel<TaxiTrip, TaxiTripFarePrediction> model) | ||
{ | ||
var testData = new TextLoader<TaxiTrip>(TestDataPath, useHeader: true, separator: ","); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var testData = new TextLoader(TestDataPath, useHeader: true, separator: ","); [](start = 12, length = 87)
Something is very wrong here... the loader is part of the model, or ought to be. We shouldn't have to respecify it, any more than we have to respecify the transforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for evaluation it currently has to be specified, but this should be fixed. See #5 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -0,0 +1,2 @@ | |||
<wpf:ResourceDictionary xml:space="preserve" xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" xmlns:s="clr-namespace:System;assembly=mscorlib" xmlns:ss="urn:shemas-jetbrains-com:settings-storage-xaml" xmlns:wpf="http://schemas.microsoft.com/winfx/2006/xaml/presentation"> | |||
<s:String x:Key="/Default/CodeInspection/CSharpLanguageProject/LanguageLevel/@EntryValue">CSharp72</s:String></wpf:ResourceDictionary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shoudn't commit this file as this is from ReSharper ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove it
Samples/GitHubLabeler/README.md
Outdated
|
||
b. To work with labels from your GitHub repository, you will need to train the model on your data. To do so, export GitHub issues from your repository into `.tsv` file with the following columns: | ||
* ID - issue’s ID | ||
* Area - issue’s label (named this way to avoid confusion with the Label concept in ML.NET) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
’ [](start = 18, length = 1)
Should we avoid non-ascii characters unless we actually need them? I think I'd prefer a plain-old single quote here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
PassengerCount = 1, | ||
TripDistance = 10.33f, | ||
PaymentType = "CSH", | ||
FareAmount = 0 // predict it. actual = 29.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we allow this field to be a null or NaN? It could help the user understand this isn't an input but the output calculated value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I can make it nullable type, getting Type mismatch exception.
datasets/README.md
Outdated
@@ -0,0 +1,11 @@ | |||
MICROSOFT PROVIDES THE DATASETS ON AN "AS IS" BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INLCUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to put all the datasets in a single place - test/data
or similar. That way they aren't duplicated unintentionally, and there is one place to see all of the datasets in the repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, was thinking about it too. I suggest moving datasets from test to the folder datasets in the root and merge them with data that is there for samples. This way it will be obvious for users where to look for it. I can do it if no objections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's another competing concern of having too many "things" in the root folder.
A structure I could imagine working would be:
build \
docs \
samples \
datasets \ (or just 'data'?)
GettingStarted \
GitHubLabeler \
src \
Project1 \
Project2 \
test \
Test1 \
Test2 \
And all the tests changing to instead of read from test\data
, they now point to samples\datasets
(or samples\data
).
<TargetFramework>netcoreapp2.0</TargetFramework> | ||
</PropertyGroup> | ||
|
||
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|AnyCPU'"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PropertyGroup is incorrect, as the property below won't be set for Release builds.
If you want to set <LangVersion>
to latest, it should be done unconditionally. You can move <LangVersion>latest</LangVersion>
to the above PropertyGroup.
<TargetFramework>netcoreapp2.0</TargetFramework> | ||
</PropertyGroup> | ||
|
||
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|AnyCPU'"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here as above -
Move <LangVersion>latest</LangVersion>
into the unconditional property group, and remove these two.
Hi @OliaG thanks so much for these examples. Let's talk about the size of the example files: Taxi test: about 25 MB. For machine learning these are pretty small datasets, but they're pretty big as things to commit into a git repo. Now, we could perhaps do something like utilize Git LFS, which would be an OK remediation except that I see that this PR was done not by working out of a fork then trying to merge back into the main repo, but rather was done by establishing a branch within the main repo itself. So even if we fix it to not have these large files directly, this repo itself is somewhat tainted, since a clone or fork of the repo will contain these large files. Do we know how to clean this up? I know it's possible, I just don't know how offhand. Maybe @eerhardt ? Maybe development in forks a good idea, as opposed to branches in the main repo. If we look at the other PRs currently open, they're from branches in people's forks of this repo, not the repo itself. Perhaps we should clarify that somewhere. (Contribution guide, other?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi all, sorry just waiting on changes, especially over large files. I noticed these changes were approved immediately, just want to make sure nothing bad happens.
datasets/imdb_labelled.txt
Outdated
@@ -0,0 +1,1000 @@ | |||
A very, very, very slow-moving, aimless movie about a distressed, drifting young man. 0 | |||
Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out. 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Labeled, not labelled, FYI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is British VS US spelling. And it is original dataset's name, I did not change it. But I can remove extra "l" :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When in Rome ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh really? Hmmm. I never knew that. How strange. I guess if that's the original name we should keep it.
* added ceiling to mathutils and have parquetloader default to sequential reading instead of throwing * factor out sequence creation and add check for overflow in MathUtils
…ph variable outputs (#152) * Adding support for training metrics in PipelineSweeperMacro and needed support files. Also includes new output information in PipelineSweeperMacro output graph to make consumption of returned pipelines easier. * Changed where XML comment was placed. * Added more tests (uncommented and fixed) for auto inference. Changed magic number in AutoMlUtils to be mix double value, per review comment. * Added another test, TestPipelineSweeperMacroNoTransforms. * Updated tests to include warning disabling (following Zeeshan S's example) to get build working. * Changes to checks in AutoMlUtils (more correct usage of them). * Fixing issue with ExceptParam using value and not name of parameters. * Fixing errors on use of ExceptParam.
{ | ||
public class SentimentData | ||
{ | ||
[Column("0")] public string SentimentText; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest putting the attributes on separate lines as that's more in line with default formatting.
{ | ||
private static string AppPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]); | ||
private static string TrainDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "iris_train.txt"); | ||
private static string TestDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "iris_test.txt"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use backslashes as this doesn't work cross-platform. You should do:
Path.Combine(AppPath, "..", "..", "..", "..", "datasets", "iris_train.txt");
using Microsoft.ML.Trainers; | ||
using Microsoft.ML.Transforms; | ||
using Microsoft.ML; | ||
using System.Threading.Tasks; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most developers prefer System
go first and have all others alphabetically sorted afterwards.
Samples/GitHubLabeler/.gitignore
Outdated
@@ -0,0 +1,288 @@ | |||
## Ignore Visual Studio temporary files, build results, and | |||
## files generated by popular Visual Studio add-ons. | |||
## |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. We should only use the one from the root.
<?xml version="1.0" encoding="utf-8" ?> | ||
<configuration> | ||
<appSettings> | ||
<!--TODO: Please enter your own credentials here. --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is a great solution for this. In general, the recommendation is to avoid using credentials from code and instead use certificates, user-based authentication, or a service like Azure KeyVault. Unfortunately this doesn't work for samples very well...
@@ -0,0 +1,2 @@ | |||
<wpf:ResourceDictionary xml:space="preserve" xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" xmlns:s="clr-namespace:System;assembly=mscorlib" xmlns:ss="urn:shemas-jetbrains-com:settings-storage-xaml" xmlns:wpf="http://schemas.microsoft.com/winfx/2006/xaml/presentation"> | |||
<s:String x:Key="/Default/CodeInspection/CSharpLanguageProject/LanguageLevel/@EntryValue">CSharp72</s:String></wpf:ResourceDictionary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shoudn't commit this file as this is from ReSharper ;-)
Samples/GitHubLabeler/README.md
Outdated
|
||
b. To work with labels from your GitHub repository, you will need to train the model on your data. To do so, export GitHub issues from your repository into `.tsv` file with the following columns: | ||
* ID - issue’s ID | ||
* Area - issue’s label (named this way to avoid confusion with the Label concept in ML.NET) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
* Reduce number of hash bits in stratification column and add a unit test. * Address PR comments.
* replace housing uci dataset to wine quality
Moved datasets from tests to examples Removed model files
- Iris sample - Sentiment Analysis sample - TaxiFare sample -GitHub issues classification sample
Moved datasets from tests to examples Removed model files
This is a very nice PR, but shouldn't we keep all samples/examples in the same place? So far, we have been collecting such samples in tests/Microsoft/ML/Scenarios. |
@KrzysztofCwalina regarding keep all samples/example in the same place, here is my understanding. test/Microsoft/ML/Scenarios are integration tests the primary goal of which is to test our code end-to-end. We can run them to make sure there are no breaking changes introduced by new functionality. The examples (this PR) are a way of showcasing our functionality to users. They are not tests but independent projects, sometimes applications containing additional logic (like posting requests to GitHub). The examples have to be discoverable for users (like root folder “examples”, it will be hard for users to find them here: test/Microsoft/ML/Scenarios) and F5-ble, users should be able do download just one example and successfully run it. You can’t do it with integration tests. So for me they belong to different areas and that’s why I would keep tests in tests, and this samples in examples folder. It is a good idea though to mention these tests in docs or readme, so users can find additional use cases. |
@@ -0,0 +1,191 @@ | |||
# IDV File Format | |||
|
|||
This document describes ML.NET's Binary dataview file format, version 1.1.1.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks familiar... was this intended to be part of this PR? Was there a bad merge somewhere?