Skip to content

Added samples: GitHubLabeler and GettingStarted #198

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 16 commits into from
Closed

Conversation

OliaG
Copy link
Contributor

@OliaG OliaG commented May 22, 2018

  • Iris sample
  • Sentiment Analysis sample
  • TaxiFare sample
  • GitHub issues classification sample

- Iris sample
- Sentiment Analysis sample
- TaxiFare sample
-GitHub issues classification sample
Copy link
Contributor

@asthana86 asthana86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Would recommend Adding a few comments around training and evaluation pipelines for the iris, taxi and github samples as well similar to the sentiment analysis one.

@asthana86 asthana86 requested a review from danmoseley May 22, 2018 06:34

//add a FastTreeBinaryClassifier, the decision tree learner for this project, and
//three hyperparameters to be used for tuning decision tree performance
pipeline.Add(new FastTreeBinaryClassifier() {NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend Bigrams+Trichargrams w/ AveragedPerceptronBinaryClassifier{iter=10} for text. AveragedPerceptron generally wins on text vs. FastTree in terms of accuracy & speed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I will use them for advance scenario samples. This one is a "Hello world" type of sample where I'm trying to make the pipeline as simple as possible. And in the next samples I'll demonstrate how the results can be improved with different transforms.


// TextFeaturizer is a transform that will be used to featurize an input column.
// This is used to format and clean the data.
pipeline.Add(new TextFeaturizer("Features", "SentimentText"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be a good opportunity to show off some of the hyperparameters of the TextFeaturizer.

pipeline.Add(new TextFeaturizer("Features", "SentimentText")
{
    WordFeatureExtractor = new NGramNgramExtractor() { NgramLength = 2, AllLengths = true },
    CharFeatureExtractor = new NGramNgramExtractor() { NgramLength = 3, AllLengths = false }      
});

Others of course are available:

pipeline.Add(new TextFeaturizer("Features", "SentimentText")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense in other samples we are adding or going to add. For getting started sample the easier the better.

Console.WriteLine($" AccuracyMacro = {metrics.AccuracyMacro:0.####}, a value between 0 and 1, the closer to 1, the better");
Console.WriteLine($" AccuracyMicro = {metrics.AccuracyMicro:0.####}, a value between 0 and 1, the closer to 1, the better");
Console.WriteLine($" LogLoss = {metrics.LogLoss:0.####}, the closer to 0, the better");
Console.WriteLine($" LogLoss for class 1 = {metrics.PerClassLogLoss[0]:0.####}, the closer to 0, the better");
Copy link
Contributor

@justinormont justinormont May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to get the actual name for class 1? When a user runs on their own dataset, they will likely be interested in their actual class names.

@shauheen
Copy link
Contributor

This PR closes #140 @OliaG

@shauheen shauheen closed this May 22, 2018
@shauheen shauheen reopened this May 22, 2018
Copy link
Member

@markusweimer markusweimer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not review the samples in detail. However, I suggest some overall changes:

  • It would be good to have XML Docs on the major classes and methods in each project.
  • Where will the long form description of each sample go? My vote would be for a README.md in each of the project folders.

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<appSettings>
<!--TODO: Please enter your own credentials here. -->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a warning to this such that people know that they should not commit this file to a public repo, once changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How far would adding this to the .gitignore go for helping folks to not commit this back to a repo? I've done regex checking in githooks before (eg: /(password|access_token) ?=/i), but I don't think githooks can be added to a repo & auto-run by other users.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a great solution for this. In general, the recommendation is to avoid using credentials from code and instead use certificates, user-based authentication, or a service like Azure KeyVault. Unfortunately this doesn't work for samples very well...

@@ -0,0 +1,288 @@
## Ignore Visual Studio temporary files, build results, and
## files generated by popular Visual Studio add-ons.
##
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think checking in this .gitignore file was intentional, was it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. We should only use the one from the root.

{
private static string AppPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
private static string TrainDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "taxi-fare-train.csv");
private static string TestDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "taxi-fare-test.csv");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd like these samples to work in multiple places, I'd think, including cross platform. Do these \ style paths work in Linux?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they don't work on Linux.

Instead, we should use Path.Combine("..", "..", "..", "..", "datasets").

private static string AppPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
private static string TrainDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "imdb_labelled.txt");
private static string TestDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "yelp_labelled.txt");
private static string ModelPath => Path.Combine(AppPath, "Models", "SentimentModel.zip");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Models", "SentimentModel.zip"); [](start = 65, length = 32)

I've noticed you have I think unintentionally also put the model files (the .zips) in your PR. I think that was unintentional, and is just an artifafct of the run. Either way we have the mechanism to create this artifact here, and we generally try to avoid checking in binary artifacts if it can be helped. Could you remove them?

{
private static async Task Main(string[] args)
{
if (args.Length != 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to simplify this sample a bit, so that people don't have to learn some syntax of a simple command line program? As far as I see the sample should just be a straight train of a model, followed by a label, right? Or are we showing how the artifact is preserved across trainings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the intention was to demonstrate how it is used in real life where you train it just once and inference many times. Completely agree that for the tutorial samples we want them to be simple train-evaluate-predict. This one is an end-to-end app that you would use for labeling issues. I will probably move it to separate folder for "apps infused with ML" examples.

## Labeling
When the model is trained, it can be used for predicting new issue's label. To do so, run the application with `"label"` key:
```
C:\GitHubLabeler\GitHubLabeler\bin\Debug\netcoreapp2.0>dotnet GitHubLabeler.dll label
Copy link
Contributor

@TomFinley TomFinley May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C:\GitHubLabeler\GitHubLabeler\bin\Debug\netcoreapp2.0> [](start = 0, length = 55)

Is it possible to make these samples a bit less Windows specific? That is, perhaps the command to enter should be just on a line somewhere, and you elsewhere say, that it will be placed in such-and-such a directory.

This is a simple prototype application to demonstrate how to use [ML.NET](https://www.nuget.org/packages/Microsoft.ML/) APIs. The main focus is on creating, training, and using ML (Machine Learning) model that is implemented in Predictor.cs class.

## Overview
GitHubLabeler is a .NET Core console application that runs from command-line interface (CLI) and allows to:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allows to [](start = 97, length = 9)

Missing pronoun of some sort... allows you to, allows one to, other.

using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using Microsoft.ML;
using System.Threading.Tasks;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether we like Microsoft before System or System before Microsoft, but we should probably pick one and make sure it's consistent.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most developers prefer System go first and have all others alphabetically sorted afterwards.


private static void Evaluate(PredictionModel<TaxiTrip, TaxiTripFarePrediction> model)
{
var testData = new TextLoader<TaxiTrip>(TestDataPath, useHeader: true, separator: ",");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var testData = new TextLoader(TestDataPath, useHeader: true, separator: ","); [](start = 12, length = 87)

Something is very wrong here... the loader is part of the model, or ought to be. We shouldn't have to respecify it, any more than we have to respecify the transforms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for evaluation it currently has to be specified, but this should be fixed. See #5 .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was working on #5 and found out that Loader is actually not part of the trained model in ML.Net. I have opened another issue #216 for this blocker.

@@ -0,0 +1,2 @@
<wpf:ResourceDictionary xml:space="preserve" xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" xmlns:s="clr-namespace:System;assembly=mscorlib" xmlns:ss="urn:shemas-jetbrains-com:settings-storage-xaml" xmlns:wpf="http://schemas.microsoft.com/winfx/2006/xaml/presentation">
<s:String x:Key="/Default/CodeInspection/CSharpLanguageProject/LanguageLevel/@EntryValue">CSharp72</s:String></wpf:ResourceDictionary>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shoudn't commit this file as this is from ReSharper ;-)

Copy link
Contributor Author

@OliaG OliaG May 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove it


b. To work with labels from your GitHub repository, you will need to train the model on your data. To do so, export GitHub issues from your repository into `.tsv` file with the following columns:
* ID - issue’s ID
* Area - issue’s label (named this way to avoid confusion with the Label concept in ML.NET)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

’ [](start = 18, length = 1)

Should we avoid non-ascii characters unless we actually need them? I think I'd prefer a plain-old single quote here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

PassengerCount = 1,
TripDistance = 10.33f,
PaymentType = "CSH",
FareAmount = 0 // predict it. actual = 29.5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we allow this field to be a null or NaN? It could help the user understand this isn't an input but the output calculated value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I can make it nullable type, getting Type mismatch exception.

@@ -0,0 +1,11 @@
MICROSOFT PROVIDES THE DATASETS ON AN "AS IS" BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INLCUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to put all the datasets in a single place - test/data or similar. That way they aren't duplicated unintentionally, and there is one place to see all of the datasets in the repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, was thinking about it too. I suggest moving datasets from test to the folder datasets in the root and merge them with data that is there for samples. This way it will be obvious for users where to look for it. I can do it if no objections.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another competing concern of having too many "things" in the root folder.

A structure I could imagine working would be:

build \

docs \

samples \
  datasets \  (or just 'data'?)
  GettingStarted \
  GitHubLabeler \

src \
  Project1 \
  Project2 \

test \
  Test1 \
  Test2 \

And all the tests changing to instead of read from test\data, they now point to samples\datasets (or samples\data).

<TargetFramework>netcoreapp2.0</TargetFramework>
</PropertyGroup>

<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|AnyCPU'">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PropertyGroup is incorrect, as the property below won't be set for Release builds.

If you want to set <LangVersion> to latest, it should be done unconditionally. You can move <LangVersion>latest</LangVersion> to the above PropertyGroup.

<TargetFramework>netcoreapp2.0</TargetFramework>
</PropertyGroup>

<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|AnyCPU'">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as above -

Move <LangVersion>latest</LangVersion> into the unconditional property group, and remove these two.

@TomFinley
Copy link
Contributor

Hi @OliaG thanks so much for these examples. Let's talk about the size of the example files:

Taxi test: about 25 MB.
Taxi train: about 25 MB.
CoreFX github issue train: about 20 MB.

For machine learning these are pretty small datasets, but they're pretty big as things to commit into a git repo.

Now, we could perhaps do something like utilize Git LFS, which would be an OK remediation except that I see that this PR was done not by working out of a fork then trying to merge back into the main repo, but rather was done by establishing a branch within the main repo itself. So even if we fix it to not have these large files directly, this repo itself is somewhat tainted, since a clone or fork of the repo will contain these large files.

Do we know how to clean this up? I know it's possible, I just don't know how offhand. Maybe @eerhardt ?

Maybe development in forks a good idea, as opposed to branches in the main repo. If we look at the other PRs currently open, they're from branches in people's forks of this repo, not the repo itself. Perhaps we should clarify that somewhere. (Contribution guide, other?)

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi all, sorry just waiting on changes, especially over large files. I noticed these changes were approved immediately, just want to make sure nothing bad happens.

@@ -0,0 +1,1000 @@
A very, very, very slow-moving, aimless movie about a distressed, drifting young man. 0
Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out. 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Labeled, not labelled, FYI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is British VS US spelling. And it is original dataset's name, I did not change it. But I can remove extra "l" :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When in Rome ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh really? Hmmm. I never knew that. How strange. I guess if that's the original name we should keep it.

mandyshieh and others added 2 commits May 23, 2018 10:30
* added ceiling to mathutils and have parquetloader default to sequential reading instead of throwing

* factor out sequence creation and add check for overflow in MathUtils
…ph variable outputs (#152)

* Adding support for training metrics in PipelineSweeperMacro and needed support files. Also includes new output information in PipelineSweeperMacro output graph to make consumption of returned pipelines easier.

* Changed where XML comment was placed.

* Added more tests (uncommented and fixed) for auto inference. Changed magic number in AutoMlUtils to be mix double value, per review comment.

* Added another test, TestPipelineSweeperMacroNoTransforms.

* Updated tests to include warning disabling (following Zeeshan S's example) to get build working.

* Changes to checks in AutoMlUtils (more correct usage of them).

* Fixing issue with ExceptParam using value and not name of parameters.

* Fixing errors on use of ExceptParam.
{
public class SentimentData
{
[Column("0")] public string SentimentText;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest putting the attributes on separate lines as that's more in line with default formatting.

{
private static string AppPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
private static string TrainDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "iris_train.txt");
private static string TestDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "iris_test.txt");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use backslashes as this doesn't work cross-platform. You should do:

Path.Combine(AppPath, "..", "..", "..", "..", "datasets", "iris_train.txt");

using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using Microsoft.ML;
using System.Threading.Tasks;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most developers prefer System go first and have all others alphabetically sorted afterwards.

@@ -0,0 +1,288 @@
## Ignore Visual Studio temporary files, build results, and
## files generated by popular Visual Studio add-ons.
##

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. We should only use the one from the root.

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<appSettings>
<!--TODO: Please enter your own credentials here. -->

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a great solution for this. In general, the recommendation is to avoid using credentials from code and instead use certificates, user-based authentication, or a service like Azure KeyVault. Unfortunately this doesn't work for samples very well...

@@ -0,0 +1,2 @@
<wpf:ResourceDictionary xml:space="preserve" xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" xmlns:s="clr-namespace:System;assembly=mscorlib" xmlns:ss="urn:shemas-jetbrains-com:settings-storage-xaml" xmlns:wpf="http://schemas.microsoft.com/winfx/2006/xaml/presentation">
<s:String x:Key="/Default/CodeInspection/CSharpLanguageProject/LanguageLevel/@EntryValue">CSharp72</s:String></wpf:ResourceDictionary>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shoudn't commit this file as this is from ReSharper ;-)


b. To work with labels from your GitHub repository, you will need to train the model on your data. To do so, export GitHub issues from your repository into `.tsv` file with the following columns:
* ID - issue’s ID
* Area - issue’s label (named this way to avoid confusion with the Label concept in ML.NET)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

yaeldMS and others added 3 commits May 23, 2018 11:30
Ivanidzo4ka and others added 8 commits May 23, 2018 16:22
* replace housing uci dataset to wine quality
Moved datasets from tests to examples
Removed model files
- Iris sample
- Sentiment Analysis sample
- TaxiFare sample
-GitHub issues classification sample
Moved datasets from tests to examples
Removed model files
@KrzysztofCwalina
Copy link
Member

This is a very nice PR, but shouldn't we keep all samples/examples in the same place? So far, we have been collecting such samples in tests/Microsoft/ML/Scenarios.

@OliaG
Copy link
Contributor Author

OliaG commented May 24, 2018

@KrzysztofCwalina regarding keep all samples/example in the same place, here is my understanding.

test/Microsoft/ML/Scenarios are integration tests the primary goal of which is to test our code end-to-end. We can run them to make sure there are no breaking changes introduced by new functionality.

The examples (this PR) are a way of showcasing our functionality to users. They are not tests but independent projects, sometimes applications containing additional logic (like posting requests to GitHub). The examples have to be discoverable for users (like root folder “examples”, it will be hard for users to find them here: test/Microsoft/ML/Scenarios) and F5-ble, users should be able do download just one example and successfully run it. You can’t do it with integration tests.

So for me they belong to different areas and that’s why I would keep tests in tests, and this samples in examples folder. It is a good idea though to mention these tests in docs or readme, so users can find additional use cases.

@@ -0,0 +1,191 @@
# IDV File Format

This document describes ML.NET's Binary dataview file format, version 1.1.1.5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks familiar... was this intended to be part of this PR? Was there a bad merge somewhere?

@OliaG OliaG closed this May 24, 2018
@OliaG OliaG mentioned this pull request May 24, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.