Added a test showing example of text classification using TensorFlow in ML.Net #2302

Merged 13 commits on Jan 30, 2019
2 changes: 1 addition & 1 deletion build/Dependencies.props
@@ -21,7 +21,7 @@
<SystemDrawingCommonPackageVersion>4.5.0</SystemDrawingCommonPackageVersion>
<SystemIOFileSystemAccessControl>4.5.0</SystemIOFileSystemAccessControl>
<SystemSecurityPrincipalWindows>4.5.0</SystemSecurityPrincipalWindows>
-<TensorFlowVersion>1.10.0</TensorFlowVersion>
+<TensorFlowVersion>1.12.0</TensorFlowVersion>
</PropertyGroup>

<!-- Code Analyzer Dependencies -->
17 changes: 17 additions & 0 deletions src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs
@@ -141,5 +141,22 @@ public static ValueMappingEstimator<TInputType, TOutputType> ValueMap<TInputType
IEnumerable<TOutputType> values,
params (string source, string name)[] columns)
=> new ValueMappingEstimator<TInputType, TOutputType>(CatalogUtils.GetEnvironment(catalog), keys, values, columns);

/// <summary>
/// Maps the <paramref name="columns.input"/> using the keys in the dictionary to the values of the dictionary, i.e.
/// a value 'x' in the <paramref name="columns.input"/> would be mapped to the value stored in dictionary[x].
/// In this case, the <paramref name="lookupMap"/> is used to build up the dictionary, where <paramref name="keyColumn"/>
/// and <paramref name="valueColumn"/> specify the keys and values of the dictionary, respectively.
Member
@wschin wschin Jan 29, 2019
Can you add "a value x in the input would be mapped to the value stored in dictionary[x]"? #Resolved

/// </summary>
/// <param name="catalog">The categorical transform's catalog</param>
/// <param name="lookupMap">An instance of <see cref="IDataView"/> that contains the key and value columns.</param>
/// <param name="keyColumn">Name of the key column in <paramref name="lookupMap"/>.</param>
/// <param name="valueColumn">Name of the value column in <paramref name="lookupMap"/>.</param>
/// <param name="columns">The columns to apply this transform on.</param>
/// <returns></returns>
public static ValueMappingEstimator ValueMap(
this TransformsCatalog.ConversionTransforms catalog,
IDataView lookupMap, string keyColumn, string valueColumn, params (string input, string output)[] columns)
Member
@eerhardt eerhardt Jan 29, 2019
@TomFinley @sfilipi - is this consistent with the order we've decided on with #2064? #Resolved

Contributor Author
Any thoughts guys? I saw the method above has same pattern so I followed that.


Member
(string outputColumnName, string inputColumnName)

You'll see that if you update to latest.


=> new ValueMappingEstimator(CatalogUtils.GetEnvironment(catalog), lookupMap, keyColumn, valueColumn, columns);
}
}
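
To illustrate the lookup semantics described in the doc comment above (a value x is mapped to dictionary[x]), here is a minimal standalone sketch. It does not use the ML.NET API: the dictionary, its made-up word ids, and the helper MapTokens are hypothetical, and mapping unknown words to 0 is an assumption made only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical lookup table standing in for the key and value columns of lookupMap.
// The ids are made up and are not taken from imdb_word_index.csv.
var wordToId = new Dictionary<string, int>
{
    ["this"] = 1,
    ["film"] = 2,
    ["was"] = 3,
    ["brilliant"] = 4,
};

// Hypothetical helper: maps each word to dictionary[word].
// Assumption for this sketch only: words missing from the dictionary map to 0.
static int[] MapTokens(IEnumerable<string> words, IReadOnlyDictionary<string, int> map)
    => words.Select(w => map.TryGetValue(w, out var id) ? id : 0).ToArray();

// Tokenize on spaces, then map every token through the dictionary.
string[] tokens = "this film was just brilliant".Split(' ');
int[] ids = MapTokens(tokens, wordToId);

Console.WriteLine(string.Join(",", ids)); // prints 1,2,3,0,4
```

The result has one id per token, so its length varies with the input text, which is why the test later resizes the vector before scoring.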

This file was deleted.

@@ -0,0 +1 @@
090706417EC29D91EEEABC5C25576374A86426CF25F27556C0EED4FD815D814C4F09FA7389ED8F614E4B34BF6438B9AE0ADA402BEA7CC9441446AB783A6F187D

This file was deleted.

@@ -0,0 +1 @@
5359609DDF69D66474F720D6A1ED669942FEB6842096CFC3EAF44B84FA3F2F659829778446BD3C7C83871F7293CA481AC4732DF6DC7921ADA100B459E37198BD

This file was deleted.

@@ -0,0 +1 @@
49DB72CDD8D10B78BB1CD17A058DF508E04B38BD287FF53EB9173A48D3994E11741B1EE6C9108303739819845F2F9D777EE3E767D737C24DB3A28B67FF68C951
2 changes: 1 addition & 1 deletion test/Microsoft.ML.Tests/Microsoft.ML.Tests.csproj
@@ -46,7 +46,7 @@
<NativeAssemblyReference Condition="'$(OS)' != 'Windows_NT'" Include="tensorflow_framework" />
</ItemGroup>
<ItemGroup>
-<PackageReference Include="Microsoft.ML.TensorFlow.TestModels" Version="0.0.6-test" />
+<PackageReference Include="Microsoft.ML.TensorFlow.TestModels" Version="0.0.7-test" />
<PackageReference Include="Microsoft.ML.Onnx.TestModels" Version="0.0.2-test" />
</ItemGroup>
</Project>
@@ -11,8 +11,10 @@
using Microsoft.ML.ImageAnalytics;
using Microsoft.ML.RunTests;
using Microsoft.ML.Transforms;
using Microsoft.ML.Transforms.Conversions;
using Microsoft.ML.Transforms.Normalizers;
using Microsoft.ML.Transforms.TensorFlow;
using Microsoft.ML.Transforms.Text;
using Xunit;

namespace Microsoft.ML.Scenarios
@@ -846,5 +848,59 @@ public void TensorFlowTransformCifarInvalidShape()
}
Assert.True(thrown);
}

/// <summary>
/// Class to hold features and predictions.
/// </summary>
public class TensorFlowSentiment
{
public string Sentiment_Text;
[VectorType(600)]
public int[] Features;
[VectorType(2)]
public float[] Prediction;
}

[ConditionalFact(typeof(Environment), nameof(Environment.Is64BitProcess))]
public void TensorFlowSentimentClassificationTest()
Contributor Author
@zeahmed zeahmed Jan 29, 2019
The test is going to fail as Microsoft.ML.TensorFlow.TestModels nuget is not updated yet. #Resolved

{
var mlContext = new MLContext(seed: 1, conc: 1);
var data = new[] { new TensorFlowSentiment() { Sentiment_Text = "this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert is an amazing actor and now the same being director father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also to the two little boy's that played the of norman and paul they were just brilliant children are often left out of the list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all" } };
var dataView = mlContext.Data.ReadFromEnumerable(data);

var lookupMap = mlContext.Data.ReadFromTextFile(@"sentiment_model/imdb_word_index.csv",
columns: new[]
{
new TextLoader.Column("Words", DataKind.TX, 0),
new TextLoader.Column("Ids", DataKind.I4, 1),
},
separatorChar: ','
);

// ML.NET cannot resize a variable-length vector to a fixed-length vector,
// so the trick here is to create two pipelines.
// The first pipeline, 'dataPipe', tokenizes the string into words and maps each word to an integer, which is its index in the dictionary.
// This integer vector is then retrieved from the pipeline and resized to a fixed length.
// The second pipeline, 'tfEnginePipe', takes the resized integer vector, passes it to TensorFlow, and gets the classification scores.
var estimator = mlContext.Transforms.Text.TokenizeWords("Sentiment_Text", "TokenizedWords")
Member
@sfilipi sfilipi Jan 29, 2019

TokenizeWords("Sentiment_Text", "TokenizedWords")

if you rebase to latest, you'll have to swap those. #Resolved

.Append(mlContext.Transforms.Conversion.ValueMap(lookupMap, "Words", "Ids", new[] { ("TokenizedWords", "Features") }));
Member
@abgoswam abgoswam Jan 29, 2019

ValueMap

  • Is this transform doing the resizing to the fixed length?

  • Does it matter what the fixed length is? I presume the model was built with a fixed-length input shape, but I do not see the shape specified. #Resolved

Contributor Author
Resizing is done at line 893 in the C# code.


Member
I do not see shape specified in line 893


Contributor Author
Sorry, not 893. It's line 899.


Member
@abgoswam abgoswam Jan 29, 2019
We are calling the Predict API and hence in Line #899 we can resize the output of the first pipeline.

How would this work for the 'Transform' API... where the testData has 2 rows (say)

row1 -> "Hi" -> dimension 50 (say)
row2 -> "(some long sentence)" -> dimension 5000 (say)

Will it work ?


Contributor Author
Here we are not training the TF model at all. It is just the prediction pipeline. For the case you are mentioning, it would require the same resize operation on dataview instead of single prediction.


Member
@abgoswam abgoswam Jan 29, 2019
I understand we are not training the TF model. The Fit() for the TFTransform would not do anything in this example.

I wanted to know if the re-size operation on dataview would be supported -- If it is supported, can we add it to the unit test with at least 2 rows of text data + use of the Transform API ?

This test case does single prediction (use of Predict API) but does not show use of the Transform API where we would have to re-size more than just 1 row of variable length vector.


Contributor Author
This is actually not in the scope of this test. I will try to add more training related test later on but not in this PR because of the scope.
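
The per-row resize discussed in this thread could look roughly like the following sketch, which operates on plain arrays rather than on an IDataView. The helper ToFixedLength, the sample rows, and the length 5 are hypothetical; the test in this PR resizes a single row to 600 with Array.Resize, which likewise pads with zeros or truncates.

```csharp
using System;

// Hypothetical helper: returns a copy of `source` with exactly `length` entries,
// padding with zeros when the row is too short and truncating when it is too long.
static int[] ToFixedLength(int[] source, int length)
{
    var result = new int[length]; // new arrays are zero-initialized
    Array.Copy(source, result, Math.Min(source.Length, length));
    return result;
}

// Two illustrative rows of word indices with different lengths (made-up values).
int[][] rows =
{
    new[] { 14, 22 },
    new[] { 1, 2, 3, 4, 5, 6, 7, 8 },
};

const int FixedLength = 5; // small value for readability

foreach (var row in rows)
{
    int[] resized = ToFixedLength(row, FixedLength);
    Console.WriteLine(string.Join(",", resized));
}
// prints:
// 14,22,0,0,0
// 1,2,3,4,5
```

Whether zero padding and tail truncation suit a given model depends on how that model was trained, so treat the padding value here as an assumption.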


var dataPipe = estimator.Fit(dataView)
Member
@abgoswam abgoswam Jan 29, 2019

dataPipe

is there a particular reason why we have dataPipe and tfEnginePipe separate ? #Resolved

Contributor Author
Added comments in the code.


.CreatePredictionEngine<TensorFlowSentiment, TensorFlowSentiment>(mlContext);

// For an explanation of how the `sentiment_model` was created,
// see https://github.com/dotnet/machinelearning-testdata/blob/master/Microsoft.ML.TensorFlow.TestModels/sentiment_model/README.md
string modelLocation = @"sentiment_model";
Member
@abgoswam abgoswam Jan 29, 2019

sentiment_model

So this TF model takes a vector of floats as input. Am I right?

Perhaps we should add a comment on how the model was created, etc. #Resolved

Contributor Author
No, it takes integers as input.


var tfEnginePipe = mlContext.Transforms.ScoreTensorFlowModel(modelLocation, new[] { "Features" }, new[] { "Prediction/Softmax" })
.Append(mlContext.Transforms.CopyColumns(("Prediction/Softmax", "Prediction")))
.Fit(dataView)
.CreatePredictionEngine<TensorFlowSentiment, TensorFlowSentiment>(mlContext);

var processedData = dataPipe.Predict(data[0]);
Array.Resize(ref processedData.Features, 600);
var prediction = tfEnginePipe.Predict(processedData);

Assert.Equal(2, prediction.Prediction.Length);
Member
@eerhardt eerhardt Jan 29, 2019
Can we verify that the predictions were somewhat correct? #Resolved

Assert.InRange(prediction.Prediction[1], 0.650032759 - 0.01, 0.650032759 + 0.01);
Member
@wschin wschin Jan 29, 2019
Please use Assert.Equal. If there are only two prediction values, can we check them all? #Resolved

Contributor Author
What do you mean by Assert.Equal? Here we are checking that the value is within a particular threshold of the expected one.
No need to check the other value. These are probabilities.


Contributor Author
OK, I see what you meant by Assert.Equal. I actually want to check that my values are in a range, e.g. 0.64 <= prediction <= 0.66, which I cannot do with Assert.Equal, can I?
Also, I find InRange more readable than the alternatives when asserting thresholds.


Member
@wschin wschin Jan 29, 2019
You have tolerance in Assert.Equal. #Resolved

Contributor Author
It uses number of decimal places which is not applicable here.


Contributor
@justinormont justinormont Jan 30, 2019
We could trim 0.650032759 to 0.65, if we're comparing as ± 0.01.
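
To make the difference debated in this thread concrete, the following standalone sketch contrasts a range check with a comparison made after rounding to a fixed number of decimal places. It does not call xUnit: IsWithin and EqualAfterRounding are hypothetical helpers, and 0.6449 is an illustrative value, not a model output.

```csharp
using System;

// Hypothetical helper: true when `value` lies within +/- `tolerance` of `reference`.
static bool IsWithin(double value, double reference, double tolerance)
    => Math.Abs(value - reference) <= tolerance;

// Hypothetical helper: compares two values after rounding both to `decimals` places.
static bool EqualAfterRounding(double value, double reference, int decimals)
    => Math.Round(value, decimals) == Math.Round(reference, decimals);

double expected = 0.65;
double actual = 0.6449; // illustrative value only

// The range check accepts the value because it differs by about 0.005.
Console.WriteLine(IsWithin(actual, expected, 0.01)); // prints True

// Rounding to two places gives 0.64 versus 0.65, so this comparison rejects it.
Console.WriteLine(EqualAfterRounding(actual, expected, 2)); // prints False
```

The two checks disagree near rounding boundaries, which is one reason a range assertion can express a ± 0.01 threshold more directly.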

}
}
}