Transforms components docs #1321

Merged: 22 commits, Oct 30, 2018
Changes from 15 commits
66 changes: 66 additions & 0 deletions docs/samples/Microsoft.ML.Samples/Dynamic/ConcatTransform.cs
@@ -0,0 +1,66 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

// the alignment of the usings with the methods is intentional so they can display on the same level in the docs site.
using Microsoft.ML.Runtime.Data;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Data;
using System;
using System.Linq;
using System.Collections.Generic;

namespace Microsoft.ML.Samples.Dynamic
{
public partial class TransformSamples
{
class SampleInfertDataWithFeatures
{
public VBuffer<int> Features { get; set; }
}

public static void ConcatTransform()
{
// Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging,
// as well as the source of randomness.
var ml = new MLContext(seed: 1, conc: 1);

// Get a small dataset as an IEnumerable.
IEnumerable<SamplesUtils.DatasetUtils.SampleInfertData> data = SamplesUtils.DatasetUtils.GetInfertData();
var trainData = ml.CreateStreamingDataView(data);

// Preview of the data.
// Age Case Education induced parity pooled.stratum row_num ...
// 26.0 1.0 0-5yrs 1.0 6.0 3.0 1.0 ...
// 42.0 1.0 0-5yrs 1.0 1.0 1.0 2.0 ...
// 39.0 1.0 0-5yrs 2.0 6.0 4.0 3.0 ...
// 34.0 1.0 0-5yrs 2.0 4.0 2.0 4.0 ...
// 35.0 1.0 6-11yrs 1.0 3.0 32.0 5.0 ...

// A pipeline for concatenating the Age, Parity and Induced columns together in the Features column.
string outputColumnName = "Features";
var pipeline = new ConcatEstimator(ml, outputColumnName, new[] { "Age", "Parity", "Induced" });

// The transformed data.
var transformedData = pipeline.Fit(trainData).Transform(trainData);

// Getting the data of the newly created column as an IEnumerable of SampleInfertDataWithFeatures.
var featuresColumn = transformedData.AsEnumerable<SampleInfertDataWithFeatures>(ml, reuseRowObject: false);

Review comment by @GalOshri (Contributor), Oct 23, 2018:
What does reuseRowObject do? Would defaults be fine for these samples? #Resolved

Reply from the Member Author:
For the enumerable, it determines whether to return the same object on every row or to allocate a new one per row. It is a required parameter; it doesn't have a default.

For the settings of the transforms, I am using both defaults and non-defaults, since the purpose of this snippet is to educate about usage.

Console.WriteLine($"{outputColumnName} column obtained post-transformation.");
foreach (var featureRow in featuresColumn)
{
foreach (var value in featureRow.Features.Values)
Console.Write($"{value} ");
Console.WriteLine("");
}

// Features
// 26 6 1
// 42 1 1
// 39 6 2
// 34 4 2
// 35 3 1
}
}
}
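
To illustrate the reuseRowObject behavior described in the review thread above, here is a minimal sketch in plain C#. It does not call ML.NET: the `Row` type and the `Rows` helper are hypothetical stand-ins, and the sketch only assumes the semantics stated in the reply (same object on every row versus a new object per row).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type; not part of ML.NET.
class Row
{
    public int Age { get; set; }
}

static class ReuseRowObjectSketch
{
    // Yields one Row per input value. With reuse == true the same instance is
    // mutated and yielded again; otherwise a fresh instance is allocated per row.
    public static IEnumerable<Row> Rows(int[] ages, bool reuse)
    {
        var shared = new Row();
        foreach (var age in ages)
        {
            var row = reuse ? shared : new Row();
            row.Age = age;
            yield return row;
        }
    }

    public static void Main()
    {
        var ages = new[] { 26, 42, 39 };

        // Materializing the sequence keeps references to the yielded objects.
        var fresh = Rows(ages, reuse: false).ToList();
        var reused = Rows(ages, reuse: true).ToList();

        Console.WriteLine(string.Join(" ", fresh.Select(r => r.Age)));  // 26 42 39
        Console.WriteLine(string.Join(" ", reused.Select(r => r.Age))); // 39 39 39
    }
}
```

Under these assumptions, reusing the row object saves allocations when rows are consumed one at a time, but any code that stores the rows (as `ToList` does here) ends up with several references to the last row.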
111 changes: 111 additions & 0 deletions docs/samples/Microsoft.ML.Samples/Dynamic/KeyToValue_Term.cs
@@ -0,0 +1,111 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

// the alignment of the usings with the methods is intentional so they can display on the same level in the docs site.
using Microsoft.ML.Data;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Runtime.Data;
using Microsoft.ML.Transforms.Text;
using System;
using System.Collections.Generic;
using System.Linq;

namespace Microsoft.ML.Samples.Dynamic
{
public partial class TransformSamples
{
public static void KeyToValue_Term()

Review comment by a Contributor, on KeyToValue_Term:
This stands out. What does the "_" mean, and why can't it be KeyToValueAndTerm or KeyToValueThenTerm?

{
// Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging,
// as well as the source of randomness.
var ml = new MLContext(seed: 1, conc: 1);

// Get a small dataset as an IEnumerable.
IEnumerable<SamplesUtils.DatasetUtils.SampleTopicsData> data = SamplesUtils.DatasetUtils.GetTopicsData();
var trainData = ml.CreateStreamingDataView(data);

// Preview of the topics data: a dataset in which two sets of keys, Review and ReviewReverse, are assigned to a body of text.
// The dataset will be used to classify how accurately the keys are assigned to the text.

Review comment by @shmoradims, Oct 25, 2018, on ReviewReverse:
It's not clear to me what this column means. #Resolved

// Review, ReviewReverse, Label
// "animals birds cats dogs fish horse", "radiation galaxy universe duck", 1
// "horse birds house fish duck cats", "space galaxy universe radiation", 0
// "car truck driver bus pickup", "bus pickup", 1
// "car truck driver bus pickup horse", "car truck", 0

Review comment by @justinormont (Contributor), Oct 23, 2018:
May want to state the goal of the dataset, e.g. "The goal of the dataset is to classify if the review matches ..."

I ask mainly because, reading the example, I have no idea what the labels represent vs. the data. #Resolved

Reply:
+1. It's not obvious what the labels mean.

// A pipeline to convert the terms of the ReviewReverse column into keys,
// making use of default settings.
string defaultColumnName = "DefaultKeys";
// REVIEW create through the catalog extension
var default_pipeline = new WordTokenizeEstimator(ml, "ReviewReverse", "ReviewReverse")

Review comment by @shmoradims, Oct 25, 2018, on ReviewReverse:
I think it's better to leave this out, if you want to do the transformation in-place. #Resolved

.Append(new TermEstimator(ml, "ReviewReverse", defaultColumnName));

// Another pipeline, that customizes the advanced settings of the TermEstimator.

Review comment by @justinormont (Contributor), Oct 23, 2018:
May want to say why changing the hyperparameters of the TermEstimator is useful. Why keep only the first 10 ASCIIbetically ordered terms? #Pending

Reply from @sfilipi (Member Author), Oct 23, 2018:
See if you'd phrase it differently?

// We can change maxNumTerms to limit how many keys get generated out of the set of words,
// and control the order in which they are assigned by changing sort from the default Occurrence (the order in which terms are encountered)
// to Value (alphabetical order).
string customizedColumnName = "CustomizedKeys";
var customized_pipeline = new WordTokenizeEstimator(ml, "ReviewReverse", "ReviewReverse")

Review comment by @shmoradims, Oct 25, 2018, on ReviewReverse:
plz remove for in-place transformation like above #Resolved

.Append(new TermEstimator(ml, "ReviewReverse", customizedColumnName, maxNumTerms: 10, sort: TermTransform.SortOrder.Value));

// The transformed data.
var transformedData_default = default_pipeline.Fit(trainData).Transform(trainData);
var transformedData_customized = customized_pipeline.Fit(trainData).Transform(trainData);

// Small helper to print the text inside the columns, in the console.
Action<string, IEnumerable<VBuffer<uint>>> printHelper = (columnName, column) =>
{
Console.WriteLine($"{columnName} column obtained post-transformation.");
foreach (var row in column)
{
foreach (var value in row.Values)
Console.Write($"{value} ");
Console.WriteLine("");
}

Console.WriteLine("===================================================");
};

// Preview of the DefaultKeys column obtained after processing the input.
var defaultColumn = transformedData_default.GetColumn<VBuffer<uint>>(ml, defaultColumnName);
printHelper(defaultColumnName, defaultColumn);

// DefaultKeys column obtained post-transformation.
// 1 2 3 4
// 5 2 3 1
// 6 7 3 1
// 8 9 3 1

// Previewing the CustomizedKeys column obtained after processing the input.
var customizedColumn = transformedData_customized.GetColumn<VBuffer<uint>>(ml, customizedColumnName);
printHelper(customizedColumnName, customizedColumn);

// CustomizedKeys
// 6 4 9 3
// 7 4 9 6
// 1 5 9 6
// 2 8 9 6

// Retrieve the original values by appending the KeyToValue estimator to the existing pipeline,
// which converts the keys back to strings.
var pipeline = default_pipeline.Append(new KeyToValueEstimator(ml, defaultColumnName));
transformedData_default = pipeline.Fit(trainData).Transform(trainData);

// Preview of the DefaultKeys column obtained.
var originalColumnBack = transformedData_default.GetColumn<VBuffer<ReadOnlyMemory<char>>>(ml, defaultColumnName);

foreach (var row in originalColumnBack)
{
foreach (var value in row.Values)
Console.Write($"{value} ");
Console.WriteLine("");
}

// DefaultKeys
// radiation galaxy universe duck
// space galaxy universe radiation
// bus pickup universe radiation

Review comment by @sfilipi (Member Author), Oct 19, 2018, on "universe radiation":
why is this here, log an issue post merge.

// car truck universe radiation

Review comment by @shmoradims, Oct 25, 2018, on "universe radiation":
this seems to be a bug too. #Closed

Reply from the Member Author:
right, will investigate post PR.

}
}
}
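
The term-to-key mapping and its inverse can be sketched in plain C# without ML.NET. This is an illustration under stated assumptions, not the TermEstimator implementation: `TermMappingSketch`, `BuildKeys` and `Encode` are hypothetical names, keys are assumed to start at 1, and "sort by value" is assumed to mean ordinal string order. On the ReviewReverse rows from the preview it reproduces the leading keys shown in the sample comments; the two shorter rows yield only two keys each, without the trailing values flagged in the review.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TermMappingSketch
{
    // Builds a term -> key dictionary. Keys start at 1 (an assumption).
    // Terms are numbered in first-occurrence order, or in ordinal
    // string order when sortByValue is true.
    public static Dictionary<string, uint> BuildKeys(IEnumerable<string> rows, bool sortByValue)
    {
        var terms = new List<string>();
        var seen = new HashSet<string>();
        foreach (var row in rows)
            foreach (var term in row.Split(' '))
                if (seen.Add(term))
                    terms.Add(term);

        if (sortByValue)
            terms.Sort(StringComparer.Ordinal);

        var keys = new Dictionary<string, uint>();
        for (int i = 0; i < terms.Count; i++)
            keys[terms[i]] = (uint)(i + 1);
        return keys;
    }

    // Replaces every term in a row with its key.
    public static string Encode(string row, Dictionary<string, uint> keys) =>
        string.Join(" ", row.Split(' ').Select(t => keys[t]));

    public static void Main()
    {
        var rows = new[]
        {
            "radiation galaxy universe duck",
            "space galaxy universe radiation",
            "bus pickup",
            "car truck",
        };

        var byOccurrence = BuildKeys(rows, sortByValue: false);
        var byValue = BuildKeys(rows, sortByValue: true);

        Console.WriteLine(Encode(rows[1], byOccurrence)); // 5 2 3 1
        Console.WriteLine(Encode(rows[1], byValue));      // 7 4 9 6

        // Key -> value is the inverse lookup.
        var valueOf = byOccurrence.ToDictionary(p => p.Value, p => p.Key);
        Console.WriteLine(string.Join(" ", new uint[] { 5, 2, 3, 1 }.Select(k => valueOf[k])));
        // space galaxy universe radiation
    }
}
```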
88 changes: 88 additions & 0 deletions docs/samples/Microsoft.ML.Samples/Dynamic/MinMaxNormalizer.cs
@@ -0,0 +1,88 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

// the alignment of the usings with the methods is intentional so they can display on the same level in the docs site.
using Microsoft.ML.Runtime.Data;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Data;
using System;
using System.Collections.Generic;

namespace Microsoft.ML.Samples.Dynamic
{
public partial class TransformSamples
{
public static void MinMaxNormalizer()
{
// Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging,
// as well as the source of randomness.
var ml = new MLContext(seed: 1, conc: 1);

// Get a small dataset as an IEnumerable and convert it to an IDataView.
IEnumerable<SamplesUtils.DatasetUtils.SampleInfertData> data = SamplesUtils.DatasetUtils.GetInfertData();
var trainData = ml.CreateStreamingDataView(data);

// Preview of the data.
// Age Case Education Induced Parity PooledStratum RowNum ...
// 26 1 0-5yrs 1 6 3 1 ...
// 42 1 0-5yrs 1 1 1 2 ...
// 39 1 0-5yrs 2 6 4 3 ...
// 34 1 0-5yrs 2 4 2 4 ...
// 35 1 6-11yrs 1 3 32 5 ...

// A pipeline for normalizing the Induced column.
var pipeline = ml.Transforms.Normalizer("Induced");
// The transformed (normalized according to Normalizer.NormalizerMode.MinMax) data.
var transformedData = pipeline.Fit(trainData).Transform(trainData);
// Getting the data of the newly created column, so we can preview it.
var normalizedColumn = transformedData.GetColumn<float>(ml, "Induced");

// A small printing utility.
Action<string, IEnumerable<float>> printHelper = (colName, column) =>
{
Console.WriteLine($"{colName} column obtained post-transformation.");
foreach (var row in column)
Console.WriteLine($"{row} ");
};

printHelper("Induced", normalizedColumn);

// Preview of the data.
// Induced

Review comment by @shmoradims, Oct 24, 2018, on "Induced":
nit: this doesn't match $"{colName} column obtained post-transformation." #Closed

Reply from the Member Author:
The col name is "Induced", see line 39.

Reply from @shmoradims, Oct 25, 2018:
I meant: are these comments supposed to be the output from printHelper? If they are, then the headers don't match:
// Preview of the data.
// Induced

vs

Console.WriteLine($"{colName} column obtained post-transformation.");

The same applies to the other comments showing a data preview.

// 0.5
// 0.5
// 1
// 1
// 0.5

// Composing a different pipeline if we wanted to normalize more than one column at a time.
// Using log scale as the normalization mode.
var multiColPipeline = ml.Transforms.Normalizer(Normalizer.NormalizerMode.LogMeanVariance, new[] { ("Induced", "LogInduced"), ("Spontaneous", "LogSpontaneous") });
// The transformed data.
var multiColtransformedData = multiColPipeline.Fit(trainData).Transform(trainData);

// Getting the newly created columns.
var normalizedInduced = multiColtransformedData.GetColumn<float>(ml, "LogInduced");
var normalizedSpont = multiColtransformedData.GetColumn<float>(ml, "LogSpontaneous");

printHelper("LogInduced", normalizedInduced);

// LogInduced
// 0.2071445
// 0.2071445
// 0.889631
// 0.889631
// 0.2071445

printHelper("LogSpontaneous", normalizedSpont);

// LogSpontaneous
// 0.8413026
// 0
// 0
// 0
// 0.1586974
}
}
}
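
The arithmetic behind the normalized Induced values can be sketched in plain C#. This is an assumption, not the library's code: the sample output (1 becomes 0.5, 2 becomes 1) is consistent with a zero-preserving min-max variant that divides each value by the largest absolute value in the column, rather than the classic (x - min) / (max - min). `MinMaxSketch` and `Scale` are hypothetical names, and only the five previewed Induced values are used as input.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class MinMaxSketch
{
    // Zero-preserving min-max scaling (assumed variant): x / max(|x|).
    // Zero stays zero and every result lies in [-1, 1].
    public static float[] Scale(float[] values)
    {
        float maxAbs = values.Max(v => Math.Abs(v));
        if (maxAbs == 0f)
            return (float[])values.Clone(); // all zeros: nothing to scale
        return values.Select(v => v / maxAbs).ToArray();
    }

    public static void Main()
    {
        // Induced values from the five previewed rows.
        var induced = new float[] { 1, 1, 2, 2, 1 };

        foreach (var value in Scale(induced))
            Console.WriteLine(value.ToString(CultureInfo.InvariantCulture));
        // 0.5
        // 0.5
        // 1
        // 1
        // 0.5
    }
}
```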
84 changes: 84 additions & 0 deletions docs/samples/Microsoft.ML.Samples/Dynamic/TextTransform.cs
@@ -0,0 +1,84 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

// the alignment of the usings with the methods is intentional so they can display on the same level in the docs site.
using Microsoft.ML.Runtime.Data;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Data;
using System;
using System.Collections.Generic;

namespace Microsoft.ML.Samples.Dynamic
{
public partial class TransformSamples
{
public static void TextTransform()
{
// Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging,
// as well as the source of randomness.
var ml = new MLContext(seed: 1, conc: 1);

// Get a small dataset as an IEnumerable and convert to IDataView.
IEnumerable<SamplesUtils.DatasetUtils.SampleSentimentData> data = SamplesUtils.DatasetUtils.GetSentimentData();
var trainData = ml.CreateStreamingDataView(data);

// Preview of the data.
// Sentiment SentimentText
// true Best game I've ever played.
// false ==RUDE== Dude, 2.
// true Until the next game, this is the best Xbox game!

// A pipeline for featurization of the "SentimentText" column, placing the output in a new column named "DefaultTextFeatures".
// The pipeline uses the default settings to featurize.
string defaultColumnName = "DefaultTextFeatures";
var default_pipeline = ml.Transforms.Text.FeaturizeText("SentimentText", defaultColumnName);

// Another pipeline, that customizes the advanced settings of the FeaturizeText transformer.
string customizedColumnName = "CustomizedTextFeatures";
var customized_pipeline = ml.Transforms.Text.FeaturizeText("SentimentText", customizedColumnName, s =>
{
s.KeepPunctuations = false;
s.KeepNumbers = false;
s.OutputTokens = true;
s.TextLanguage = Runtime.Data.TextTransform.Language.English; // supports English, French, German, Dutch, Italian, Spanish, Japanese
});

// The transformed data for both pipelines.
var transformedData_default = default_pipeline.Fit(trainData).Transform(trainData);
var transformedData_customized = customized_pipeline.Fit(trainData).Transform(trainData);

// Small helper to print the text inside the columns, in the console.
Action<string, IEnumerable<VBuffer<float>>> printHelper = (columnName, column) =>
{
Console.WriteLine($"{columnName} column obtained post-transformation.");
foreach (var featureRow in column)
{
foreach (var value in featureRow.Values)
Console.Write($"{value} ");
Console.WriteLine("");
}

Console.WriteLine("===================================================");
};

// Preview of the DefaultTextFeatures column obtained after processing the input.
var defaultColumn = transformedData_default.GetColumn<VBuffer<float>>(ml, defaultColumnName);
printHelper(defaultColumnName, defaultColumn);

// DefaultTextFeatures
// 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.4472136 0.4472136 0.4472136 0.4472136 0.4472136
// 0.2357023 0.2357023 0.2357023 0.2357023 0.4714046 0.2357023 0.2357023 0.2357023 0.2357023 0.2357023 0.2357023 0.2357023 0.2357023 0.2357023 0.2357023 0.5773503 0.5773503 0.5773503 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.1924501 0.4472136 0.4472136 0.4472136 0.4472136 0.4472136
// 0 0.1230915 0.1230915 0.1230915 0.1230915 0.246183 0.246183 0.246183 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.1230915 0 0 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.3692745 0.246183 0.246183 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.246183 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.1230915 0.2886751 0 0 0 0 0 0 0 0.2886751 0.5773503 0.2886751 0.2886751 0.2886751 0.2886751 0.2886751 0.2886751

// Preview of the CustomizedTextFeatures column obtained after processing the input.
var customizedColumn = transformedData_customized.GetColumn<VBuffer<float>>(ml, customizedColumnName);
printHelper(customizedColumnName, customizedColumn);

// CustomizedTextFeatures
// 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.4472136 0.4472136 0.4472136 0.4472136 0.4472136
// 0.25 0.25 0.25 0.25 0.5 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.7071068 0.7071068 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.4472136 0.4472136 0.4472136 0.4472136 0.4472136
// 0 0.125 0.125 0.125 0.125 0.25 0.25 0.25 0.125 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.125 0.125 0.125 0.125 0.125 0.125 0.375 0.25 0.25 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.25 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.2672612 0.5345225 0 0 0 0 0 0.2672612 0.5345225 0.2672612 0.2672612 0.2672612 0.2672612
}
}
}
5 changes: 2 additions & 3 deletions docs/samples/Microsoft.ML.Samples/Program.cs
@@ -2,14 +2,13 @@
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

- namespace Microsoft.ML.Samples
+ namespace Microsoft.ML.Samples.Dynamic
{
internal static class Program
{
static void Main(string[] args)
{
- Trainers.SdcaRegression();
- Transformers.ConcatEstimator();
+ TransformSamples.MinMaxNormalizer();
}
}
}
@@ -3,7 +3,7 @@
// See the LICENSE file in the project root for more information.

// the alignment of the usings with the methods is intentional so they can display on the same level in the docs site.
- using Microsoft.ML.Runtime.Api;
+ using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Runtime.Data;
using Microsoft.ML.StaticPipe;
using System;
@@ -12,15 +12,15 @@
// NOTE: WHEN ADDING TO THE FILE, ALWAYS APPEND TO THE END OF IT.
// If you change the existing content, check that the files referencing it in the XML documentation are still correct, as they reference
// line by line.
- namespace Microsoft.ML.Samples
+ namespace Microsoft.ML.Samples.Static
{
- public static class Transformers
+ public partial class TransformSamples
{

/// <summary>
/// The example for the statically typed concat estimator.
/// </summary>
- public static void ConcatEstimator()
+ public static void ConcatWith()
{
// Create a new environment for ML.NET operations. It can be used for exception tracking and logging,
// as well as the source of randomness.