Schema based text loader #1878
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -131,6 +131,39 @@ var reader = mlContext.Data.CreateTextReader(new[] { | |
var data = reader.Read(dataPath); | ||
``` | ||
|
||
You can also create a data model class, and read the data based on this type. | ||
|
||
```csharp | ||
// The data model. This type will be used through the document. | ||
private class InspectedRow | ||
{ | ||
[LoadColumn(0)] | ||
public bool IsOver50K { get; set; } | ||
|
||
[LoadColumn(1)] | ||
public string Workclass { get; set; } | ||
|
||
[LoadColumn(2)] | ||
public string Education { get; set; } | ||
|
||
[LoadColumn(3)] | ||
public string MaritalStatus { get; set; } | ||
|
||
public string[] AllFeatures { get; set; } | ||
} | ||
|
||
// Create a new context for ML.NET operations. It can be used for exception tracking and logging, | ||
// as a catalog of available operations and as the source of randomness. | ||
var mlContext = new MLContext(); | ||
|
||
// Read the data into a data view. | ||
var data = mlContext.Data.ReadFromTextFile<InspectedRow>(dataPath, | ||
// First line of the file is a header, not a data row. | ||
hasHeader: true | ||
); | ||
please avoid tabs in this file #Resolved |
||
|
||
``` | ||
|
||
## How do I load data from multiple files? | ||
|
||
You can again use the `TextLoader`, and pass an array of files to its `Read` method. | ||
|
@@ -231,9 +264,9 @@ var mlContext = new MLContext(); | |
// Create the reader: define the data columns and where to find them in the text file. | ||
var reader = mlContext.Data.CreateTextReader(new[] { | ||
// We read the first 10 values as a single float vector. | ||
new TextLoader.Column("FeatureVector", DataKind.R4, new[] {new TextLoader.Range(0, 9)}), | ||
new TextLoader.Column("FeatureVector", DataKind.R4, new[] {new TextLoader.Range(0, 10)}), | ||
// Separately, read the target variable. | ||
new TextLoader.Column("Target", DataKind.R4, 10) | ||
new TextLoader.Column("Target", DataKind.R4, 11) | ||
}, | ||
// Default separator is tab, but we need a comma. | ||
s => s.Separator = ","); | ||
|
@@ -242,6 +275,30 @@ var reader = mlContext.Data.CreateTextReader(new[] { | |
var data = reader.Read(dataPath); | ||
``` | ||
|
||
Or by creating a data model for it: | ||
|
||
```csharp | ||
private class AdultData | ||
{ | ||
[LoadColumn(0, 10), ColumnName("Features")] | ||
public float FeatureVector { get; } | ||
|
||
[LoadColumn(11)] | ||
public float Target { get; } | ||
} | ||
|
||
// Create a new context for ML.NET operations. It can be used for exception tracking and logging, | ||
// as a catalog of available operations and as the source of randomness. | ||
var mlContext = new MLContext(); | ||
|
||
// Read the data into a data view. | ||
var data = mlContext.Data.ReadFromTextFile<AdultData>(dataPath, | ||
// Default separator is tab, but we need a comma. | ||
separator: ',' | ||
); | ||
|
||
``` | ||
|
||
## How do I debug my experiment or preview my pipeline? | ||
|
||
Most ML.NET operations are 'lazy': they are not actually processing data, they just validate that the operation is possible, and then defer execution until the output data is actually requested. This provides good efficiency, but makes it hard to step through and debug the experiment. | ||
|
@@ -342,33 +399,14 @@ var sameFeatureColumns = dynamicData.GetColumn<string[]>(mlContext, "AllFeatures | |
.Take(20).ToArray(); | ||
``` | ||
|
||
The above code assumes that we defined our `InspectedRow` class as follows: | ||
```csharp | ||
private class InspectedRow | ||
{ | ||
public bool IsOver50K; | ||
public string Workclass; | ||
public string Education; | ||
public string MaritalStatus; | ||
public string[] AllFeatures; | ||
} | ||
``` | ||
|
||
You can also use the dynamic API to create the equivalent of the previous pipeline. | ||
```csharp | ||
// Create a new context for ML.NET operations. It can be used for exception tracking and logging, | ||
// as a catalog of available operations and as the source of randomness. | ||
var mlContext = new MLContext(); | ||
|
||
// Create the reader: define the data columns and where to find them in the text file. | ||
var reader = mlContext.Data.CreateTextReader(new[] { | ||
// A boolean column depicting the 'label'. | ||
new TextLoader.Column("IsOver50K", DataKind.BL, 0), | ||
// Three text columns. | ||
new TextLoader.Column("Workclass", DataKind.TX, 1), | ||
new TextLoader.Column("Education", DataKind.TX, 2), | ||
new TextLoader.Column("MaritalStatus", DataKind.TX, 3) | ||
}, | ||
// Read the data into a data view. | ||
var data = mlContext.Data.ReadFromTextFile<InspectedRow>(dataPath, | ||
// First line of the file is a header, not a data row. | ||
hasHeader: true | ||
); | ||
|
@@ -377,10 +415,6 @@ var reader = mlContext.Data.CreateTextReader(new[] { | |
// together into one. | ||
var dynamicPipeline = mlContext.Transforms.Concatenate("AllFeatures", "Education", "MaritalStatus"); | ||
|
||
// Let's verify that the data has been read correctly. | ||
// First, we read the data file. | ||
var data = reader.Read(dataPath); | ||
|
||
// Fit our data pipeline and transform data with it. | ||
var transformedData = dynamicPipeline.Fit(data).Transform(data); | ||
|
||
|
@@ -476,22 +510,12 @@ var mlContext = new MLContext(); | |
|
||
// Step one: read the data as an IDataView. | ||
// First, we define the reader: specify the data columns and where to find them in the text file. | ||
var reader = mlContext.Data.CreateTextReader(new[] { | ||
// We read the first 11 values as a single float vector. | ||
new TextLoader.Column("FeatureVector", DataKind.R4, 0, 10), | ||
|
||
// Separately, read the target variable. | ||
new TextLoader.Column("Target", DataKind.R4, 11), | ||
}, | ||
// First line of the file is a header, not a data row. | ||
hasHeader: true, | ||
// Default separator is tab, but we need a semicolon. | ||
separatorChar: ';' | ||
// Read the data into a data view. Remember though, readers are lazy, so the actual reading will happen when the data is accessed. | ||
var trainData = mlContext.Data.ReadFromTextFile<AdultData>(dataPath, | ||
// Default separator is tab, but we need a comma. | ||
separator: ',' | ||
); | ||
|
||
// Now read the file (remember though, readers are lazy, so the actual reading will happen when the data is accessed). | ||
var trainData = reader.Read(trainDataPath); | ||
|
||
// Sometimes, caching data in memory after its first access can save some loading time when the data is going to be used | ||
// several times. The caching mechanism is also lazy; it only caches things after they are used. | ||
// The user can replace all subsequent uses of "trainData" with "cachedTrainData". We still use "trainData" because | ||
|
@@ -537,7 +561,10 @@ var metrics = mlContext.Regression.Evaluate(model.Transform(testData), label: r | |
Calculating the metrics with the dynamic API is as follows. | ||
```csharp | ||
// Read the test dataset. | ||
var testData = reader.Read(testDataPath); | ||
var testData = mlContext.Data.ReadFromTextFile<AdultData>(testDataPath, | ||
// Default separator is tab, but we need a comma. | ||
separator: ',' | ||
); | ||
// Calculate metrics of the model on the test data. | ||
var metrics = mlContext.Regression.Evaluate(model.Transform(testData), label: "Target"); | ||
``` | ||
|
@@ -644,29 +671,19 @@ You can also use the dynamic API to create the equivalent of the previous pipeli | |
var mlContext = new MLContext(); | ||
|
||
// Step one: read the data as an IDataView. | ||
// First, we define the reader: specify the data columns and where to find them in the text file. | ||
var reader = mlContext.Data.CreateTextReader(new[] { | ||
new TextLoader.Column("SepalLength", DataKind.R4, 0), | ||
new TextLoader.Column("SepalWidth", DataKind.R4, 1), | ||
new TextLoader.Column("PetalLength", DataKind.R4, 2), | ||
new TextLoader.Column("PetalWidth", DataKind.R4, 3), | ||
// Label: kind of iris. | ||
new TextLoader.Column("Label", DataKind.TX, 4), | ||
}, | ||
// Retrieve the training data. | ||
var trainData = mlContext.Data.ReadFromTextFile<IrisInput>(irisDataPath, | ||
// Default separator is tab, but the dataset has comma. | ||
separatorChar: ',' | ||
separator: ',' | ||
); | ||
|
||
// Retrieve the training data. | ||
var trainData = reader.Read(irisDataPath); | ||
|
||
// Build the training pipeline. | ||
var dynamicPipeline = | ||
// Concatenate all the features together into one column 'Features'. | ||
mlContext.Transforms.Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth") | ||
// Note that the label is text, so it needs to be converted to key. | ||
.Append(mlContext.Transforms.Categorical.MapValueToKey("Label"), TransformerScope.TrainTest) | ||
// Cache data in moemory for steps after the cache check point stage. | ||
// Cache data in memory for steps after the cache check point stage. | ||
.AppendCacheCheckpoint(mlContext) | ||
// Use the multi-class SDCA model to predict the label using features. | ||
.Append(mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent()) | ||
|
@@ -937,24 +954,27 @@ var meanVarValues = normalizedData.GetColumn(r => r.MeanVarNormalized).ToArray() | |
|
||
You can achieve the same results using the dynamic API. | ||
```csharp | ||
//data model for the Iris class | ||
private class IrisInputAllFeatures | ||
{ | ||
// Unfortunately, we still need the dummy 'Label' column to be present. | ||
[ColumnName("Label"), LoadColumn(4)] | ||
public string IgnoredLabel { get; set; } | ||
|
||
[LoadColumn(4, loadAllOthers:true)] | ||
public float Features { get; set; } | ||
} | ||
|
||
// Create a new context for ML.NET operations. It can be used for exception tracking and logging, | ||
// as a catalog of available operations and as the source of randomness. | ||
var mlContext = new MLContext(); | ||
|
||
// Define the reader: specify the data columns and where to find them in the text file. | ||
var reader = mlContext.Data.CreateTextReader(new[] { | ||
// The four features of the Iris dataset will be grouped together as one Features column. | ||
new TextLoader.Column("Features", DataKind.R4, 0, 3), | ||
// Label: kind of iris. | ||
new TextLoader.Column("Label", DataKind.TX, 4), | ||
}, | ||
// Read the training data. | ||
var trainData = mlContext.Data.ReadFromTextFile<IrisInputAllFeatures>(dataPath, | ||
// Default separator is tab, but the dataset has comma. | ||
separatorChar: ',' | ||
separator: ',' | ||
); | ||
|
||
// Read the training data. | ||
var trainData = reader.Read(dataPath); | ||
|
||
// Apply all kinds of standard ML.NET normalization to the raw features. | ||
var pipeline = | ||
mlContext.Transforms.Normalize( | ||
|
@@ -1315,24 +1335,11 @@ You can achieve the same results using the dynamic API. | |
var mlContext = new MLContext(); | ||
|
||
// Step one: read the data as an IDataView. | ||
// First, we define the reader: specify the data columns and where to find them in the text file. | ||
var reader = mlContext.Data.CreateTextReader(new[] | ||
{ | ||
// We read the first 11 values as a single float vector. | ||
new TextLoader.Column("SepalLength", DataKind.R4, 0), | ||
new TextLoader.Column("SepalWidth", DataKind.R4, 1), | ||
new TextLoader.Column("PetalLength", DataKind.R4, 2), | ||
new TextLoader.Column("PetalWidth", DataKind.R4, 3), | ||
// Label: kind of iris. | ||
new TextLoader.Column("Label", DataKind.TX, 4), | ||
}, | ||
var data = mlContext.Data.ReadFromTextFile<IrisInput>(dataPath, | ||
// Default separator is tab, but the dataset has comma. | ||
separatorChar: ',' | ||
separator: ',' | ||
); | ||
|
||
// Read the data. | ||
var data = reader.Read(dataPath); | ||
|
||
// Build the training pipeline. | ||
var dynamicPipeline = | ||
// Concatenate all the features together into one column 'Features'. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
// Licensed to the .NET Foundation under one or more agreements. | ||
// The .NET Foundation licenses this file to you under the MIT license. | ||
// See the LICENSE file in the project root for more information. | ||
|
||
using Microsoft.ML.Runtime.Data; | ||
using System; | ||
using System.Collections.Generic; | ||
|
||
namespace Microsoft.ML.Data | ||
{ | ||
#pragma warning disable 618 | ||
/// <summary> | ||
Why is the warning disable needed? I would document why it's disabled. #Resolved |
||
/// Describes column information such as name and the source columns indices that this | ||
/// column encapsulates. | ||
/// </summary> | ||
[AttributeUsage(AttributeTargets.Field | AttributeTargets.Property, AllowMultiple = false, Inherited = true)] | ||
public sealed class LoadColumnAttribute : Attribute | ||
{ | ||
/// <summary> | ||
/// Initializes new instance of <see cref="LoadColumnAttribute"/>. | ||
/// </summary> | ||
/// <param name="columnIndex">The index of the column in the text file.</param> | ||
public LoadColumnAttribute(int columnIndex) | ||
: this(columnIndex.ToString()) | ||
{ | ||
Sources.Add(new TextLoader.Range(columnIndex)); | ||
} | ||
|
||
/// <summary> | ||
/// Initializes new instance of <see cref="LoadColumnAttribute"/>. | ||
/// </summary> | ||
/// <param name="start">The starting column index, for the range.</param> | ||
/// <param name="end">The ending column index, for the range.</param> | ||
public LoadColumnAttribute(int start, int end) | ||
: this(start.ToString()) //REVIEW this is incorrect, but it is just temporary there, until the Legacy API's TextLoader gets deleted. | ||
|
||
{ | ||
Sources.Add(new TextLoader.Range(start, end)); | ||
} | ||
|
||
/// <summary> | ||
/// Initializes new instance of <see cref="LoadColumnAttribute"/>. | ||
/// </summary> | ||
/// <param name="columnIndexes">Distinct text file column indices to load as part of this column.</param> | ||
public LoadColumnAttribute(int[] columnIndexes) | ||
: this(columnIndexes[0].ToString()) // REVIEW: this is incorrect, but it is just temporary there, until the Legacy API's TextLoader gets deleted. | ||
{ | ||
foreach (var col in columnIndexes) | ||
Sources.Add(new TextLoader.Range(col)); | ||
} | ||
|
||
[Obsolete("Should be deleted together with the Legacy project.")] | ||
private LoadColumnAttribute(string start) | ||
{ | ||
Sources = new List<TextLoader.Range>(); | ||
Start = start; | ||
} | ||
|
||
internal List<TextLoader.Range> Sources; | ||
|
||
[Obsolete("Should be deleted together with the Legacy project.")] | ||
[BestFriend] | ||
internal string Start { get; } | ||
} | ||
#pragma warning restore 618 | ||
} |
Will this work? I suspect that reading this will throw. #Pending
It does work because I changed the logic. See my questions about whether we should process only the annotated members/fields, to make the data models more usable.
In reply to: 243104636
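
To make the "process only the annotated members" question concrete, here is a minimal sketch of what such a schema walk could look like. It is not the logic in this PR: `ColumnIndexAttribute`, `Row`, and `SchemaSketch.Describe` are hypothetical names made up for illustration, and the attribute is a simplified stand-in for `LoadColumnAttribute` with no dependency on ML.NET. The sketch assumes only public instance fields and properties are considered.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for LoadColumnAttribute (not the ML.NET type):
// it records a single column index or an inclusive range of indices.
[AttributeUsage(AttributeTargets.Field | AttributeTargets.Property, AllowMultiple = false, Inherited = true)]
public sealed class ColumnIndexAttribute : Attribute
{
    public ColumnIndexAttribute(int index) { Start = index; End = index; }
    public ColumnIndexAttribute(int start, int end) { Start = start; End = end; }

    public int Start { get; }
    public int End { get; }
}

// Hypothetical data model: two annotated members and one unannotated member.
public class Row
{
    [ColumnIndex(0)]
    public bool IsOver50K { get; set; }

    [ColumnIndex(1)]
    public string Workclass { get; set; }

    // Not annotated, so the walk below skips it.
    public string[] AllFeatures { get; set; }
}

public static class SchemaSketch
{
    // Collects (name, start, end) for annotated public fields and properties only.
    public static (string Name, int Start, int End)[] Describe(Type type)
    {
        return type.GetMembers(BindingFlags.Public | BindingFlags.Instance)
            .Where(m => m is FieldInfo || m is PropertyInfo)
            .Select(m => (Member: m, Attr: m.GetCustomAttribute<ColumnIndexAttribute>()))
            .Where(x => x.Attr != null)
            .Select(x => (Name: x.Member.Name, Start: x.Attr.Start, End: x.Attr.End))
            .OrderBy(c => c.Start)
            .ToArray();
    }

    public static void Main()
    {
        // Lists IsOver50K and Workclass; AllFeatures is left out of the schema.
        foreach (var (name, start, end) in Describe(typeof(Row)))
            Console.WriteLine($"{name}: {start}-{end}");
    }
}
```

With this approach, a member such as `AllFeatures` that is produced later by a transform can stay on the data model without the loader trying to read it from the file, which is the usability point raised in the reply above.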