
Schema based text loader #1878


Merged (12 commits) on Dec 21, 2018
188 changes: 100 additions & 88 deletions docs/code/MlNetCookBook.md
@@ -131,6 +131,44 @@ var reader = mlContext.Data.CreateTextReader(new[] {
var data = reader.Read(dataPath);
```

You can also create a data model class, and read the data based on this type.

```csharp
// The data model. This type will be used throughout the document.
private class InspectedRow
{
[LoadColumn(0)]
public bool IsOver50K { get; set; }

[LoadColumn(1)]
public string Workclass { get; set; }

[LoadColumn(2)]
public string Education { get; set; }

[LoadColumn(3)]
public string MaritalStatus { get; set; }
}

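// This derived type adds the 'AllFeatures' column. It carries no LoadColumn attribute because
// the column is not read from the file; it is produced later by the Concatenate transform.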
private class InspectedRowWithAllFeatures : InspectedRow
{
public string[] AllFeatures { get; set; }
}

// Create a new context for ML.NET operations. It can be used for exception tracking and logging,
// as a catalog of available operations and as the source of randomness.
var mlContext = new MLContext();

// Read the data into a data view.
var data = mlContext.Data.ReadFromTextFile<InspectedRow>(dataPath,
// First line of the file is a header, not a data row.
hasHeader: true
);
```

## How do I load data from multiple files?

You can again use the `TextLoader`, and specify an array of files to its Read method.
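As a minimal sketch (hedged: the `exampleFile1` and `exampleFile2` paths are hypothetical, and the column definitions and reader parameters are assumed to match the adult-data snippets elsewhere in this document), reading several files could look like this:

```csharp
// Define the reader once; every file being read must share this column layout.
var reader = mlContext.Data.CreateTextReader(new[] {
        // A boolean column depicting the 'label'.
        new TextLoader.Column("IsOver50K", DataKind.BL, 0),
        // A text column.
        new TextLoader.Column("Workclass", DataKind.TX, 1)
    },
    hasHeader: true);

// Passing several files (hypothetical paths) to Read concatenates their rows into one data view.
var data = reader.Read(exampleFile1, exampleFile2);
```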
@@ -214,7 +252,7 @@ var reader = mlContext.Data.CreateTextReader(ctx => (
Target: ctx.LoadFloat(11)
),
// Default separator is tab, but we need a comma.
separator: ',');
separatorChar: ',');


// Now read the file (remember though, readers are lazy, so the actual reading will happen when the data is accessed).
@@ -231,17 +269,41 @@ var mlContext = new MLContext();
// Create the reader: define the data columns and where to find them in the text file.
var reader = mlContext.Data.CreateTextReader(new[] {
// We read the first 11 values as a single float vector.
new TextLoader.Column("FeatureVector", DataKind.R4, new[] {new TextLoader.Range(0, 9)}),
new TextLoader.Column("FeatureVector", DataKind.R4, new[] {new TextLoader.Range(0, 10)}),
// Separately, read the target variable.
new TextLoader.Column("Target", DataKind.R4, 10)
new TextLoader.Column("Target", DataKind.R4, 11)
},
// Default separator is tab, but we need a comma.
s => s.Separator = ",");
separatorChar: ',');

// Now read the file (remember though, readers are lazy, so the actual reading will happen when the data is accessed).
var data = reader.Read(dataPath);
```

Alternatively, you can create a data model class for it and read the data based on that type:

```csharp
private class AdultData
{
[LoadColumn("0", "10"), ColumnName("Features")]
public float FeatureVector { get; set; }

[LoadColumn(11)]
public float Target { get; set; }
}

// Create a new context for ML.NET operations. It can be used for exception tracking and logging,
// as a catalog of available operations and as the source of randomness.
var mlContext = new MLContext();

// Read the data into a data view.
var data = mlContext.Data.ReadFromTextFile<AdultData>(dataPath,
// Default separator is tab, but we need a comma.
separatorChar: ','
);

```

## How do I debug my experiment or preview my pipeline?

Most ML.NET operations are 'lazy': they are not actually processing data, they just validate that the operation is possible, and then defer execution until the output data is actually requested. This provides good efficiency, but makes it hard to step through and debug the experiment.
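As a quick sketch (hedged: it reuses the `data` view and the `InspectedRow` columns loaded earlier, plus the `GetColumn` extension that appears later in this section), you can force execution by pulling a handful of values into memory:

```csharp
// Materializing a column triggers the actual file read, so you can inspect real values in the debugger.
var first10Labels = data.GetColumn<bool>(mlContext, "IsOver50K")
    .Take(10)
    .ToArray();
```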
@@ -325,7 +387,7 @@ var transformedData = dataPipeline.Fit(data).Transform(data);
// 'transformedData' is a 'promise' of data. Let's actually read it.
var someRows = transformedData.AsDynamic
// Convert to an enumerable of user-defined type.
.AsEnumerable<InspectedRow>(mlContext, reuseRowObject: false)
.AsEnumerable<InspectedRowWithAllFeatures>(mlContext, reuseRowObject: false)
// Take a couple values as an array.
.Take(4).ToArray();

@@ -342,33 +404,14 @@ var sameFeatureColumns = dynamicData.GetColumn<string[]>(mlContext, "AllFeatures
.Take(20).ToArray();
```

The above code assumes that we defined our `InspectedRow` class as follows:
```csharp
private class InspectedRow
{
public bool IsOver50K;
public string Workclass;
public string Education;
public string MaritalStatus;
public string[] AllFeatures;
}
```

You can also use the dynamic API to create the equivalent of the previous pipeline.
```csharp
// Create a new context for ML.NET operations. It can be used for exception tracking and logging,
// as a catalog of available operations and as the source of randomness.
var mlContext = new MLContext();

// Create the reader: define the data columns and where to find them in the text file.
var reader = mlContext.Data.CreateTextReader(new[] {
// A boolean column depicting the 'label'.
new TextLoader.Column("IsOver50K", DataKind.BL, 0),
// Three text columns.
new TextLoader.Column("Workclass", DataKind.TX, 1),
new TextLoader.Column("Education", DataKind.TX, 2),
new TextLoader.Column("MaritalStatus", DataKind.TX, 3)
},
// Read the data into a data view.
var data = mlContext.Data.ReadFromTextFile<InspectedRow>(dataPath,
// First line of the file is a header, not a data row.
hasHeader: true
);
@@ -377,17 +420,13 @@ var reader = mlContext.Data.CreateTextReader(new[] {
// together into one.
var dynamicPipeline = mlContext.Transforms.Concatenate("AllFeatures", "Education", "MaritalStatus");

// Let's verify that the data has been read correctly.
// First, we read the data file.
var data = reader.Read(dataPath);

// Fit our data pipeline and transform data with it.
var transformedData = dynamicPipeline.Fit(data).Transform(data);

// 'transformedData' is a 'promise' of data. Let's actually read it.
var someRows = transformedData
// Convert to an enumerable of user-defined type.
.AsEnumerable<InspectedRow>(mlContext, reuseRowObject: false)
.AsEnumerable<InspectedRowWithAllFeatures>(mlContext, reuseRowObject: false)
// Take a couple values as an array.
.Take(4).ToArray();

@@ -431,7 +470,7 @@ var reader = mlContext.Data.CreateTextReader(ctx => (
// The data file has a header.
hasHeader: true,
// Default separator is tab, but we need a semicolon.
separator: ';');
separatorChar: ';');


// Now read the file (remember though, readers are lazy, so the actual reading will happen when the data is accessed).
@@ -476,22 +515,12 @@ var mlContext = new MLContext();

// Step one: read the data as an IDataView.
// First, we define the reader: specify the data columns and where to find them in the text file.
var reader = mlContext.Data.CreateTextReader(new[] {
// We read the first 11 values as a single float vector.
new TextLoader.Column("FeatureVector", DataKind.R4, 0, 10),

// Separately, read the target variable.
new TextLoader.Column("Target", DataKind.R4, 11),
},
// Read the data into a data view. Remember though, readers are lazy, so the actual reading will happen when the data is accessed.
var trainData = mlContext.Data.ReadFromTextFile<AdultData>(dataPath,
// First line of the file is a header, not a data row.
hasHeader: true,
// Default separator is tab, but we need a semicolon.
separatorChar: ';'
separatorChar: ','
);

// Now read the file (remember though, readers are lazy, so the actual reading will happen when the data is accessed).
var trainData = reader.Read(trainDataPath);

// Sometimes, caching data in memory after its first access can save some loading time when the data is going to be used
// several times somewhere. The caching mechanism is also lazy; it only caches things after being used.
// The user can replace all subsequent uses of "trainData" with "cachedTrainData". We still use "trainData" because
@@ -537,7 +566,10 @@ var metrics = mlContext.Regression.Evaluate(model.Transform(testData), label: r
Calculating the metrics with the dynamic API is as follows.
```csharp
// Read the test dataset.
var testData = reader.Read(testDataPath);
var testData = mlContext.Data.ReadFromTextFile<AdultData>(testDataPath,
// Default separator is tab, but we need a comma.
separatorChar: ','
);
// Calculate metrics of the model on the test data.
var metrics = mlContext.Regression.Evaluate(model.Transform(testData), label: "Target");
```
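For illustration only (a hedged sketch: the `Rms` and `RSquared` property names are assumed for the metrics object returned by `Evaluate` in this API version), you could then report the results:

```csharp
// Print a short summary of how well the model fits the held-out test data.
Console.WriteLine($"RMS: {metrics.Rms}");
Console.WriteLine($"R^2: {metrics.RSquared}");
```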
@@ -605,7 +637,7 @@ var reader = mlContext.Data.CreateTextReader(ctx => (
Label: ctx.LoadText(4)
),
// Default separator is tab, but the dataset has comma.
separator: ',');
separatorChar: ',');

// Retrieve the training data.
var trainData = reader.Read(irisDataPath);
@@ -644,29 +676,19 @@ You can also use the dynamic API to create the equivalent of the previous pipeli
var mlContext = new MLContext();

// Step one: read the data as an IDataView.
// First, we define the reader: specify the data columns and where to find them in the text file.
var reader = mlContext.Data.CreateTextReader(new[] {
new TextLoader.Column("SepalLength", DataKind.R4, 0),
new TextLoader.Column("SepalWidth", DataKind.R4, 1),
new TextLoader.Column("PetalLength", DataKind.R4, 2),
new TextLoader.Column("PetalWidth", DataKind.R4, 3),
// Label: kind of iris.
new TextLoader.Column("Label", DataKind.TX, 4),
},
// Retrieve the training data.
var trainData = mlContext.Data.ReadFromTextFile<IrisInput>(irisDataPath,
// Default separator is tab, but the dataset has comma.
separatorChar: ','
);

// Retrieve the training data.
var trainData = reader.Read(irisDataPath);

// Build the training pipeline.
var dynamicPipeline =
// Concatenate all the features together into one column 'Features'.
mlContext.Transforms.Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
// Note that the label is text, so it needs to be converted to key.
.Append(mlContext.Transforms.Categorical.MapValueToKey("Label"), TransformerScope.TrainTest)
// Cache data in moemory for steps after the cache check point stage.
// Cache data in memory for steps after the cache check point stage.
.AppendCacheCheckpoint(mlContext)
// Use the multi-class SDCA model to predict the label using features.
.Append(mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent())
@@ -821,7 +843,7 @@ var reader = mlContext.Data.CreateTextReader(ctx => (
Label: ctx.LoadText(4)
),
// Default separator is tab, but the dataset has comma.
separator: ',');
separatorChar: ',');

// Retrieve the training data.
var trainData = reader.Read(dataPath);
@@ -914,7 +936,7 @@ var reader = mlContext.Data.CreateTextReader(ctx => (
Label: ctx.LoadText(4)
),
// Default separator is tab, but the dataset has comma.
separator: ',');
separatorChar: ',');

// Read the training data.
var trainData = reader.Read(dataPath);
@@ -937,24 +959,27 @@ var meanVarValues = normalizedData.GetColumn(r => r.MeanVarNormalized).ToArray()

You can achieve the same results using the dynamic API.
```csharp
// Data model for the Iris dataset.
private class IrisInputAllFeatures
{
// Unfortunately, we still need the dummy 'Label' column to be present.
[ColumnName("Label"), LoadColumn(4)]
public string IgnoredLabel { get; set; }

[LoadColumn(4, loadAllOthers:true)]
public float Features { get; set; }
}

// Create a new context for ML.NET operations. It can be used for exception tracking and logging,
// as a catalog of available operations and as the source of randomness.
var mlContext = new MLContext();

// Define the reader: specify the data columns and where to find them in the text file.
var reader = mlContext.Data.CreateTextReader(new[] {
// The four features of the Iris dataset will be grouped together as one Features column.
new TextLoader.Column("Features", DataKind.R4, 0, 3),
// Label: kind of iris.
new TextLoader.Column("Label", DataKind.TX, 4),
},
// Read the training data.
var trainData = mlContext.Data.ReadFromTextFile<IrisInputAllFeatures>(dataPath,
// Default separator is tab, but the dataset has comma.
separatorChar: ','
);

// Read the training data.
var trainData = reader.Read(dataPath);

// Apply all kinds of standard ML.NET normalization to the raw features.
var pipeline =
mlContext.Transforms.Normalize(
@@ -1270,7 +1295,7 @@ var reader = mlContext.Data.CreateTextReader(ctx => (
Label: ctx.LoadText(4)
),
// Default separator is tab, but the dataset has comma.
separator: ',');
separatorChar: ',');

// Read the data.
var data = reader.Read(dataPath);
@@ -1315,24 +1340,11 @@ You can achieve the same results using the dynamic API.
var mlContext = new MLContext();

// Step one: read the data as an IDataView.
// First, we define the reader: specify the data columns and where to find them in the text file.
var reader = mlContext.Data.CreateTextReader(new[]
{
// We read the first 11 values as a single float vector.
new TextLoader.Column("SepalLength", DataKind.R4, 0),
new TextLoader.Column("SepalWidth", DataKind.R4, 1),
new TextLoader.Column("PetalLength", DataKind.R4, 2),
new TextLoader.Column("PetalWidth", DataKind.R4, 3),
// Label: kind of iris.
new TextLoader.Column("Label", DataKind.TX, 4),
},
var data = mlContext.Data.ReadFromTextFile<IrisInput>(dataPath,
// Default separator is tab, but the dataset has comma.
separatorChar: ','
);

// Read the data.
var data = reader.Read(dataPath);

// Build the training pipeline.
var dynamicPipeline =
// Concatenate all the features together into one column 'Features'.
@@ -1390,7 +1402,7 @@ var reader = mlContext.Data.CreateTextReader(ctx => (
Label: ctx.LoadText(4)
),
// Default separator is tab, but the dataset has comma.
separator: ',');
separatorChar: ',');

// Read the data.
var data = reader.Read(dataPath);
13 changes: 0 additions & 13 deletions src/Microsoft.ML.Data/Data/SchemaDefinition.cs
@@ -73,25 +73,12 @@ public sealed class ColumnAttribute : Attribute
public ColumnAttribute(string ordinal, string name = null)
{
Name = name;
Ordinal = ordinal;
}

/// <summary>
/// Column name.
/// </summary>
public string Name { get; }

/// <summary>
/// Contains positions of indices of source columns in the form
/// of ranges. Examples of range: if we want to include just column
/// with index 1 we can write the range as 1, if we want to include
/// columns 1 to 10 then we can write the range as 1-10 and we want to include all the
/// columns from column with index 1 until end then we can write 1-*.
///
/// This takes sequence of ranges that are comma seperated, example:
/// 1,2-5,10-*
/// </summary>
public string Ordinal { get; }
}

/// <summary>