Skip to content

Schema based text loader #1878

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Dec 21, 2018
104 changes: 92 additions & 12 deletions src/Microsoft.ML.Data/Data/SchemaDefinition.cs
Original file line number Diff line number Diff line change
Expand Up @@ -68,30 +68,110 @@ public VectorTypeAttribute(params int[] dims)
/// column encapsulates.
/// </summary>
[AttributeUsage(AttributeTargets.Field | AttributeTargets.Property, AllowMultiple = false, Inherited = true)]
public sealed class ColumnAttribute : Attribute
public sealed class LoadColumnAttribute : Attribute
Copy link
Contributor

@Zruty0 Zruty0 Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoadColumnAttribute [](start = 24, length = 19)

This attribute should go next to TextLoader, not as part of this class. #Closed

{
public ColumnAttribute(string ordinal, string name = null)

public LoadColumnAttribute(int ordinal, string name = null, bool loadAllOthers = false)
Copy link
Member Author

@sfilipi sfilipi Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public [](start = 8, length = 6)

XMl docs on all ctors and samples for all ctors #Resolved

Copy link
Contributor

@Zruty0 Zruty0 Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordinal [](start = 39, length = 7)

how about index, or columnIndex? The word ordinal does sort-of mean the same thing, but it's the only place in the codebase that we use the word #Closed

Copy link
Contributor

@Zruty0 Zruty0 Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name [](start = 55, length = 4)

can we remove name? This duplicates the functionality of ColumnName for no good reason #Closed

Copy link
Contributor

@Zruty0 Zruty0 Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loadAllOthers [](start = 73, length = 13)

that's a misleading name. Maybe invert or complement is more appropriate. But, as I've said offline, I think we can safely get rid of this functionality altogether, it's better than having to explain it :) #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this will be useful when the user wants to load all features into one column, and the label in another.

we'd spare them having to look up their cols cardinality, and a concat.


In reply to: 242673339 [](ancestors = 242673339)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not enough reason, in my opinion


In reply to: 242741558 [](ancestors = 242741558,242673339)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed


In reply to: 243121936 [](ancestors = 243121936,242741558,242673339)

{
Start = ordinal.ToString();
Sources = new List<TextLoader.Range>();
var range = new TextLoader.Range(ordinal);
range.AllOther = loadAllOthers;
Sources.Add(range);
}

public LoadColumnAttribute(string start, string end = null, string name = null, int[] columnIndexes = null)
Copy link
Contributor

@Zruty0 Zruty0 Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string [](start = 35, length = 6)

start and end should be integers here. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't have nullable ints. we can't leave 'end' as Auto, if i flip to ints. that ok?


In reply to: 242669241 [](ancestors = 242669241)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to preserve the '10-*' logic, I would go with a different attribute for it ([LoadAllColumns(start: 10)] or something like that).


In reply to: 242741038 [](ancestors = 242741038,242669241)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the '*' is going to be as useful in API as it is in cmdline.


In reply to: 243120230 [](ancestors = 243120230,242741038,242669241)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove


In reply to: 243120291 [](ancestors = 243120291,243120230,242741038,242669241)

Copy link
Contributor

@Zruty0 Zruty0 Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columnIndexes [](start = 94, length = 13)

I think this parameter is more misleading than it is helpful. Can you remove it? #Closed

{
Name = name;
Ordinal = ordinal;
Start = start;
End = end;
ColumnIndexes = columnIndexes;

Sources = new List<TextLoader.Range>();

bool hasEnd = int.TryParse(end, out int endIndex);
var range = hasEnd ? new TextLoader.Range(int.Parse(start), endIndex) : new TextLoader.Range(int.Parse(start));
Sources.Add(range);

if (columnIndexes != null)
{
foreach (var col in columnIndexes)
Sources.Add(new TextLoader.Range(col));
}
}

// REVIEW : AllOther seems to work only for a single column. Verify.
public LoadColumnAttribute(string start, string end, string name = null, bool loadInverseRange = false)
{
Name = name;
LoadInverseRange = loadInverseRange;
Start = start;
End = end;

Sources = new List<TextLoader.Range>();
var range = new TextLoader.Range(int.Parse(start), int.Parse(end));
range.AllOther = loadInverseRange;
Sources.Add(range);
}

public LoadColumnAttribute(int[] columnIndexes, string name = null)
{
Name = name;
ColumnIndexes = columnIndexes;

Sources = new List<TextLoader.Range>();
foreach (var col in columnIndexes)
Sources.Add(new TextLoader.Range(col));
}

internal List<TextLoader.Range> Sources;

/// <summary>
/// Column name.
/// </summary>
public string Name { get; }

/// <summary>
/// Contains positions of indices of source columns in the form
/// of ranges. Examples of range: if we want to include just column
/// with index 1 we can write the range as 1, if we want to include
/// columns 1 to 10 then we can write the range as 1-10 and we want to include all the
/// columns from column with index 1 until end then we can write 1-*.
///
/// This takes sequence of ranges that are comma seperated, example:
/// 1,2-5,10-*
/// The optional start index for loading a contiguous range of columns, or the single index in the case
/// of loading a single column.
/// Either this parameters, or the <see cref="ColumnIndexes"/> should be specified.
/// </summary>
public string Start { get; }

/// <summary>
/// Optional field, used to set the dataset columns range end index when loading a range of columns.
/// </summary>
public string Ordinal { get; }
public string End { get; }

/// <summary>
/// Optional field used to specify the distinct indices of the dataset columns that need to be loaded, and mapped to this
/// <see cref="TextLoader.Column"/>.
/// </summary>
public int[] ColumnIndexes { get; }

/// <summary>
/// If this is set to true, the columns defined in the range through either the <see cref="Start"/>, <see cref="End"/> or the
/// <see cref="ColumnIndexes"/> will be excluded from loading, and all the other ones will loaded and mapped to the <see cref="TextLoader.Column"/>.
/// </summary>
public bool LoadInverseRange { get; }
}

/// <summary>
/// Describes column information such as name and the source columns indicies that this
/// column encapsulates.
/// </summary>
[AttributeUsage(AttributeTargets.Field | AttributeTargets.Property, AllowMultiple = false, Inherited = true)]
public sealed class ColumnAttribute : Attribute
{
public ColumnAttribute(string ordinal, string name = null)
{
Name = name;
}

/// <summary>
/// Column name.
/// </summary>
public string Name { get; }
}

/// <summary>
Expand Down
19 changes: 14 additions & 5 deletions src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs
Original file line number Diff line number Diff line change
Expand Up @@ -343,28 +343,28 @@ public class ArgumentsCore
" missing value and an empty value is denoted by \"\". When false, consecutive separators" +
" denote an empty value.",
ShortName = "quote")]
public bool AllowQuoting = true;
public bool AllowQuoting = DefaultArguments.AllowQuoting;

[Argument(ArgumentType.AtMostOnce, HelpText = "Whether the input may include sparse representations", ShortName = "sparse")]
public bool AllowSparse = true;
public bool AllowSparse = DefaultArguments.AllowSparse;

[Argument(ArgumentType.AtMostOnce,
HelpText = "Number of source columns in the text data. Default is that sparse rows contain their size information.",
ShortName = "size")]
public int? InputSize;

[Argument(ArgumentType.AtMostOnce, Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly, HelpText = "Source column separator. Options: tab, space, comma, single character", ShortName = "sep")]
public string Separator = "tab";
public string Separator = "tab"; //DefaultArguments.Separator
Copy link
Member

@singlis singlis Dec 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultArguments [](start = 47, length = 16)

Is this comment needed? #Resolved

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe just two default arguments Separator and SeparatorChars? Two different params doesnt guarantee things to be in sync. Im assuming since these are argruments that they can be set by the user -- so what happens if a user specifies Separator to be space, but doesnt set the separator chars?


In reply to: 243411408 [](ancestors = 243411408)


[Argument(ArgumentType.AtMostOnce, Name = nameof(Separator), Visibility = ArgumentAttribute.VisibilityType.EntryPointsOnly, HelpText = "Source column separator.", ShortName = "sep")]
public char[] SeparatorChars = new[] { '\t' };
public char[] SeparatorChars = new[] { DefaultArguments.Separator };

[Argument(ArgumentType.Multiple, HelpText = "Column groups. Each group is specified as name:type:numeric-ranges, eg, col=Features:R4:1-17,26,35-40",
ShortName = "col", SortOrder = 1)]
public Column[] Column;

[Argument(ArgumentType.AtMostOnce, HelpText = "Remove trailing whitespace from lines", ShortName = "trim")]
public bool TrimWhitespace;
public bool TrimWhitespace = DefaultArguments.TrimWhitespace;

[Argument(ArgumentType.AtMostOnce, ShortName = "header",
HelpText = "Data file has header with feature names. Header is read only if options 'hs' and 'hf' are not specified.")]
Expand Down Expand Up @@ -392,6 +392,15 @@ public sealed class Arguments : ArgumentsCore
public long? MaxRows;
}

internal static class DefaultArguments
{
internal const bool AllowQuoting = true;
internal const bool AllowSparse = true;
internal const char Separator = '\t';
internal const bool HasHeader = false;
internal const bool TrimWhitespace = false;
}

/// <summary>
/// Used as an input column range.
/// A variable length segment (extending to the end of the input line) is represented by Lim == SrcLim.
Expand Down
91 changes: 86 additions & 5 deletions src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderSaverCatalog.cs
Original file line number Diff line number Diff line change
Expand Up @@ -2,12 +2,16 @@
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using Microsoft.ML.Data;
using Microsoft.ML.Runtime;
using Microsoft.ML.Runtime.Data;
using Microsoft.ML.Runtime.Data.IO;
using System;
using System.Collections.Generic;
using System.IO;
using static Microsoft.ML.Runtime.Data.TextLoader;
using System.Linq;
using System.Reflection;
using System.Text.RegularExpressions;

namespace Microsoft.ML
{
Expand All @@ -22,7 +26,7 @@ public static class TextLoaderSaverCatalog
/// <param name="separatorChar">The character used as separator between data points in a row. By default the tab character is used as separator.</param>
/// <param name="dataSample">The optional location of a data sample.</param>
public static TextLoader CreateTextReader(this DataOperations catalog,
Column[] columns, bool hasHeader = false, char separatorChar = '\t', IMultiStreamSource dataSample = null)
TextLoader.Column[] columns, bool hasHeader = false, char separatorChar = '\t', IMultiStreamSource dataSample = null)
=> new TextLoader(CatalogUtils.GetEnvironment(catalog), columns, hasHeader, separatorChar, dataSample);

/// <summary>
Expand All @@ -31,9 +35,86 @@ public static TextLoader CreateTextReader(this DataOperations catalog,
/// <param name="catalog">The catalog.</param>
/// <param name="args">Defines the settings of the load operation.</param>
/// <param name="dataSample">Allows to expose items that can be used for reading.</param>
public static TextLoader CreateTextReader(this DataOperations catalog, Arguments args, IMultiStreamSource dataSample = null)
public static TextLoader CreateTextReader(this DataOperations catalog, TextLoader.Arguments args, IMultiStreamSource dataSample = null)
=> new TextLoader(CatalogUtils.GetEnvironment(catalog), args, dataSample);

/// <summary>
/// Create a text reader <see cref="TextLoader"/>.
/// </summary>
/// <param name="catalog">The catalog.</param>
/// <param name="hasHeader"></param>
/// <param name="separator"></param>
/// <param name="allowQuotedStrings"></param>
/// <param name="supportSparse"></param>
/// <param name="trimWhitespace"></param>
public static TextLoader CreateTextReader<TInput>(this DataOperations catalog,
Copy link
Contributor

@Zruty0 Zruty0 Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateTextReader [](start = 33, length = 16)

This entire code should move out of the catalog and into a static method of TextLoader. The catalogs in general should not contain any meaningful code, they are there to expose the existing public interface in a consistent 'folder-like' structure. #Closed

bool hasHeader = TextLoader.DefaultArguments.HasHeader,
char separator = TextLoader.DefaultArguments.Separator,
bool allowQuotedStrings = TextLoader.DefaultArguments.AllowQuoting,
bool supportSparse = TextLoader.DefaultArguments.AllowSparse,
bool trimWhitespace = TextLoader.DefaultArguments.TrimWhitespace)
{
var userType = typeof(TInput);
Copy link
Member

@wschin wschin Dec 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe what this function is doing? I'd like to see something ComputeOutputSchema in this PR where high-level description is added to header and low-level descriptions are inline to explain high-level concept in detail. #Resolved


var fieldInfos = userType.GetFields(BindingFlags.Public | BindingFlags.Instance);

var propertyInfos =
userType
.GetProperties(BindingFlags.Public | BindingFlags.Instance)
.Where(x => x.CanRead && x.CanWrite && x.GetGetMethod() != null && x.GetSetMethod() != null && x.GetIndexParameters().Length == 0);

var memberInfos = (fieldInfos as IEnumerable<MemberInfo>).Concat(propertyInfos).ToArray();

var columns = new TextLoader.Column[memberInfos.Length];

for (int index = 0; index < memberInfos.Length; index++)
{
var memberInfo = memberInfos[index];
var mappingAttr = memberInfo.GetCustomAttribute<LoadColumnAttribute>();
var mptr = memberInfo.GetCustomAttributes();

Contracts.Assert(mappingAttr != null, $"Field or property {memberInfo.Name} is missing the LoadColumn attribute");
Copy link
Contributor

@Zruty0 Zruty0 Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoadColumn [](start = 107, length = 10)

maybe nameof ? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this comment to ask a meta quesiton: does it make more sense to continue that to throw? Maybe not all the properties of the data model will be decorated with LoadColumn. @Zruty0, @tomfinly


In reply to: 242670646 [](ancestors = 242670646)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving the questions, since the discussion has moved to the other comment.


In reply to: 243040913 [](ancestors = 243040913,242670646)

Copy link
Contributor

@Zruty0 Zruty0 Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert [](start = 26, length = 6)

it should be Check if the user can hit it #Closed


var column = new TextLoader.Column();
column.Name = mappingAttr.Name ?? memberInfo.Name;
Copy link
Contributor

@Zruty0 Zruty0 Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name [](start = 23, length = 4)

check for ColumnName instead #Closed

column.Source = mappingAttr.Sources.ToArray();
DataKind dk;
switch (memberInfo)
{
case FieldInfo field:
if (!DataKindExtensions.TryGetDataKind(field.FieldType.IsArray ? field.FieldType.GetElementType() : field.FieldType, out dk))
throw Contracts.Except($"Field {memberInfo.Name} is of unsupported type.");

break;

case PropertyInfo property:
if (!DataKindExtensions.TryGetDataKind(property.PropertyType.IsArray ? property.PropertyType.GetElementType() : property.PropertyType, out dk))
throw Contracts.Except($"Property {memberInfo.Name} is of unsupported type.");
break;

default:
Contracts.Assert(false);
throw Contracts.ExceptNotSupp("Expected a FieldInfo or a PropertyInfo");
}

column.Type = dk;

columns[index] = column;
}

TextLoader.Arguments args = new TextLoader.Arguments
{
HasHeader = hasHeader,
SeparatorChars = new[] { separator },
AllowQuoting = allowQuotedStrings,
AllowSparse = supportSparse,
TrimWhitespace = trimWhitespace,
Column = columns
};

return new TextLoader(CatalogUtils.GetEnvironment(catalog), args);
}

/// <summary>
/// Read a data view from a text file using <see cref="TextLoader"/>.
/// </summary>
Expand All @@ -44,7 +125,7 @@ public static TextLoader CreateTextReader(this DataOperations catalog, Arguments
/// <param name="path">The path to the file.</param>
/// <returns>The data view.</returns>
public static IDataView ReadFromTextFile(this DataOperations catalog,
Copy link
Contributor

@Zruty0 Zruty0 Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReadFromTextFile [](start = 32, length = 16)

we should also add a templated version of ReadFromTextFile #Closed

string path, Column[] columns, bool hasHeader = false, char separatorChar = '\t')
string path, TextLoader.Column[] columns, bool hasHeader = false, char separatorChar = '\t')
{
Contracts.CheckNonEmpty(path, nameof(path));

Expand All @@ -62,7 +143,7 @@ public static IDataView ReadFromTextFile(this DataOperations catalog,
/// <param name="catalog">The catalog.</param>
/// <param name="path">Specifies a file from which to read.</param>
/// <param name="args">Defines the settings of the load operation.</param>
public static IDataView ReadFromTextFile(this DataOperations catalog, string path, Arguments args = null)
public static IDataView ReadFromTextFile(this DataOperations catalog, string path, TextLoader.Arguments args = null)
{
Contracts.CheckNonEmpty(path, nameof(path));

Expand Down
12 changes: 6 additions & 6 deletions src/Microsoft.ML.Legacy/Data/TextLoader.cs
Original file line number Diff line number Diff line change
Expand Up @@ -86,20 +86,20 @@ public TextLoader CreateFrom<TInput>(bool useHeader = false,
for (int index = 0; index < memberInfos.Length; index++)
{
var memberInfo = memberInfos[index];
var mappingAttr = memberInfo.GetCustomAttribute<ColumnAttribute>();
var mappingAttr = memberInfo.GetCustomAttribute<LoadColumnAttribute>();
if (mappingAttr == null)
throw Contracts.Except($"Field or property {memberInfo.Name} is missing ColumnAttribute");
throw Contracts.Except($"Field or property {memberInfo.Name} is missing LoadColumnAttributeAttribute");

if (Regex.Match(mappingAttr.Ordinal, @"[^(0-9,\*\-~)]+").Success)
throw Contracts.Except($"{mappingAttr.Ordinal} contains invalid characters. " +
if (Regex.Match(mappingAttr.Start, @"[^(0-9,\*\-~)]+").Success)
throw Contracts.Except($"{mappingAttr.Start} contains invalid characters. " +
$"Valid characters are 0-9, *, - and ~");

var mappingNameAttr = memberInfo.GetCustomAttribute<ColumnNameAttribute>();
var name = mappingAttr.Name ?? mappingNameAttr?.Name ?? memberInfo.Name;

Runtime.Data.TextLoader.Range[] sources;
if (!Runtime.Data.TextLoader.Column.TryParseSourceEx(mappingAttr.Ordinal, out sources))
throw Contracts.Except($"{mappingAttr.Ordinal} could not be parsed.");
if (!Runtime.Data.TextLoader.Column.TryParseSourceEx(mappingAttr.Start, out sources))
throw Contracts.Except($"{mappingAttr.Start} could not be parsed.");

Contracts.Assert(sources != null);

Expand Down
4 changes: 2 additions & 2 deletions test/Microsoft.ML.Benchmarks/PredictionEngineBench.cs
Original file line number Diff line number Diff line change
Expand Up @@ -146,10 +146,10 @@ public void MakeBreastCancerPredictions()

public class SentimentData
{
[ColumnName("Label"), Column("0")]
[ColumnName("Label"), LoadColumn(0)]
public bool Sentiment;

[Column("1")]
[LoadColumn(1)]
public string SentimentText;
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -167,19 +167,19 @@ private void Consume(IEnumerable<IrisPrediction> predictions)

public class IrisData
{
[Column("0")]
[LoadColumn(0)]
public float Label;

[Column("1")]
[LoadColumn(1)]
public float SepalLength;

[Column("2")]
[LoadColumn(2)]
public float SepalWidth;

[Column("3")]
[LoadColumn(3)]
public float PetalLength;

[Column("4")]
[LoadColumn(4)]
public float PetalWidth;
}

Expand Down
Loading