-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Schema based text loader #1878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Schema based text loader #1878
Conversation
public long Label; | ||
|
||
[Column(ordinal: "1-784")] | ||
[LoadColumn(range: "1-784")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LoadColumn [](start = 13, length = 10)
I think we need some good word smith here.
It's probably slightly better than just column, but still confusing, especially next to that ColumnName attribute few lines below.
Can it be TextLoader.MarkingColumn? TextLoader.MappingColumn, something related to textloader since it's used only for Textloader (right?)
#Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is called LoadColumn, if you are referring to the new attribute.
In reply to: 241852931 [](ancestors = 241852931)
bool supportSparse = TextLoader.DefaultArguments.AllowSparse, | ||
bool trimWhitespace = TextLoader.DefaultArguments.TrimWhitespace) | ||
{ | ||
var userType = typeof(TInput); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you describe what this function is doing? I'd like to see something ComputeOutputSchema
in this PR where high-level description is added to header and low-level descriptions are inline to explain high-level concept in detail. #Resolved
Reading the description of both #561 and #1515, i prototyped around having more attribute parameters for the LoadColumn attribute, to get closed to the TextLoader.Range. I tried to eliminate the need for the existing Ordinal string by adding a but the scenario that we can't do is multiple ranges. i.e. 1-4, 5-6 One idea on how to convert that would be to provide an array where the users would specify the indexes with the convention that they should be treated in groups; example: new [] { 1, 3, 5, 7} would translate to 1-3, 5-7 but that seems UGLY. Another idea is to let the user specify multiple attributes for the range, and have the column name specified separately. Example:
But, IMHO, the range regex beats this too from the prospective of simplicity and ease of use. So to summarize, i propose: The construction of the TextLoader columns from the attributes is not final. I will complete after getting a final word here. @eerhardt, @TomFinley, @Zruty0, @CESARDELATORRE #Resolved |
Decision is to break down the range/ordinal. If more granularity is needed, the users can use the text loader columns. #RESOLVED #Resolved |
{ | ||
public ColumnAttribute(string ordinal, string name = null) | ||
|
||
public LoadColumnAttribute(int ordinal, string name = null, bool loadAllOthers = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public [](start = 8, length = 6)
XMl docs on all ctors and samples for all ctors #Resolved
public string Start { get; } | ||
public string End { get; } | ||
public int[] ColumnsIndexes { get; } | ||
public int[] InverseColumnsIndexes { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed, remove #Resolved
@@ -68,30 +68,95 @@ public VectorTypeAttribute(params int[] dims) | |||
/// column encapsulates. | |||
/// </summary> | |||
[AttributeUsage(AttributeTargets.Field | AttributeTargets.Property, AllowMultiple = false, Inherited = true)] | |||
public sealed class ColumnAttribute : Attribute | |||
public sealed class LoadColumnAttribute : Attribute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LoadColumnAttribute [](start = 24, length = 19)
This attribute should go next to TextLoader
, not as part of this class. #Closed
/// <param name="ordinal">The index of the column in the text file.</param> | ||
/// <param name="name">The optional name of the column, if it should be different from the name of the field or property where the attribute is positioned.</param> | ||
/// <param name="loadAllOthers">Wheather to load all columns, besides the one specified in the ordinal.</param> | ||
public LoadColumnAttribute(int ordinal, string name = null, bool loadAllOthers = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ordinal [](start = 39, length = 7)
how about index
, or columnIndex
? The word ordinal
does sort-of mean the same thing, but it's the only place in the codebase that we use the word #Closed
/// <param name="ordinal">The index of the column in the text file.</param> | ||
/// <param name="name">The optional name of the column, if it should be different from the name of the field or property where the attribute is positioned.</param> | ||
/// <param name="loadAllOthers">Wheather to load all columns, besides the one specified in the ordinal.</param> | ||
public LoadColumnAttribute(int ordinal, string name = null, bool loadAllOthers = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name [](start = 55, length = 4)
can we remove name
? This duplicates the functionality of ColumnName
for no good reason #Closed
/// <param name="end">The ending column index, for the range.</param> | ||
/// <param name="name">The optional name of the column, if it should be different from the name of the field or property where the attribute is positioned.</param> | ||
/// <param name="columnIndexes">Distinct text file column indces to load as part of this column.</param> | ||
public LoadColumnAttribute(string start, string end = null, string name = null, int[] columnIndexes = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string [](start = 35, length = 6)
start and end should be integers here. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't have nullable ints. we can't leave 'end' as Auto, if i flip to ints. that ok?
In reply to: 242669241 [](ancestors = 242669241)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to preserve the '10-*' logic, I would go with a different attribute for it ([LoadAllColumns(start: 10)]
or something like that).
In reply to: 242741038 [](ancestors = 242741038,242669241)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the '*' is going to be as useful in API as it is in cmdline.
In reply to: 243120230 [](ancestors = 243120230,242741038,242669241)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <param name="end">The ending column index, for the range.</param> | ||
/// <param name="name">The optional name of the column, if it should be different from the name of the field or property where the attribute is positioned.</param> | ||
/// <param name="columnIndexes">Distinct text file column indces to load as part of this column.</param> | ||
public LoadColumnAttribute(string start, string end = null, string name = null, int[] columnIndexes = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
columnIndexes [](start = 94, length = 13)
I think this parameter is more misleading than it is helpful. Can you remove it? #Closed
/// </summary> | ||
/// <param name="columnIndexes">Distinct text file column indces to load as part of this column.</param> | ||
/// <param name="name">The optional name of the column, if it should be different from the name of the field or property where the attribute is positioned.</param> | ||
public LoadColumnAttribute(int[] columnIndexes, string name = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
columnIndexes [](start = 41, length = 13)
columnIndices
#Closed
var mappingAttr = memberInfo.GetCustomAttribute<LoadColumnAttribute>(); | ||
var mptr = memberInfo.GetCustomAttributes(); | ||
|
||
Contracts.Assert(mappingAttr != null, $"Field or property {memberInfo.Name} is missing the LoadColumn attribute"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LoadColumn [](start = 107, length = 10)
maybe nameof
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolving the questions, since the discussion has moved to the other comment.
In reply to: 243040913 [](ancestors = 243040913,242670646)
var mappingAttr = memberInfo.GetCustomAttribute<LoadColumnAttribute>(); | ||
var mptr = memberInfo.GetCustomAttributes(); | ||
|
||
Contracts.Assert(mappingAttr != null, $"Field or property {memberInfo.Name} is missing the LoadColumn attribute"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert [](start = 26, length = 6)
it should be Check
if the user can hit it #Closed
/// if one of the row contains "5 2:6 4:3" that's mean there are 5 columns all zero | ||
/// except for 3rd and 5th columns which have values 6 and 3</param> | ||
/// <param name="trimWhitespace">Remove trailing whitespace from lines</param> | ||
public static TextLoader CreateTextReader<TInput>(this DataOperations catalog, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateTextReader [](start = 33, length = 16)
This entire code should move out of the catalog and into a static method of TextLoader
. The catalogs in general should not contain any meaningful code, they are there to expose the existing public interface in a consistent 'folder-like' structure. #Closed
Contracts.Assert(mappingAttr != null, $"Field or property {memberInfo.Name} is missing the LoadColumn attribute"); | ||
|
||
var column = new TextLoader.Column(); | ||
column.Name = mappingAttr.Name ?? memberInfo.Name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name [](start = 23, length = 4)
check for ColumnName
instead #Closed
/// <param name="ordinal">The index of the column in the text file.</param> | ||
/// <param name="name">The optional name of the column, if it should be different from the name of the field or property where the attribute is positioned.</param> | ||
/// <param name="loadAllOthers">Wheather to load all columns, besides the one specified in the ordinal.</param> | ||
public LoadColumnAttribute(int ordinal, string name = null, bool loadAllOthers = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loadAllOthers [](start = 73, length = 13)
that's a misleading name. Maybe invert
or complement
is more appropriate. But, as I've said offline, I think we can safely get rid of this functionality altogether, it's better than having to explain it :) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think this will be useful when the user wants to load all features into one column, and the label in another.
we'd spare them having to look up their cols cardinality, and a concat.
In reply to: 242673339 [](ancestors = 242673339)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -29,7 +29,7 @@ void New_DecomposableTrainAndPredict() | |||
var dataPath = GetDataPath(TestDatasets.irisData.trainFilename); | |||
var ml = new MLContext(); | |||
|
|||
var data = ml.Data.CreateTextReader(TestDatasets.irisData.GetLoaderColumns(), separatorChar: ',') | |||
var data = ml.Data.CreateTextReader<IrisData>(separator: ',') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateTextReader [](start = 31, length = 16)
maybe ReadFromTextFile
here and everywhere in these New_
tests (where appropriate)? #Closed
I think we should modify the cookbook samples to reflect this as a suggested way to read text files. #Closed |
Maybe not all of the examples, but there should be some that use the class definition and not In reply to: 448340735 [](ancestors = 448340735) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🕐
docs/code/MlNetCookBook.md
Outdated
[LoadColumn(3)] | ||
public string MaritalStatus { get; set; } | ||
|
||
public string[] AllFeatures { get; set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AllFeatures [](start = 17, length = 11)
will this work? I am suspecting that reading this will throw #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does work because i changed the logic. See my questions about whether we should process only the annotated members/field, to make the data models more usable.
In reply to: 243104636 [](ancestors = 243104636)
var data = mlContext.Data.ReadFromTextFile<InspectedRow>(dataPath, | ||
// First line of the file is a header, not a data row. | ||
hasHeader: true | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 1, length = 1)
please avoid tabs in this file #Resolved
/// <param name="start">The starting column index, for the range.</param> | ||
/// <param name="end">The ending column index, for the range.</param> | ||
public LoadColumnAttribute(int start, int end) | ||
: this(start.ToString()) //REVIEW this is incorrect, but it is just temporary there, until the Legacy API's TextLoader gets deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//REVIEW this is incorrect, but it is just temporary there, until the Legacy API's TextLoader gets deleted. [](start = 38, length = 108)
// REVIEW:
Also place it somewhere at the beginning of the line #Resolved
var mappingAttr = memberInfo.GetCustomAttribute<LoadColumnAttribute>(); | ||
|
||
if(mappingAttr == null) | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, no longer throwing? I'm not sure I'm a fan of it. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomFinley @glebuk what do you think?
does it make more sense to continue that to throw? Maybe not all the properties of the data model will be decorated with LoadColumn. If we throw, we restrict usage by expecting the class to be dedicated to just loading the data, whereas if we don't the type can be used for other thing, and the decorated fields can be used for loading.
In reply to: 243120594 [](ancestors = 243120594)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you Senja, is there some benefit with throwing?
In reply to: 243358643 [](ancestors = 243358643,243120594)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we throw, we restrict usage by expecting the class to be dedicated to just loading the data, whereas if we don't the type can be used for other thing, and the decorated fields can be used for loading.
Interesting @sfilipi. So, it strikes me that there are two possible dangers we must guard against.
- A user attempts to use a class to define a schema, without proper attribution by mistake. We know this happens a lot.
- A user attempts to use a class to define a schema, with proper attribution, but wants to have some additional "extra" properties free for no particular reason.
Like I said I know 1 comes up, and throwing has helped people to guard against misuse. 2 has not been possible for a while now -- in fact, AFAIK was never possible. In fact, if we consider all prior usage of the API (where we follow this pattern), and all prior usage of the text loader, and I think of all the questions that we've gotten about them over the years, both outside of Microsoft in ML.NET and inside Microsoft with our internal tool, I don't recall that anyone has ever run into a situation where this presented a serious barrier to them. (But of course I am not omniscient, and while I read I think most user feedback, I have not read all of it. Have we received requests for this? I am finding it difficult to imagine a concrete scenario.)
Even if the scenario were somehow to come up, I would prefer to solve it by introducing another attribute to indicate that this property is not to be considered by the loader, rather than to have this behavior. So, something like, [NoLoader]
or whatnot. (I don't insist on the name. I also don't want us to add it now, until there's a concrete request for it, since my suspicion is that it will never come up.)
For this reason, I disagree with @sfilipi and @singlis and agree with @Zruty0. It must throw. #Resolved
@@ -22,7 +22,7 @@ public void New_TrainWithInitialPredictor() | |||
|
|||
var ml = new MLContext(seed: 1, conc: 1); | |||
|
|||
var data = ml.Data.CreateTextReader(TestDatasets.Sentiment.GetLoaderColumns(), hasHeader: true).Read(GetDataPath(TestDatasets.Sentiment.trainFilename)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetLoaderColumns [](start = 71, length = 16)
do we still use GetLoaderColumns()
anywhere? Maybe we can delete the method? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
positioning comments in the LoadColumnAttribute.
namespace Microsoft.ML.Data | ||
{ | ||
#pragma warning disable 618 | ||
/// <summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the warning disable needed? I would document why its disabled. #Resolved
/// Initializes new instance of <see cref="LoadColumnAttribute"/>. | ||
/// </summary> | ||
/// <param name="columnIndexes">Distinct text file column indices to load as part of this column.</param> | ||
// REVIEW: Calling the private constructor with just the columnIndexes[0] parameter, is incorrect, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can all the "REVIEW" items be tagged with a github issue? Just thinking that maybe easier to track when the Legacy API's TextLoader is deleted. #Resolved
|
||
[Argument(ArgumentType.AtMostOnce, | ||
HelpText = "Number of source columns in the text data. Default is that sparse rows contain their size information.", | ||
ShortName = "size")] | ||
public int? InputSize; | ||
|
||
[Argument(ArgumentType.AtMostOnce, Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly, HelpText = "Source column separator. Options: tab, space, comma, single character", ShortName = "sep")] | ||
public string Separator = "tab"; | ||
public string Separator = "tab"; //DefaultArguments.Separator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DefaultArguments [](start = 47, length = 16)
Is this comment needed? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe just two default arguments Separator and SeparatorChars? Two different params doesnt guarantee things to be in sync. Im assuming since these are argruments that they can be set by the user -- so what happens if a user specifies Separator to be space, but doesnt set the separator chars?
In reply to: 243411408 [](ancestors = 243411408)
Column[] columns, bool hasHeader = false, char separatorChar = '\t', IMultiStreamSource dataSample = null) | ||
TextLoader.Column[] columns, | ||
bool hasHeader = false, | ||
char separatorChar = '\t', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separatorChar = TextLoader.DefaultArguments.Separator? #Resolved
public static void SaveAsText(this DataOperations catalog, | ||
IDataView data, | ||
Stream stream, | ||
char separator = '\t', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DefaultArguments.Separator. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial work to address #561.
It moves the current reflection-based logic outside of the legacy project, into the TextLoader catalog, creating an API for it.
Decouples the LoadColumn and Column attributes.