Clean up of TextLoader constructor #1784
/// <param name="env">The environment to use.</param>
/// <param name="columns">Defines a mapping between input columns in the file and IDataView columns.</param>
/// <param name="hasHeader">Whether the file has a header.</param>
/// <param name="separatorChars">Defines the characters used as separators between data points in a row. By default the tab character is taken as separator.</param>
Re "By default the tab character": this statement together with char[] separatorChars = null is a bit weird.
I know that later down the line we probably check for null in separators and use tab as the default, but still. #Resolved
Yes it is not ideal, that's why I have an explanation in the documentation above. But if you have a better idea, I would be happy to take it!
In reply to: 237683995
// We read the first 11 values as a single float vector.
new TextLoader.Column("FeatureVector", DataKind.R4, 0, 10),

// Separately, read the target variable.
new TextLoader.Column("Target", DataKind.R4, 11),
},
// First line of the file is a header, not a data row.
HasHeader = true,
true,
I'd still qualify the argument with hasHeader: here. #Resolved
Apply this everywhere - especially in docs/samples/etc.
In reply to: 238006767
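The suggestion in this thread is to pass the boolean as a named argument. A minimal, self-contained sketch of the difference at the call site follows. `LoaderSketch.Describe` is a hypothetical stand-in written for this illustration, not the TextLoader API; only the parameter names `hasHeader` and `separatorChar` are taken from this PR.

```csharp
using System;

public static class LoaderSketch
{
    // Hypothetical stand-in for a loader factory; NOT the ML.NET API.
    public static string Describe(string path, bool hasHeader = false, char separatorChar = '\t')
    {
        return $"{path}: header={hasHeader}, separator={(int)separatorChar}";
    }
}

public static class Program
{
    public static void Main()
    {
        // A bare positional 'true' says nothing about what it controls.
        Console.WriteLine(LoaderSketch.Describe("data.csv", true, ';'));

        // Named arguments make the intent explicit; the call is otherwise identical.
        Console.WriteLine(LoaderSketch.Describe("data.csv", hasHeader: true, separatorChar: ';'));
    }
}
```

Both calls bind to the same parameters, so qualifying the argument only affects readability, which is why it matters most in docs and samples.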
// Default separator is tab, but we need a semicolon.
Separator = ";"
});
new[] { ';' }
Isn't the single separator the more common case? Maybe the "simple" constructor just takes a single character separator. And the "advanced" case can support multiple separators. #Resolved
/// <param name="catalog">The catalog.</param>
/// <param name="args">The arguments to text reader, describing the data schema.</param>
/// <param name="dataSample">The optional location of a data sample.</param>
public static TextLoader TextReader(this DataOperations catalog,
Yesterday in the API design review, we decided against this approach. See the notes from the review here: https://github.com/dotnet/apireviews/pull/81/files
But basically, the pattern decided on will be:
1. Make a simple constructor/factory that has the most common parameters.
2. If there are advanced parameters that we don't want exposed in item 1, then make another constructor/factory that takes only the nested Arguments class (to be renamed to "Options").
We are going to move away from the Action<Arguments> advancedSettings approach. One main reason is that there can be conflicts between the "simple" parameters and the "advanced" parameters, and which one should win? Another reason is that it is simpler and more understandable to construct an object and pass it to a method.
I'd say, for this change, let's not move away from where we are going. You don't need to rename Arguments to Options. But let's leave this overload, and remove the "advancedSettings" parameter below instead. #Resolved
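As a rough illustration of the pattern described in this comment, the sketch below pairs a "simple" constructor with an overload that takes only an options object. `SketchLoader`, `LoaderOptions`, and their members are hypothetical names invented for this example; they are not the ML.NET types, and the real signatures may differ.

```csharp
using System;

// Hypothetical options class; stands in for the nested Arguments/Options class.
public sealed class LoaderOptions
{
    public bool HasHeader { get; set; } = false;
    public char[] Separators { get; set; } = new[] { '\t' };
    public bool AllowQuoting { get; set; } = true; // example of an "advanced" setting
}

// Hypothetical loader; NOT the ML.NET TextLoader.
public sealed class SketchLoader
{
    public LoaderOptions Options { get; }

    // 1. Simple constructor: only the most common parameters.
    public SketchLoader(bool hasHeader = false, char separatorChar = '\t')
        : this(new LoaderOptions { HasHeader = hasHeader, Separators = new[] { separatorChar } })
    {
    }

    // 2. Advanced constructor: takes only the options object, so there is
    //    no second set of parameters that could conflict with it.
    public SketchLoader(LoaderOptions options)
    {
        Options = options ?? throw new ArgumentNullException(nameof(options));
    }
}

public static class Program
{
    public static void Main()
    {
        var simple = new SketchLoader(hasHeader: true, separatorChar: ';');
        var advanced = new SketchLoader(new LoaderOptions
        {
            Separators = new[] { ',', ';' },
            AllowQuoting = false
        });

        Console.WriteLine(simple.Options.Separators.Length);
        Console.WriteLine(advanced.Options.Separators.Length);
    }
}
```

Because each overload has exactly one source of settings, the question of whether a simple parameter or an advanced setting "wins" never arises.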
I have left the overload as you suggested and removed the advanced arguments parameter from the constructors.
In reply to: 238009119
Thanks @eerhardt! I'll update accordingly. #Resolved
Sorry for the "late breaking" change here, but I thought it would be good to note where we landed yesterday, and not have to redo some of this work. #Resolved
I am doing the changes and it looks a lot better; it's good to see that!
In reply to: 443354726
/// <param name="advancedSettings">The delegate to set additional settings</param>
/// <param name="path">The path to the file</param>
/// <param name="hasHeader">Whether the file has a header.</param>
/// <param name="separatorChar"> The character used as separator between data points in a row. By default the tab character is used as separator.</param>
Extra space, here and in the constructor above. #Resolved

var env = catalog.GetEnvironment();

// REVIEW: it is almost always a mistake to have a 'trainable' text loader here.
Re "// REVIEW: it is almost always a mistake to have a 'trainable' text loader here": does it work well if you set header to true and don't pass dataSample? #Resolved
Yes, check this test: public void CustomTransformer()
In reply to: 239161201
@@ -283,8 +283,7 @@ internal static void SaveRoleMappings(IHostEnvironment env, IChannel ch, RoleMap
 {
 // REVIEW: Should really validate the schema here, and consider
 // ignoring this stream if it isn't as expected.
-var loader = TextLoader.ReadFile(env, new TextLoader.Arguments(),
-new RepositoryStreamWrapper(rep, DirTrainingInfo, RoleMappingFile));
+var loader = TextLoader.ReadFile(env, new RepositoryStreamWrapper(rep, DirTrainingInfo, RoleMappingFile), new TextLoader.Arguments());
Re ", new TextLoader.Arguments()": can we change the method so that this argument defaults to null, and use new TextLoader.Arguments() when it is null? #Resolved
Sure, I can do that. Do you think I should change that in the constructor of TextLoader too?
In reply to: 239162303
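The request in this thread (an optional argument that defaults to null and falls back to a fresh default object) can be sketched as follows. `ReadOptions` and `ReaderSketch.EffectiveSeparator` are hypothetical names used only for illustration; they are not part of TextLoader.

```csharp
using System;

// Hypothetical options type standing in for TextLoader.Arguments.
public sealed class ReadOptions
{
    public char Separator { get; set; } = '\t';
}

public static class ReaderSketch
{
    // 'options' is optional; callers that are happy with the defaults omit it.
    public static char EffectiveSeparator(string path, ReadOptions options = null)
    {
        // Fall back to a default-constructed options object when none is given.
        options = options ?? new ReadOptions();
        return options.Separator;
    }
}

public static class Program
{
    public static void Main()
    {
        // No options passed: the default (tab) is used.
        Console.WriteLine((int)ReaderSketch.EffectiveSeparator("data.tsv"));

        // Explicit options still work as before.
        Console.WriteLine((int)ReaderSketch.EffectiveSeparator("data.csv", new ReadOptions { Separator = ',' }));
    }
}
```

With this shape, call sites such as the one in the diff above would no longer need to spell out a default-constructed arguments object.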
@@ -64,30 +64,29 @@ public void TrainSentiment()
 {
 var env = new MLContext(seed: 1);
 // Pipeline
-var loader = TextLoader.ReadFile(env,
-new TextLoader.Arguments()
+var arguemnts = new TextLoader.Arguments()
Re "arguemnts": msitype #Resolved
@@ -369,7 +364,11 @@ public TermLookupTransformer(IHostEnvironment env, IDataView input, IDataView lo
var txtArgs = new TextLoader.Arguments();
bool parsed = CmdParser.ParseArguments(host, "col=Term:TX:0 col=Value:TX:1", txtArgs);
you don't need this anymore #Resolved
@@ -595,7 +595,7 @@ public void RankingLightGBMTest()
 public void TestTreeEnsembleCombiner()
 {
 var dataPath = GetDataPath("breast-cancer.txt");
-var dataView = TextLoader.Create(Env, new TextLoader.Arguments(), new MultiFileSource(dataPath));
+var dataView = TextLoader.ReadFile(Env, new MultiFileSource(dataPath), new TextLoader.Arguments());
Re "ReadFile": can you use ReadTextFile instead? #ByDesign
Isn't "text" implied by the name of TextLoader? #Resolved
@@ -617,7 +617,7 @@ public void TestTreeEnsembleCombiner()
 public void TestTreeEnsembleCombinerWithCategoricalSplits()
 {
 var dataPath = GetDataPath("adult.tiny.with-schema.txt");
-var dataView = TextLoader.Create(Env, new TextLoader.Arguments(), new MultiFileSource(dataPath));
+var dataView = TextLoader.ReadFile(Env, new MultiFileSource(dataPath), new TextLoader.Arguments());
Re "ReadFile": ReadTextFile #ByDesign
Isn't "text" implied by the name of TextLoader? #Resolved
@@ -438,7 +438,7 @@ protected void VerifyArgParsing(IHostEnvironment env, string[] strs)

 // Note that we don't pass in "args", but pass in a default args so we test
 // the auto-schema parsing.
-var loadedData = TextLoader.ReadFile(env, new TextLoader.Arguments(), new MultiFileSource(pathData));
+var loadedData = TextLoader.ReadFile(env, new MultiFileSource(pathData), new TextLoader.Arguments());
Re "ReadFile": ReadTextFile? #ByDesign
Isn't "text" implied by the name of TextLoader? #Resolved
I agree with Eric: since it's a method of TextLoader, it should already be implied that it loads a text file. #Resolved
var reader = mlContext.Data.TextReader(new TextLoader.Arguments
{
Column = new[] {
var reader = mlContext.Data.TextReader(new[] {
Re "TextReader": please update https://github.com/dotnet/machinelearning/blob/master/docs/code/MlNetCookBook.md with your changes. #Resolved
/// <param name="args">The arguments to text reader, describing the data schema.</param>
/// <param name="columns">The columns of the schema.</param>
/// <param name="hasHeader">Whether the file has a header.</param>
/// <param name="separatorChar">The character used as separator between data points in a row. By default the tab character is used as separator.</param>
/// <param name="dataSample">The optional location of a data sample.</param>
public static TextLoader TextReader(this DataOperations catalog,
I think we want these named CreateTextReader. See https://github.com/dotnet/apireviews/pull/81/files #Resolved
/// <param name="hasHeader">Whether the file has a header.</param>
/// <param name="separatorChar"> The character used as separator between data points in a row. By default the tab character is used as separator.</param>
/// <param name="fileSource">Specifies a file from which to read.</param>
public static IDataView ReadFile(IHostEnvironment env, IMultiStreamSource fileSource, Column[] columns, bool hasHeader = false, char separatorChar = '\t')
Why do we have both IDataView ReadFromTextFile(this DataOperations catalog, and these methods? I think we should only have one. #Resolved
@eerhardt I just removed the ReadFile method from TextLoader, as you suggested. #Resolved
I am actually updating the cookbook again, to reflect that and the new name for the MLContext extension.
In reply to: 444982206
{
var result = new Arguments { Column = columns };
advancedSettings?.Invoke(result);
separatorChars = separatorChars ?? new[] { '\t' };
(nit) separatorChars can never be null, right? This is a private method and it is only called in one spot, which ensures it won't be null.
Maybe just add an Assert that it won't be null, and then you can remove the null check here. #Resolved
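A minimal sketch of this nit: apply the default once at the public boundary, and let the private helper assert the invariant instead of re-checking it. The names `SeparatorSketch`, `Describe`, and `DescribeCore` are hypothetical, and `Debug.Assert` from `System.Diagnostics` is used here as a stand-in for whatever assertion helper the codebase actually uses.

```csharp
using System;
using System.Diagnostics;

public static class SeparatorSketch
{
    // Public entry point and the only caller of the helper:
    // it is the single place where the default separator is applied.
    public static string Describe(char[] separatorChars = null)
    {
        return DescribeCore(separatorChars ?? new[] { '\t' });
    }

    // Private helper: the caller guarantees a non-null array, so the
    // invariant is documented with an assert rather than a second null check.
    private static string DescribeCore(char[] separatorChars)
    {
        Debug.Assert(separatorChars != null);
        // Render each separator as its character code, comma-joined.
        return string.Join(",", Array.ConvertAll(separatorChars, c => ((int)c).ToString()));
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(SeparatorSketch.Describe());                  // default separator
        Console.WriteLine(SeparatorSketch.Describe(new[] { ';', ',' })); // explicit separators
    }
}
```

Keeping the fallback in one place means the default cannot drift between the public method and its helper.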
var loader = TextLoader.ReadFile(env, new TextLoader.Arguments(),
new RepositoryStreamWrapper(rep, DirTrainingInfo, RoleMappingFile));
var loader = new TextLoader(env, dataSample: new RepositoryStreamWrapper(rep, DirTrainingInfo, RoleMappingFile))
.Read(new RepositoryStreamWrapper(rep, DirTrainingInfo, RoleMappingFile));
We shouldn't create two instances of new RepositoryStreamWrapper(rep, DirTrainingInfo, RoleMappingFile). #Resolved
It may still be valuable to have an internal TextLoader.ReadFile helper method for our internal code.
In reply to: 239856914
new TextLoader.Column("Term", DataKind.TX, 0),
new TextLoader.Column("Value", DataKind.TX, 1)
},
dataSample: new MultiFileSource(filename)
Same comment here about creating duplicate objects: new MultiFileSource(filename). #Resolved
Looks good. Just a couple minor comments to clean up.
Fixes #1611.
TextLoader that takes Arguments, and exposed HasHeader and SeparatorChars as non-advanced parameters.