Skip to content

Enable TextLoader to accept new lines in quoted fields #5125

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 32 commits into from
May 19, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
e03b0e4
fix textloader bug on quotes
LittleLittleCloud Dec 17, 2019
3ddb3b0
use static method instead of extension method
LittleLittleCloud Dec 19, 2019
9b20ead
better name
LittleLittleCloud Dec 19, 2019
34d9efa
Merge branch 'master' into u/xiaoyun/fixTextLoaderBug
LittleLittleCloud Apr 15, 2020
ac6f971
Merge remote-tracking branch 'xiaoyun/u/xiaoyun/fixTextLoaderBug' int…
antoniovs1029 May 11, 2020
8304d62
Moved multiline reader to TextLoaderCursor, and added return when mul…
antoniovs1029 May 13, 2020
a9e91e2
Fix issue with empty unquoted new lines
antoniovs1029 May 13, 2020
9cfaee1
Make ReadMultiLine a little bit more efficient
antoniovs1029 May 13, 2020
2827c61
Add clarifying comment in FetchNextField
antoniovs1029 May 13, 2020
a427662
Added test
antoniovs1029 May 13, 2020
1cafbf0
Added test for column inference
antoniovs1029 May 14, 2020
6116d97
Make multilinereader a little bit more efficient
antoniovs1029 May 14, 2020
df2ca25
Create read multilines parameter
antoniovs1029 May 14, 2020
530f41e
Make tests run with readMultilines parameter
antoniovs1029 May 14, 2020
d9af9d2
Remove public parameter for readMultilines, as it is considered a bre…
antoniovs1029 May 15, 2020
6a7b632
Update manifest
antoniovs1029 May 15, 2020
fb6ab28
Update src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs
antoniovs1029 May 15, 2020
f592a7f
Update src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs
antoniovs1029 May 15, 2020
a13b803
Removed new readMultiline option from AutoML.NET and removed the test…
antoniovs1029 May 15, 2020
c2d2ac7
Merge branch 'is4460TextLoaderQuoting' of https://github.com/antoniov…
antoniovs1029 May 15, 2020
13033bf
Actually add new line characters to loaded strings
antoniovs1029 May 16, 2020
2983b06
Actually include new line characters in loaded strings
antoniovs1029 May 16, 2020
9789479
Let the TextSaver also correctly save new lines inside quoted fields
antoniovs1029 May 18, 2020
60c4169
Fixed bug when calling GetSomeLines
antoniovs1029 May 18, 2020
fa28ddd
Throw exception when reaching EOF without ending quote on a given field.
antoniovs1029 May 19, 2020
56279b4
Refactored readMultilines into OptionsFlags
antoniovs1029 May 19, 2020
51d9390
Added test to check new readMultiline option from MAML.
antoniovs1029 May 19, 2020
5cc7512
Merge remote-tracking branch 'upstream/master' into is4460TextLoaderQ…
antoniovs1029 May 19, 2020
5be5f6f
Fixed mistake on an unrelated test
antoniovs1029 May 19, 2020
b4e3029
Added internal default to ReadMultilines
antoniovs1029 May 19, 2020
7e16fc7
Do more checking in MultilineReader in order to be more flexible in a…
antoniovs1029 May 19, 2020
f0652a5
Updated tests
antoniovs1029 May 19, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 22 additions & 7 deletions src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs
Original file line number Diff line number Diff line change
Expand Up @@ -493,14 +493,21 @@ public class Options
/// </summary>
[Argument(ArgumentType.AtMostOnce, ShortName = "header",
HelpText = "Data file has header with feature names. Header is read only if options 'hs' and 'hf' are not specified.")]
public bool HasHeader;
public bool HasHeader = Defaults.HasHeader;

/// <summary>
/// Whether to use separate parsing threads.
/// </summary>
[Argument(ArgumentType.AtMostOnce, HelpText = "Use separate parsing threads?", ShortName = "threads", Hide = true)]
public bool UseThreads = true;

/// <summary>
/// If true, new line characters are acceptable inside a quoted field, and thus one field can have multiple lines of text inside it
/// If <see cref="TextLoader.Options.AllowQuoting"/> is false, this option is ignored.
/// </summary>
[Argument(ArgumentType.AtMostOnce, HelpText = "Escape new line characters inside a quoted field? If AllowQuoting is false, this argument is ignored.", ShortName = "multilines", Hide = true)]
public bool ReadMultilines = Defaults.ReadMultilines;

/// <summary>
/// File containing a header with feature names. If specified, the header defined in the data file is ignored regardless of <see cref="HasHeader"/>.
/// </summary>
Expand Down Expand Up @@ -530,6 +537,7 @@ internal static class Defaults
internal const char Separator = '\t';
internal const bool HasHeader = false;
internal const bool TrimWhitespace = false;
internal const bool ReadMultilines = false;
}

/// <summary>
Expand Down Expand Up @@ -694,11 +702,11 @@ public Bindings(TextLoader parent, Column[] cols, IMultiStreamSource headerFile,
ch.Assert(0 <= inputSize & inputSize < SrcLim);
List<ReadOnlyMemory<char>> lines = null;
if (headerFile != null)
Cursor.GetSomeLines(headerFile, 1, ref lines);
Cursor.GetSomeLines(headerFile, 1, parent.ReadMultilines, parent._separators, ref lines);
if (needInputSize && inputSize == 0)
Cursor.GetSomeLines(dataSample, 100, ref lines);
Cursor.GetSomeLines(dataSample, 100, parent.ReadMultilines, parent._separators, ref lines);
else if (headerFile == null && parent.HasHeader)
Cursor.GetSomeLines(dataSample, 1, ref lines);
Cursor.GetSomeLines(dataSample, 1, parent.ReadMultilines, parent._separators, ref lines);

if (needInputSize && inputSize == 0)
{
Expand Down Expand Up @@ -1055,7 +1063,7 @@ private static VersionInfo GetVersionInfo()
// verWrittenCur: 0x00010009, // Introduced _flags
//verWrittenCur: 0x0001000A, // Added ForceVector in Range
//verWrittenCur: 0x0001000B, // Header now retained if used and present
verWrittenCur: 0x0001000C, // Removed Min and Contiguous from KeyType
verWrittenCur: 0x0001000C, // Removed Min and Contiguous from KeyType, and added ReadMultilines flag to OptionFlags
verReadableCur: 0x0001000A,
verWeCanReadBack: 0x00010009,
loaderSignature: LoaderSignature,
Expand All @@ -1073,8 +1081,8 @@ private enum OptionFlags : uint
HasHeader = 0x02,
AllowQuoting = 0x04,
AllowSparse = 0x08,

All = TrimWhitespace | HasHeader | AllowQuoting | AllowSparse
ReadMultilines = 0x10,
All = TrimWhitespace | HasHeader | AllowQuoting | AllowSparse | ReadMultilines
}

// This is reserved to mean the range extends to the end (the segment is variable).
Expand All @@ -1095,6 +1103,11 @@ private bool HasHeader
get { return (_flags & OptionFlags.HasHeader) != 0; }
}

private bool ReadMultilines
{
get { return (_flags & OptionFlags.ReadMultilines) != 0; }
}

private readonly IHost _host;
private const string RegistrationName = "TextLoader";

Expand Down Expand Up @@ -1147,6 +1160,8 @@ internal TextLoader(IHostEnvironment env, Options options = null, IMultiStreamSo
_flags |= OptionFlags.AllowQuoting;
if (options.AllowSparse)
_flags |= OptionFlags.AllowSparse;
if (options.AllowQuoting && options.ReadMultilines)
_flags |= OptionFlags.ReadMultilines;

// REVIEW: This should be persisted (if it should be maintained).
_maxRows = options.MaxRows ?? long.MaxValue;
Expand Down
198 changes: 191 additions & 7 deletions src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
Expand Down Expand Up @@ -145,7 +146,7 @@ public static DataViewRowCursor Create(TextLoader parent, IMultiStreamSource fil
SetupCursor(parent, active, 0, out srcNeeded, out cthd);
Contracts.Assert(cthd > 0);

var reader = new LineReader(files, BatchSize, 100, parent.HasHeader, parent._maxRows, 1);
var reader = new LineReader(files, BatchSize, 100, parent.HasHeader, parent.ReadMultilines, parent._separators, parent._maxRows, 1);
var stats = new ParseStats(parent._host, 1);
return new Cursor(parent, stats, active, reader, srcNeeded, cthd);
}
Expand All @@ -162,7 +163,7 @@ public static DataViewRowCursor[] CreateSet(TextLoader parent, IMultiStreamSourc
SetupCursor(parent, active, n, out srcNeeded, out cthd);
Contracts.Assert(cthd > 0);

var reader = new LineReader(files, BatchSize, 100, parent.HasHeader, parent._maxRows, cthd);
var reader = new LineReader(files, BatchSize, 100, parent.HasHeader, parent.ReadMultilines, parent._separators, parent._maxRows, cthd);
var stats = new ParseStats(parent._host, cthd);
if (cthd <= 1)
return new DataViewRowCursor[1] { new Cursor(parent, stats, active, reader, srcNeeded, 1) };
Expand Down Expand Up @@ -204,7 +205,7 @@ public override ValueGetter<DataViewRowId> GetIdGetter()
};
}

public static void GetSomeLines(IMultiStreamSource source, int count, ref List<ReadOnlyMemory<char>> lines)
public static void GetSomeLines(IMultiStreamSource source, int count, bool readMultilines, char[] separators, ref List<ReadOnlyMemory<char>> lines)
{
Contracts.AssertValue(source);
Contracts.Assert(count > 0);
Expand All @@ -214,7 +215,7 @@ public static void GetSomeLines(IMultiStreamSource source, int count, ref List<R
count = 2;

LineBatch batch;
var reader = new LineReader(source, count, 1, false, count, 1);
var reader = new LineReader(source, count, 1, false, readMultilines, separators, count, 1);
try
{
batch = reader.GetBatch();
Expand Down Expand Up @@ -401,6 +402,8 @@ private sealed class LineReader
{
private readonly long _limit;
private readonly bool _hasHeader;
private readonly bool _readMultilines;
private readonly char[] _separators;
private readonly int _batchSize;
private readonly IMultiStreamSource _files;

Expand All @@ -410,7 +413,7 @@ private sealed class LineReader
private Task _thdRead;
private volatile bool _abort;

public LineReader(IMultiStreamSource files, int batchSize, int bufSize, bool hasHeader, long limit, int cref)
public LineReader(IMultiStreamSource files, int batchSize, int bufSize, bool hasHeader, bool readMultilines, char[] separators, long limit, int cref)
{
// Note that files is allowed to be empty.
Contracts.AssertValue(files);
Expand All @@ -423,6 +426,8 @@ public LineReader(IMultiStreamSource files, int batchSize, int bufSize, bool has
_limit = limit;
_hasHeader = hasHeader;
_batchSize = batchSize;
_readMultilines = readMultilines;
_separators = separators;
_files = files;
_cref = cref;

Expand Down Expand Up @@ -464,6 +469,176 @@ public LineBatch GetBatch()
throw Contracts.ExceptDecode(batch.Exception, "Stream reading encountered exception");
}

private class MultiLineReader
{
private readonly char _sep0;
private readonly char[] _separators;
private readonly bool _sepsContainsSpace;
private readonly StringBuilder _sb;
private readonly TextReader _rdr;

public MultiLineReader(TextReader rdr, char[] separators)
{
Contracts.AssertNonEmpty(separators);
_sep0 = separators[0];
_separators = separators;
_sepsContainsSpace = IsSep(' ');
_sb = new StringBuilder();
_rdr = rdr;
}

// When reading lines that contain quoted fields, the quoted fields can contain
// '\n' so we we'll need to read multiple lines (multilines) to get all the fields
// of a given row.
public string ReadMultiLine(long lineNum, bool ignoreHashLine)
{
string line;
line = _rdr.ReadLine();

// if it was an empty line or if we've reached the end of file (i.e. line = null)
if (string.IsNullOrEmpty(line))
return line;

// In ML.NET we filter out lines beginning with // and # at the beginning of the file
// Or lines beginning with // elsewhere in the file.
// Thus, we don't care to check if there's a quoted multiline when the line begins with
// these chars.
if (line[0] == '/' && line[1] == '/')
return line;
if (ignoreHashLine && line[0] == '#')
return line;
Copy link
Contributor

@justinormont justinormont May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these check if you're at the top of the file? I think once you've hit the first row of data, no more ignoring of rows is allowed. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current behavior is that at the beginning of the file both // and # are ignored by LineReader link, but after the first row of data is hit, LineReader only ignores // link

Because of this, I actually added the ignoreHashLine flag on this method, and use it on the code you've pointed to.

One of the tests I've added on this PR actually shows that "// lines" are ignored correctly throughout the file.


In reply to: 425748062 [](ancestors = 425748062)


// Get more lines until the last field of the line doesn't contain its newline
// inside a quoted field
bool lastFieldIncludesNewLine = LastFieldIncludesNewLine(line, false);
if (!lastFieldIncludesNewLine)
return line;

_sb.Clear();
_sb.Append(line);
while (lastFieldIncludesNewLine)
{
line = _rdr.ReadLine();

if (line == null)
throw new EndOfStreamException($"A quoted field opened on line {lineNum} was never closed, and we've read to the last line in the file without finding the closing quote");

_sb.Append("\n");
_sb.Append(line);
lastFieldIncludesNewLine = LastFieldIncludesNewLine(line, true);
}

return _sb.ToString();
}

// The startsInsideQuoted parameter indicates if the last field of the previous line
// ended in a quoted field which included the newline character,
// if it is true, then the beginning of this line is considered to be part
// of the last field of the previous line.
public bool LastFieldIncludesNewLine(string line, bool startsInsideQuoted = false)
{
if (line.Length == 0)
return startsInsideQuoted;

int ichCur = 0;
int ichLim = line.Length;
bool quotingError = false;

bool ret = FieldIncludesNewLine(ref line, ref ichCur, ichLim, ref quotingError, startsInsideQuoted);
while (ichCur < ichLim)
{
ret = FieldIncludesNewLine(ref line, ref ichCur, ichLim, ref quotingError, false);
if(quotingError)
return false;

// Skip empty fields
while (ichCur < ichLim && IsSep(line[ichCur]))
ichCur++;
}

return ret;
}

private bool FieldIncludesNewLine(ref string line, ref int ichCur, int ichLim,
ref bool quotingError, bool startsInsideQuoted)
{
if (!startsInsideQuoted && !_sepsContainsSpace)
{
// Ignore leading spaces
while (ichCur < ichLim && line[ichCur] == ' ')
ichCur++;
}

if(startsInsideQuoted || line[ichCur] == '"')
{
// Quoted Field Case

if (!startsInsideQuoted)
ichCur++;

for (; ; ichCur++)
{
if (ichCur >= ichLim)
// We've reached the end of the line without finding the closing quote,
// so next line will start on this quoted field
return true;
Comment on lines +572 to +584
Copy link
Member Author

@antoniovs1029 antoniovs1029 May 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rewritten the logic in Multilinereader, in order to be more flexible in what we accept (following this comment: #5125 (comment) )

In principle, it's the same as in previous iterations of the PR: simply read every character in the read line, and decide if we should read more lines as part of the current row. But now this code mimics the one found in TextLoader.Parser.HelperImpl.FetchNextField() to actually make a distinction between quoted and non-quoted fields.

So now the behavior has changed for the following scenarios, which are now accepted by Multilinereader without much problem (although the parser might throw or log an error when processing these). Notice that these cases are actually not accepted by the RFC and can be considered invalid formats, so we actually have liberty on how we handle this, and I think the way I've handled it will cause the least negative impact for the users:

  1. We'll accept " inside unquoted fields, and treat those quotes as if they were any other character. This one used to cause the multilinereader to think it was opening a quoted field, and potentially read the rest of the file into this line. Now it knows it's not opening a quoted field.
Label,Size,Age
true,5.4",5
  1. We'll accept characters between a closing quote in a quoted field and the next separator. This case is actually always going to cause a QuotingError on the Parser when parsing the row which will make the parser to return an empty row instead (notice that quoting errors don't cause exceptions). Because of this, I set quotingError as true in here in the Multilinereader, which means that the rest of the row after the invalid closing quote, will be ignored:
Label,Size,Age
true,"5."4,5 // this will be read as a single line, and then the parser will return it as empty
false,"2.3",8// this will be read correctly
  1. A positive side effect of point number 2, including the quotingError mechanism, is that now it's much harder for the multilinereader to read a complete file when there's a quoting error. Now if there's a quoted field without closing quote, it will read it until the next quote, and will consider the rest of that line and all the read lines as a single row (which will produce a QuotingError on the parser). And so the following lines will remain unaffected. Previous behavior for this particular case would actually have been to load the rest of the file as part of the first row:
// the first 3 rows will be read as single row, and later the parser will have a quoting error there
// after that all rows will be read correctly.
Label,Size,Age
true,"5.4,5 
true,2,4
false,"2.3",8
true,"3.5",67
...
// Same behavior as above:
Label,Size,Age
true,"5.4,5 
true,2,4
false,"2.3,8
true,"3.5",67
...


if (line[ichCur] == '"')
{
if (++ichCur >= ichLim)
// Last character in line was the closing quote of the field
return false;

if (line[ichCur] == '"')
// 2 Double quotes means escaped quote
continue;

// If it wasn't an escaped quote, then this is supposed to be
// the closing quote of the field, and there should only be spaces remaining
// until the next separator.

if (!_sepsContainsSpace)
{
// Ignore leading spaces
while (ichCur < ichLim && line[ichCur] == ' ')
ichCur++;
}

// If there's anything else than spaces or the next separator,
// this will actually be a QuotingError on the parser, so we decide that this
// line contains a quoting error, and so it's not going to be considered a valid field
// and the rest of the line should be ignored.
if (ichCur >= ichLim || IsSep(line[ichCur]))
return false;

quotingError = true;
return false;
}
}
}

// Unquoted field case.
// An unquoted field shouldn't contain new lines
while(ichCur < ichLim && !IsSep(line[ichCur]))
{
ichCur++;
}
return false;
}

private bool IsSep(char ch)
{
Copy link
Contributor

@harishsk harishsk May 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is line.ToCharArray().Count(c => c == ch) a better approach? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it, but I read here that actually the most efficient way is to iterate over the chars:
https://stackoverflow.com/a/10391548


In reply to: 425452787 [](ancestors = 425452787)

Copy link
Contributor

@justinormont justinormont May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If perf testing shows excessive time spent in this function, you could argument the existing C++/assembly intrinsics w/ a character counter. I think _mm256_cmpeq_epi8 would work to accelerate the comparison. #Resolved

if (ch == _sep0)
return true;
for (int i = 1; i < _separators.Length; i++)
{
if (ch == _separators[i])
return true;
}
return false;
}
}

private void ThreadProc()
{
Contracts.Assert(_batchSize >= 2);
Expand All @@ -480,14 +655,19 @@ private void ThreadProc()
string path = _files.GetPathOrNull(ifile);
using (var rdr = _files.OpenTextReader(ifile))
{
var multilineReader = new MultiLineReader(rdr, _separators);
string text;
long line = 0;
for (; ; )
{
// REVIEW: Avoid allocating a string for every line. This would probably require
// introducing a CharSpan type (similar to ReadOnlyMemory but based on char[] or StringBuilder)
// and implementing all the necessary conversion functionality on it. See task 3871.
text = rdr.ReadLine();
if (_readMultilines)
text = multilineReader.ReadMultiLine(line, true);
else
text = rdr.ReadLine();

if (text == null)
goto LNext;
line++;
Expand All @@ -514,7 +694,11 @@ private void ThreadProc()
if (_abort)
return;

text = rdr.ReadLine();
if (_readMultilines)
text = multilineReader.ReadMultiLine(line, false);
else
text = rdr.ReadLine();

if (text == null)
{
// We're done with this file. Queue the last partial batch.
Expand Down
5 changes: 5 additions & 0 deletions src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs
Original file line number Diff line number Diff line change
Expand Up @@ -1161,6 +1161,11 @@ private bool FetchNextField(ref ScanInfo scan, ReadOnlySpan<char> span)
scan.QuotingError = true;
break;
}

// The logic below allow us to escape quotes (") inside quoted
// fields by using doublo quotes (""). I.e. when the loader
// encounters "" inside a quoted field, it will output only one "
// and continue parsing the rest of the field.
if (span[ichCur] == '"')
Copy link
Contributor

@justinormont justinormont May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to expose this character as a parameter. The other common choice is backslash escaping. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To implement the feature you're suggesting on this comment I do believe that we only need to change the code you've pointed to in here, so it's kind of straightforward.

Right now I am going to focus on making sure this PR works well with AutoML/Modelbuilder, that TextSaver works with new lines and that we handle correctly the case of having a badly formatted input file. If I can tackle this with enough time before the next release, then I will try to implement this suggestion.


In reply to: 425777440 [](ancestors = 425777440)

Copy link
Member Author

@antoniovs1029 antoniovs1029 May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change in plans, I won't address surfacing the new readMultiline option to AutoML/ModelBuilder on this PR. So now I will actually focus on the other things, including implementing this escapechar option before the next release. See #5125 (comment)


In reply to: 425777440 [](ancestors = 425777440)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to myself: I've just realized that to implement the escapechar feature, then I'll also need to change the logic in the MultilineReader, because if the escapechar isn't " then checking for even number of quotes won't be enough.


In reply to: 425777440 [](ancestors = 425777440)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, I'll address this in an upcoming PR.


In reply to: 426128704 [](ancestors = 426128704,425777440)

{
if (ichCur > ichRun)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -64,7 +64,7 @@ public static TextLoader CreateTextLoader(this DataOperationsCatalog catalog,
HasHeader = hasHeader,
AllowQuoting = allowQuoting,
TrimWhitespace = trimWhitespace,
AllowSparse = allowSparse
AllowSparse = allowSparse,
};

return new TextLoader(CatalogUtils.GetEnvironment(catalog), options: options, dataSample: dataSample);
Expand Down
Loading