Skip to content

Add PartitionedFileLoader #61

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 32 commits into from
Jun 7, 2018
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
a07091b
Add PartitionedFileLoader
tyclintw May 7, 2018
67a358c
Roll back to the original DataType.
tyclintw May 7, 2018
0bc8a2b
Address comments.
tyclintw May 11, 2018
ce3edce
Add exception handling for failed loader.
tyclintw May 11, 2018
eebf207
Merge branch 'master' into tyclintw/partitionedloader
tyclintw May 11, 2018
bcd4aad
Fix Generator issues. This is change is a hack and will be addressed …
tyclintw May 11, 2018
748ffe7
Address comments.
tyclintw May 11, 2018
4549388
Remove unused namespaces.
tyclintw May 11, 2018
781a45e
Move subLoader to a byteArray so we aren't recreating with args.
tyclintw May 15, 2018
e54698a
Save and load ISchema directly instead of the Column [].
tyclintw May 15, 2018
885ff30
Update help text for clarity.
tyclintw May 15, 2018
bbf8de8
Fix linux test failures.
tyclintw May 15, 2018
9a2d641
Merge branch 'master' into tyclintw/partitionedloader
tyclintw May 16, 2018
225b7ee
Force path output to be unix formatted for consistency between OS tests.
tyclintw May 16, 2018
4497c35
Fix ZBaselines release folder name.
tyclintw May 16, 2018
1e01903
Sort file listings to guarantee "Expand" ordering across operating sy…
tyclintw May 16, 2018
10f47b4
Whitespace.
tyclintw May 16, 2018
90bedc4
Move test files from Samples to test/data
tyclintw May 22, 2018
c0467e6
Address comments.
tyclintw May 22, 2018
674b5cb
Modify exception handling to use Contracts instead.
tyclintw May 23, 2018
5265c90
Add UnescapeDataString to realtive path method.
tyclintw May 23, 2018
a040b51
Rename PathUtils to prevent name conflicts.
tyclintw May 23, 2018
fe6ca03
Address comments.
tyclintw May 23, 2018
097086f
Fix ExceptParam call.
tyclintw May 24, 2018
fe3229f
Merge branch 'master' into tyclintw/partitionedloader
tyclintw May 30, 2018
d3997cb
Merge branch 'master' into tyclintw/partitionedloader
tyclintw May 30, 2018
3c5d6d8
Merge branch 'tyclintw/partitionedloader' of https://github.com/tycli…
tyclintw May 30, 2018
5edd446
Merge branch 'master' into tyclintw/partitionedloader
tyclintw May 31, 2018
c1a5897
Move ZBaselines to new test\BaselineOutput location.
tyclintw May 31, 2018
7d09e32
Merge branch 'tyclintw/partitionedloader' of https://github.com/tycli…
tyclintw May 31, 2018
d9905bb
address comments
tyclintw Jun 1, 2018
ec92ecd
Modify all Exceptions to use Contracts.Exception.
tyclintw Jun 5, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions Samples/Partitioned/Named/Year=2017/Month=01/data1.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
col1, col2
0, 1
3 changes: 3 additions & 0 deletions Samples/Partitioned/Named/Year=2017/Month=01/data2.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
col1, col2
4, 5
6, 7
1 change: 1 addition & 0 deletions Samples/Partitioned/Named/Year=2017/Month=01/dataEmpty.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
col1, col2
4 changes: 4 additions & 0 deletions Samples/Partitioned/Named/Year=2017/Month=02/data1.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
col1, col2
21, 22
23, 24
25, 26
4 changes: 4 additions & 0 deletions Samples/Partitioned/Named/Year=2017/TestBadDir/data1.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
col1, col2
21, 22
23, 24
25, 26
2 changes: 2 additions & 0 deletions Samples/Partitioned/Unnamed/2017/01/data1.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
col1, col2
0, 1
3 changes: 3 additions & 0 deletions Samples/Partitioned/Unnamed/2017/01/data2.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
col1, col2
4, 5
6, 7
3 changes: 3 additions & 0 deletions Samples/Partitioned/Unnamed/2017/01/dataBadSchema.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
col1
11
12
4 changes: 4 additions & 0 deletions Samples/Partitioned/Unnamed/2017/02/data1.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
col1, col2
21, 22
23, 24
25, 26
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
#@ TextLoader{
#@ header+
#@ sep=tab
#@ col=L0:TX:0
#@ col=Year:TX:1
#@ col=Month:TX:2
#@ }
L0 Year Month
0 2017 01
4 2017 01
6 2017 01
21 2017 02
23 2017 02
25 2017 02
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---- PartitionedFileLoader ----
3 columns:
L0: Text
Year: Text
Month: Text
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
#@ TextLoader{
#@ header+
#@ sep=tab
#@ col=L0:I4:0
#@ col=Month:I4:1
#@ col=Path:TX:2
#@ }
L0 Month Path
1 1 2017\01\data1.csv
5 1 2017\01\data2.csv
7 1 2017\01\data2.csv
0 1 2017\01\dataBadSchema.csv
0 1 2017\01\dataBadSchema.csv
22 2 2017\02\data1.csv
24 2 2017\02\data1.csv
26 2 2017\02\data1.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---- PartitionedFileLoader ----
3 columns:
L0: I4
Month: I4
Path: Text
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
#@ TextLoader{
#@ header+
#@ sep=tab
#@ col=L0:TX:0
#@ col=Year:TX:1
#@ col=Month:TX:2
#@ }
L0 Year Month
0 2017 01
4 2017 01
6 2017 01
21 2017 02
23 2017 02
25 2017 02
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---- PartitionedFileLoader ----
3 columns:
L0: Text
Year: Text
Month: Text
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
#@ TextLoader{
#@ header+
#@ sep=tab
#@ col=L0:I4:0
#@ col=Month:I4:1
#@ col=Path:TX:2
#@ }
L0 Month Path
1 1 2017\01\data1.csv
5 1 2017\01\data2.csv
7 1 2017\01\data2.csv
0 1 2017\01\dataBadSchema.csv
0 1 2017\01\dataBadSchema.csv
22 2 2017\02\data1.csv
24 2 2017\02\data1.csv
26 2 2017\02\data1.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---- PartitionedFileLoader ----
3 columns:
L0: I4
Month: I4
Path: Text
66 changes: 63 additions & 3 deletions src/Microsoft.ML.Core/Utilities/PathUtils.cs
Original file line number Diff line number Diff line change
@@ -1,9 +1,11 @@
// Licensed to the .NET Foundation under one or more agreements.
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;

namespace Microsoft.ML.Runtime.Internal.Utilities
Expand Down Expand Up @@ -67,13 +69,13 @@ public static string FindExistentFileOrNull(string fileName, string folderPrefix
// 1. Search in customSearchDir.
if (!string.IsNullOrWhiteSpace(customSearchDir)
&& TryFindFile(fileName, folderPrefix, customSearchDir, out candidate))
return candidate;
return candidate;

Copy link
Contributor

@TomFinley TomFinley May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. I know you didn't write this code, but since the if condition is on multiple lines it really ought to be bracketed... if you have time could you fix it? The below if condition has the same problem I'm afraid. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries. I'm always happy to clean up stuff.

// 2. Search in the path specified by the environment variable.
var envDir = Environment.GetEnvironmentVariable(CustomSearchDirEnvVariable);
if (!string.IsNullOrWhiteSpace(envDir)
&& TryFindFile(fileName, folderPrefix, envDir, out candidate))
return candidate;
return candidate;

// 3. Search in the path specified by the assemblyForBasePath.
if (assemblyForBasePath != null)
Expand Down Expand Up @@ -139,5 +141,63 @@ public static string CreateFolderIfNotExists(string folder)

return null;
}

/// <summary>
/// Make a full path realtive to a base path.
/// </summary>
/// <param name="basepath">The base path, assumed to be a directory.</param>
/// <param name="path">The full path.</param>
/// <returns>The relative path.</returns>
/// <exception cref="ArgumentException">If the paths are not relative.</exception>
public static string MakePathRelative(string basepath, string path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MakePathRelative [](start = 29, length = 16)

We might prefer to have at least two distinct usages of some facility before it is "promoted" to utilities, for two practical reasons:

  1. If there's only usage from one assembly, it could have been an internal method and not increased our public surface.

  2. Adding it to the public surface places a bit more emphasis on making sure that it is "right." My confidence that we got the "right" abstraction for any hypothetical future usage is somewhat low.

What do we think about keeping this internal so that it can be used from PartitionedFileLoader and Parser, but no further?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no object to this, however I can't just make the methods internal since the usage is in a different assembly. So I had to move it to a different file to be internal.

Copy link
Contributor

@TomFinley TomFinley May 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, right, that's what I meant, thanks Tyler -- maybe made with different class name to avoid name collisions in that namesapce though.


In reply to: 190063880 [](ancestors = 190063880)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved.

{
Contracts.AssertNonEmpty(basepath);
Contracts.AssertNonEmpty(path);

Uri baseUri = new Uri(basepath);
Uri uri = new Uri(path);

if (baseUri.Scheme != uri.Scheme)
{
throw new ArgumentException("Paths cannot be made relative as they are of different schemas.");
Copy link
Contributor

@glebuk glebuk May 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

schemas [](start = 100, length = 7)

schemE #Closed

Copy link
Contributor

@glebuk glebuk May 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArgumentException [](start = 26, length = 17)

can se use host.except instead of throw? Might need to have to pass env or host.

Copy link
Contributor Author

@tyclintw tyclintw May 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a Util class, it's my opinion it should be unaware of host. It would then be incumbent on the caller (who is aware of host) to pass the exception on to host.except. Is there a strong reason to avoid thowing that I'm not aware of? #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty please


In reply to: 187689960 [](ancestors = 187689960)

Copy link
Contributor

@TomFinley TomFinley May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Tyler! While we generally prefer that exceptions be thrown through host so that the context information of the call is preserved, we have not done this with the utils classes for one reason or another. So I think it's fine that you haven't.

However, it is still important that the exception be marked. E.g.: use Contracts.ExceptParam here. This is how, say, the command line or GUI distinguishes between an error we are raising (which we don't necessarily consider a bug, but potentially a sign of misused) vs. unmarked exceptions. The latter, we always consider bugs.

Note that we still use Contracts.Check and Contracts.Except in places for this sort of thing, in situations where an IExceptionContext of some form or other is not present. It is for this reason. (Indeed, prior to the introduction of the exception contexts this was the only mechanism we had.)


In reply to: 187710640 [](ancestors = 187710640)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that if you're using Contracts instead, it would be something like:

Contracts.CheckParam(baseUri.Scheme == uri.Scheme, nameof(basepath), "Paths cannot be made relative as they are of different schemes.");

Similar for the below, maybe.


In reply to: 189994300 [](ancestors = 189994300,187710640)

}

string relativePath;
try
{
if (!baseUri.AbsoluteUri.EndsWith("/"))
{
baseUri = new Uri(baseUri.AbsoluteUri + "/");
}

relativePath = baseUri.MakeRelativeUri(uri).ToString();
}
catch (ArgumentNullException e)
{
throw new ArgumentException("Paths could not be made relative.", e);
}
catch (InvalidOperationException e)
{
throw new ArgumentException("Paths could not be made relative.", e);
}

if (uri.Scheme.Equals("file", StringComparison.InvariantCultureIgnoreCase))
Copy link
Contributor

@TomFinley TomFinley May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringComparison.InvariantCultureIgnoreCase [](start = 42, length = 43)

It is generally best practice to use StringComparison.OrdinalIgnoreCase for this type of situation. See here, in particular, the bulleted list under "recommendations for string usage" has some words on when to use ordinal vs. invariant. (The situations where invariant is appropriate are extremely limited.) #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip.

{
relativePath = relativePath.Replace(Path.AltDirectorySeparatorChar, Path.DirectorySeparatorChar);
}

return relativePath;
}

/// <summary>
/// Split a path string into an enumerable list of the directories.
/// </summary>
/// <param name="path">The path string to split.</param>
/// <returns>An enumerable list of all non-empty directories.</returns>
public static IEnumerable<string> SplitDirectories(string path)
{
var cleanPath = path.Replace(Path.AltDirectorySeparatorChar, Path.DirectorySeparatorChar);
return cleanPath.Split(new char[] { Path.DirectorySeparatorChar }, StringSplitOptions.RemoveEmptyEntries);
Copy link
Contributor

@glebuk glebuk May 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new char[] { Path.DirectorySeparatorChar } [](start = 35, length = 42)

this creates garbage per call. Can we create a const/static var to recycle this char array? #Closed

}
}
}
Loading