Add PartitionedFileLoader #61

Merged: 32 commits, Jun 7, 2018
Changes from 23 commits
Commits
a07091b
Add PartitionedFileLoader
tyclintw May 7, 2018
67a358c
Roll back to the original DataType.
tyclintw May 7, 2018
0bc8a2b
Address comments.
tyclintw May 11, 2018
ce3edce
Add exception handling for failed loader.
tyclintw May 11, 2018
eebf207
Merge branch 'master' into tyclintw/partitionedloader
tyclintw May 11, 2018
bcd4aad
Fix Generator issues. This change is a hack and will be addressed …
tyclintw May 11, 2018
748ffe7
Address comments.
tyclintw May 11, 2018
4549388
Remove unused namespaces.
tyclintw May 11, 2018
781a45e
Move subLoader to a byteArray so we aren't recreating with args.
tyclintw May 15, 2018
e54698a
Save and load ISchema directly instead of the Column [].
tyclintw May 15, 2018
885ff30
Update help text for clarity.
tyclintw May 15, 2018
bbf8de8
Fix linux test failures.
tyclintw May 15, 2018
9a2d641
Merge branch 'master' into tyclintw/partitionedloader
tyclintw May 16, 2018
225b7ee
Force path output to be unix formatted for consistency between OS tests.
tyclintw May 16, 2018
4497c35
Fix ZBaselines release folder name.
tyclintw May 16, 2018
1e01903
Sort file listings to guarantee "Expand" ordering across operating sy…
tyclintw May 16, 2018
10f47b4
Whitespace.
tyclintw May 16, 2018
90bedc4
Move test files from Samples to test/data
tyclintw May 22, 2018
c0467e6
Address comments.
tyclintw May 22, 2018
674b5cb
Modify exception handling to use Contracts instead.
tyclintw May 23, 2018
5265c90
Add UnescapeDataString to relative path method.
tyclintw May 23, 2018
a040b51
Rename PathUtils to prevent name conflicts.
tyclintw May 23, 2018
fe6ca03
Address comments.
tyclintw May 23, 2018
097086f
Fix ExceptParam call.
tyclintw May 24, 2018
fe3229f
Merge branch 'master' into tyclintw/partitionedloader
tyclintw May 30, 2018
d3997cb
Merge branch 'master' into tyclintw/partitionedloader
tyclintw May 30, 2018
3c5d6d8
Merge branch 'tyclintw/partitionedloader' of https://github.com/tycli…
tyclintw May 30, 2018
5edd446
Merge branch 'master' into tyclintw/partitionedloader
tyclintw May 31, 2018
c1a5897
Move ZBaselines to new test\BaselineOutput location.
tyclintw May 31, 2018
7d09e32
Merge branch 'tyclintw/partitionedloader' of https://github.com/tycli…
tyclintw May 31, 2018
d9905bb
address comments
tyclintw Jun 1, 2018
ec92ecd
Modify all Exceptions to use Contracts.Exception.
tyclintw Jun 5, 2018
134 changes: 134 additions & 0 deletions ZBaselines/Common/EntryPoints/core_manifest.json
@@ -21686,6 +21686,140 @@
}
]
},
{
"Kind": "PartitionedPathParser",
"Components": [
{
"Name": "ParquetPathParser",
"Desc": "Extract name/value pairs from Parquet formatted directory names. Example path: Year=2018/Month=12/data1.parquet",
"FriendlyName": "Parquet Partitioned Path Parser",
"Aliases": [
"ParqPP"
],
"Settings": []
},
{
"Name": "SimplePathParser",
"Desc": "A simple parser that extracts directory names as column values. Column names are defined as arguments.",
"FriendlyName": "Simple Partitioned Path Parser",
"Aliases": [
"SmplPP"
],
"Settings": [
{
"Name": "Columns",
"Type": {
"Kind": "Array",
"ItemType": {
"Kind": "Struct",
"Fields": [
{
"Name": "Name",
"Type": "String",
"Desc": "Name of the column.",
"Required": true,
"SortOrder": 150.0,
"IsNullable": false
},
{
"Name": "Type",
"Type": {
"Kind": "Enum",
"Values": [
"I1",
"U1",
"I2",
"U2",
"I4",
"U4",
"I8",
"U8",
"R4",
"Num",
"R8",
"TX",
"Text",
"TXT",
"BL",
"Bool",
"TimeSpan",
"TS",
"DT",
"DateTime",
"DZ",
"DateTimeZone",
"UG",
"U16"
]
},
"Desc": "Data type of the column.",
"Required": false,
"SortOrder": 150.0,
"IsNullable": true,
"Default": null
},
{
"Name": "Source",
"Type": "Int",
"Desc": "Index of the directory representing this column.",
"Required": true,
"SortOrder": 150.0,
"IsNullable": false,
"Default": 0
}
]
}
},
"Desc": "Column definitions used to override the Partitioned Path Parser. Expected with the format name:type:numeric-source, e.g. col=MyFeature:R4:1",
"Aliases": [
"col"
],
"Required": false,
"SortOrder": 1.0,
"IsNullable": false,
"Default": null
},
{
"Name": "Type",
"Type": {
"Kind": "Enum",
"Values": [
"I1",
"U1",
"I2",
"U2",
"I4",
"U4",
"I8",
"U8",
"R4",
"Num",
"R8",
"TX",
"Text",
"TXT",
"BL",
"Bool",
"TimeSpan",
"TS",
"DT",
"DateTime",
"DZ",
"DateTimeZone",
"UG",
"U16"
]
},
"Desc": "Data type of each column.",
"Required": false,
"SortOrder": 150.0,
"IsNullable": false,
"Default": "TX"
}
]
}
]
},
{
"Kind": "RegressionLossFunction",
"Components": [
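The ParquetPathParser description in the manifest says it extracts name/value pairs from directory names such as Year=2018/Month=12/data1.parquet. The following is a minimal sketch of that idea only, not the parser added by this PR. The class and method names are hypothetical, and skipping directories without an '=' is an assumption made for the sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not the ParquetPathParser from this PR.
public static class PartitionPathSketch
{
    // Returns the name/value pairs found in directory segments shaped like "Name=Value".
    // The final segment is treated as the file name and is never parsed.
    public static List<KeyValuePair<string, string>> ParseNameValuePairs(string relativePath)
    {
        if (relativePath == null)
            throw new ArgumentNullException(nameof(relativePath));

        var pairs = new List<KeyValuePair<string, string>>();

        // Accept both separators so Windows and Unix style paths split the same way.
        string[] segments = relativePath.Split(
            new[] { '/', '\\' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < segments.Length - 1; i++)
        {
            int eq = segments[i].IndexOf('=');
            if (eq <= 0)
                continue; // Assumption: directories without a name are skipped.

            string name = segments[i].Substring(0, eq);
            string value = segments[i].Substring(eq + 1);
            pairs.Add(new KeyValuePair<string, string>(name, value));
        }

        return pairs;
    }
}
```

Under these assumptions, calling ParseNameValuePairs("Year=2018/Month=12/data1.parquet") yields two pairs, Year=2018 and Month=12, and ignores data1.parquet.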
@@ -0,0 +1,14 @@
#@ TextLoader{
#@ header+
#@ sep=tab
#@ col=L0:TX:0
#@ col=Year:TX:1
#@ col=Month:TX:2
#@ }
L0 Year Month
0 2017 01
4 2017 01
6 2017 01
21 2017 02
23 2017 02
25 2017 02
@@ -0,0 +1,5 @@
---- PartitionedFileLoader ----
3 columns:
L0: Text
Year: Text
Month: Text
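The Year and Month values in the baseline above look like directory names taken from paths shaped like 2017/01/data1.csv, which is what the SimplePathParser description ("extracts directory names as column values") suggests. The following is a rough sketch of that mapping, not the PR's implementation. The names are hypothetical, and treating Source as a zero-based index counted from the start of the relative path is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not the SimplePathParser from this PR.
public static class SimplePathSketch
{
    // columns: pairs of (column name, source), where source is assumed to be the
    // zero-based index of a directory within relativePath.
    public static Dictionary<string, string> ExtractColumns(
        string relativePath, IReadOnlyList<KeyValuePair<string, int>> columns)
    {
        if (relativePath == null)
            throw new ArgumentNullException(nameof(relativePath));
        if (columns == null)
            throw new ArgumentNullException(nameof(columns));

        string[] segments = relativePath.Split(
            new[] { '/', '\\' }, StringSplitOptions.RemoveEmptyEntries);

        // The last segment is the file name, so it is not a valid source.
        int dirCount = Math.Max(segments.Length - 1, 0);

        var values = new Dictionary<string, string>();
        foreach (var col in columns)
        {
            if (col.Value < 0 || col.Value >= dirCount)
            {
                throw new ArgumentOutOfRangeException(
                    nameof(columns),
                    $"No directory at index {col.Value} in '{relativePath}'.");
            }
            values[col.Key] = segments[col.Value];
        }

        return values;
    }
}
```

With columns (Year, 0) and (Month, 1), the path 2017/01/data1.csv maps to Year = "2017" and Month = "01" as text, which matches the shape of the rows in the baseline.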
@@ -0,0 +1,16 @@
#@ TextLoader{
#@ header+
#@ sep=tab
#@ col=L0:I4:0
#@ col=Month:I4:1
#@ col=Path:TX:2
#@ }
L0 Month Path
1 1 2017/01/data1.csv
5 1 2017/01/data2.csv
7 1 2017/01/data2.csv
0 1 2017/01/dataBadSchema.csv
0 1 2017/01/dataBadSchema.csv
22 2 2017/02/data1.csv
24 2 2017/02/data1.csv
26 2 2017/02/data1.csv
@@ -0,0 +1,5 @@
---- PartitionedFileLoader ----
3 columns:
L0: I4
Month: I4
Path: Text
@@ -0,0 +1,14 @@
#@ TextLoader{
#@ header+
#@ sep=tab
#@ col=L0:TX:0
#@ col=Year:TX:1
#@ col=Month:TX:2
#@ }
L0 Year Month
0 2017 01
4 2017 01
6 2017 01
21 2017 02
23 2017 02
25 2017 02
@@ -0,0 +1,5 @@
---- PartitionedFileLoader ----
3 columns:
L0: Text
Year: Text
Month: Text
@@ -0,0 +1,16 @@
#@ TextLoader{
#@ header+
#@ sep=tab
#@ col=L0:I4:0
#@ col=Month:I4:1
#@ col=Path:TX:2
#@ }
L0 Month Path
1 1 2017/01/data1.csv
5 1 2017/01/data2.csv
7 1 2017/01/data2.csv
0 1 2017/01/dataBadSchema.csv
0 1 2017/01/dataBadSchema.csv
22 2 2017/02/data1.csv
24 2 2017/02/data1.csv
26 2 2017/02/data1.csv
@@ -0,0 +1,5 @@
---- PartitionedFileLoader ----
3 columns:
L0: I4
Month: I4
Path: Text
10 changes: 7 additions & 3 deletions src/Microsoft.ML.Core/Utilities/PathUtils.cs
@@ -1,4 +1,4 @@
-// Licensed to the .NET Foundation under one or more agreements.
+// Licensed to the .NET Foundation under one or more agreements.
 // The .NET Foundation licenses this file to you under the MIT license.
 // See the LICENSE file in the project root for more information.

@@ -67,13 +67,17 @@ public static string FindExistentFileOrNull(string fileName, string folderPrefix
 // 1. Search in customSearchDir.
 if (!string.IsNullOrWhiteSpace(customSearchDir)
     && TryFindFile(fileName, folderPrefix, customSearchDir, out candidate))
-    return candidate;
+{
+    return candidate;
+}

@TomFinley (Contributor), May 22, 2018:

Hmmm. I know you didn't write this code, but since the if condition is on multiple lines it really ought to be bracketed... if you have time could you fix it? The below if condition has the same problem I'm afraid. #Closed

Contributor Author:

No worries. I'm always happy to clean up stuff.

 // 2. Search in the path specified by the environment variable.
 var envDir = Environment.GetEnvironmentVariable(CustomSearchDirEnvVariable);
 if (!string.IsNullOrWhiteSpace(envDir)
     && TryFindFile(fileName, folderPrefix, envDir, out candidate))
-    return candidate;
+{
+    return candidate;
+}

// 3. Search in the path specified by the assemblyForBasePath.
if (assemblyForBasePath != null)
14 changes: 14 additions & 0 deletions src/Microsoft.ML.Data/Commands/DataCommand.cs
@@ -396,6 +396,20 @@ public static void SaveLoader(IDataLoader loader, IFileHandle file)
Contracts.CheckParam(file.CanWrite, nameof(file), "Must be writable");

using (var stream = file.CreateWriteStream())
{
SaveLoader(loader, stream);
}
}

/// <summary>
/// Saves <paramref name="loader"/> to the specified <paramref name="stream"/>.
/// </summary>
public static void SaveLoader(IDataLoader loader, Stream stream)
{
Contracts.CheckValue(loader, nameof(loader));
Contracts.CheckValue(stream, nameof(stream));
Contracts.CheckParam(stream.CanWrite, nameof(stream), "Must be writable");

using (var rep = RepositoryWriter.CreateNew(stream))
{
ModelSaveContext.SaveModel(rep, loader, ModelFileUtils.DirDataLoaderModel);
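The hunk above makes the file-based SaveLoader delegate to a new Stream overload. One likely benefit, in line with the commit "Move subLoader to a byteArray", is that a caller can serialize into memory instead of to disk. The following is a self-contained sketch of that overload pattern, not ML.NET code. ISaveableSketch, SaveSketch and TextPayload are hypothetical stand-ins, and plain argument exceptions replace the Contracts checks.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical stand-in for something that can persist itself; not an ML.NET type.
public interface ISaveableSketch
{
    void Save(Stream stream);
}

// Tiny example implementation that writes its text as UTF-8 bytes.
public sealed class TextPayload : ISaveableSketch
{
    private readonly string _text;

    public TextPayload(string text)
    {
        _text = text ?? throw new ArgumentNullException(nameof(text));
    }

    public void Save(Stream stream)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(_text);
        stream.Write(bytes, 0, bytes.Length);
    }
}

public static class SaveSketch
{
    // File-based overload: opens the stream, then defers to the Stream overload.
    public static void SaveTo(ISaveableSketch item, string path)
    {
        if (item == null) throw new ArgumentNullException(nameof(item));
        if (path == null) throw new ArgumentNullException(nameof(path));

        using (var stream = File.Create(path))
        {
            SaveTo(item, stream);
        }
    }

    // Stream overload: holds the validation and the actual write.
    public static void SaveTo(ISaveableSketch item, Stream stream)
    {
        if (item == null) throw new ArgumentNullException(nameof(item));
        if (stream == null) throw new ArgumentNullException(nameof(stream));
        if (!stream.CanWrite) throw new ArgumentException("Must be writable", nameof(stream));

        item.Save(stream);
    }

    // With a Stream overload available, an in-memory byte array needs no temp file.
    public static byte[] ToBytes(ISaveableSketch item)
    {
        using (var memory = new MemoryStream())
        {
            SaveTo(item, memory);
            return memory.ToArray();
        }
    }
}
```

Keeping the checks in the Stream overload means both entry points validate their arguments the same way, which appears to be the shape the diff follows.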