Make separator char[] everywhere (previously the type was sometimes char) #2702
src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderSaverCatalog.cs
Codecov Report
@@ Coverage Diff @@
## master #2702 +/- ##
=========================================
Coverage ? 71.58%
=========================================
Files ? 805
Lines ? 142001
Branches ? 16125
=========================================
Hits ? 101651
Misses ? 35914
Partials ? 4436
@@ -1064,15 +1065,15 @@ private bool HasHeader
 /// </summary>
 /// <param name="env">The environment to use.</param>
 /// <param name="columns">Defines a mapping between input columns in the file and IDataView columns.</param>
-/// <param name="separatorChar"> The character used as separator between data points in a row. By default the tab character is used as separator.</param>
+/// <param name="separators"> The character used as separator between data points in a row. {'\t'} will be used if not specified.</param>
Re "The character used": Array of characters. #Resolved
@@ -1064,15 +1065,15 @@ private bool HasHeader
 /// </summary>
 /// <param name="env">The environment to use.</param>
 /// <param name="columns">Defines a mapping between input columns in the file and IDataView columns.</param>
-/// <param name="separatorChar"> The character used as separator between data points in a row. By default the tab character is used as separator.</param>
+/// <param name="separators"> The character used as separator between data points in a row. {'\t'} will be used if not specified.</param>
Re "{'\t'}": What is wrong with tab?
OK, what is the difference, from the user's standpoint, between an array with one element and a scalar?
In reply to: 259559041
Their types are obviously different, right? Users can't specify a tab when an array is required.
In reply to: 259559198
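To make the scalar-versus-array question concrete, here is a minimal sketch of the two call shapes under discussion. `SplitRowScalar` and `SplitRowArray` are hypothetical stand-ins invented for illustration, not ML.NET APIs; they only show what a caller has to type in each case.

```csharp
using System;

public static class SeparatorShapes
{
    // Hypothetical scalar-style signature: one separator, passed as a char.
    public static string[] SplitRowScalar(string row, char separatorChar = '\t')
        => row.Split(separatorChar);

    // Hypothetical array-style signature: one or more separators, passed as a char[].
    // A null argument falls back to { '\t' }, mirroring the doc comment in the diff.
    public static string[] SplitRowArray(string row, char[] separators = null)
        => row.Split(separators ?? new[] { '\t' });

    public static void Main()
    {
        const string row = "1\t2\t3";

        // Scalar parameter: the caller passes the character directly.
        Console.WriteLine(SplitRowScalar(row, '\t').Length);

        // Array parameter: the same request needs an explicit array.
        Console.WriteLine(SplitRowArray(row, new[] { '\t' }).Length);

        // SplitRowArray(row, '\t') would not compile: a char does not convert to char[].
    }
}
```

Both calls split the row the same way; the only difference is the extra `new[] { ... }` the array-typed parameter forces on the caller.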
@@ -45,7 +45,7 @@ public static class TextLoaderSaverCatalog
 /// Create a text loader <see cref="TextLoader"/> by inferencing the dataset schema from a data model type.
 /// </summary>
 /// <param name="catalog">The <see cref="DataOperationsCatalog"/> catalog.</param>
-/// <param name="separatorChar">Column separator character. Default is '\t'</param>
+/// <param name="separators">Column separator characters. {'\t'} will be used if not specified.</param>
Re "Column separator characters.": Please be consistent with the description for the same property across different methods. #Resolved
Since the most common scenario is that we have only one separator, not multiple separators, I think we should add an overload with a

That I am not sure. This sounds a
In reply to: 466604514

There are quite a few instances in which we added an overload in the catalog extensions.
machinelearning/src/Microsoft.ML.Data/Transforms/ExtensionsCatalog.cs
Lines 108 to 109 in 44c3113

The old version of
In fact, it is quite common in our extensions to have a "simple" scenario and an advanced options scenario, with two different extensions for the two scenarios.

Actually, now that I take a deeper look at it, I would keep the single

I don't quite agree. We have only one separator because all our text files are cleaned. Is this assumption applied to other places?
In reply to: 466606912

Hi @wschin, thanks for working on this. I will say that my initial thinking was that "multi-separator" files were fairly rare. (People know what TSV files are, people know what CSV files are, but I don't think too many people know or would be sympathetic to the idea of the "T or C SV" file.) For that reason, I think while the advanced options should continue to support multiple characters (I do know it does happen), the convenience API should keep the separator as
Also including @sfilipi and @rogancarr. If they think multi-separator files are common I will desist, but otherwise I would prefer to keep the simple (and common) thing simple.

I'm pretty sure. When I look back on DRI responses internally, I don't see the multi-separator capability used too often, even for user data. Sometimes it is relevant -- it was added for a reason -- but I wouldn't say commonly. Also its presence here significantly complicates the simpler scenario, since writing
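The "simple overload plus advanced options" pattern argued for above could be sketched roughly as follows. `SketchLoaderOptions` and `SketchLoader.ParseRow` are hypothetical names made up for this illustration and are not the ML.NET `TextLoader` API; the sketch assumes the convenience overload just forwards a single `char` to an options object that holds a `char[]`.

```csharp
using System;

// Hypothetical options type for the advanced scenario (not ML.NET's TextLoader.Options).
public sealed class SketchLoaderOptions
{
    // Advanced callers may list several separator characters.
    public char[] Separators { get; set; } = new[] { '\t' };
}

public static class SketchLoader
{
    // Advanced entry point: takes the full options object, including multiple separators.
    public static string[] ParseRow(string row, SketchLoaderOptions options)
        => row.Split(options.Separators);

    // Convenience entry point: a single char covers the common TSV/CSV case
    // and simply forwards to the advanced overload.
    public static string[] ParseRow(string row, char separatorChar = '\t')
        => ParseRow(row, new SketchLoaderOptions { Separators = new[] { separatorChar } });

    public static void Main()
    {
        // Common case: one separator, written as a plain char.
        Console.WriteLine(string.Join("|", ParseRow("a,b,c", ',')));

        // Rarer case: several separators, expressed through the options object.
        var options = new SketchLoaderOptions { Separators = new[] { ',', ';' } };
        Console.WriteLine(string.Join("|", ParseRow("a,b;c", options)));
    }
}
```

Under this split, the common call site stays a one-character argument, while multi-separator files remain supported through the options path.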
OK, so I will close this PR.
In reply to: 467074220
Fix #2472. This PR makes all separators a `char[]` instead of `char` in the public area of `TextLoader`. Note that `TextLoader`'s advanced `options` is already using `char[]`.