Skip to content

Make separator char[] everywhere (previous type is char sometime) #2702

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

wschin
Copy link
Member

@wschin wschin commented Feb 22, 2019

Fix #2472. This PR makes all separators a char[] instead of char in the public area of TextLoader. Note that TextLoader's advanced options is already using char[].

@wschin wschin added the API Issues pertaining the friendly API label Feb 22, 2019
@wschin wschin self-assigned this Feb 22, 2019
@codecov
Copy link

codecov bot commented Feb 23, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@b604b07). Click here to learn what that means.
The diff coverage is 95.34%.

@@            Coverage Diff            @@
##             master    #2702   +/-   ##
=========================================
  Coverage          ?   71.58%           
=========================================
  Files             ?      805           
  Lines             ?   142001           
  Branches          ?    16125           
=========================================
  Hits              ?   101651           
  Misses            ?    35914           
  Partials          ?     4436
Flag Coverage Δ
#Debug 71.58% <95.34%> (?)
#production 67.87% <84.61%> (?)
#test 85.73% <100%> (?)
Impacted Files Coverage Δ
...c/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs 26.77% <0%> (ø)
...rios/Api/Estimators/DecomposableTrainAndPredict.cs 100% <100%> (ø)
...osoft.ML.Functional.Tests/Datasets/TypeTestData.cs 96.47% <100%> (ø)
...ios/IrisPlantClassificationWithStringLabelTests.cs 98.63% <100%> (ø)
...L.Tests/Scenarios/Api/Estimators/Metacomponents.cs 100% <100%> (ø)
...ML.Tests/Scenarios/Api/Estimators/Extensibility.cs 100% <100%> (ø)
test/Microsoft.ML.Tests/OnnxConversionTest.cs 97.14% <100%> (ø)
...ML.Tests/TrainerEstimators/MetalinearEstimators.cs 100% <100%> (ø)
...cenariosWithDirectInstantiation/TensorflowTests.cs 91.76% <100%> (ø)
...sts/Scenarios/Api/Estimators/PredictAndMetadata.cs 100% <100%> (ø)
... and 6 more

@@ -1064,15 +1065,15 @@ private bool HasHeader
/// </summary>
/// <param name="env">The environment to use.</param>
/// <param name="columns">Defines a mapping between input columns in the file and IDataView columns.</param>
/// <param name="separatorChar"> The character used as separator between data points in a row. By default the tab character is used as separator.</param>
/// <param name="separators"> The character used as separator between data points in a row. {'\t'} will be used if not specified.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The character used [](start = 38, length = 18)

Array of characters. #Resolved

@@ -1064,15 +1065,15 @@ private bool HasHeader
/// </summary>
/// <param name="env">The environment to use.</param>
/// <param name="columns">Defines a mapping between input columns in the file and IDataView columns.</param>
/// <param name="separatorChar"> The character used as separator between data points in a row. By default the tab character is used as separator.</param>
/// <param name="separators"> The character used as separator between data points in a row. {'\t'} will be used if not specified.</param>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{'\t'} [](start = 100, length = 6)

what is wrong with tab ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tab is scalar, not an array.


In reply to: 259558700 [](ancestors = 259558700)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, what is difference from user standpoint between array with one element and scalar?


In reply to: 259559041 [](ancestors = 259559041,259558700)

Copy link
Member Author

@wschin wschin Feb 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Their types are obviously different, right? Users can't specify a tab when an array is required.


In reply to: 259559198 [](ancestors = 259559198,259559041,259558700)

@@ -45,7 +45,7 @@ public static class TextLoaderSaverCatalog
/// Create a text loader <see cref="TextLoader"/> by inferencing the dataset schema from a data model type.
/// </summary>
/// <param name="catalog">The <see cref="DataOperationsCatalog"/> catalog.</param>
/// <param name="separatorChar">Column separator character. Default is '\t'</param>
/// <param name="separators">Column separator characters. {'\t'} will be used if not specified.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column separator characters. [](start = 37, length = 28)

Please be consistent with description for same property across different methods. #Resolved

@artidoro
Copy link
Contributor

Since the most common scenario is that we only have one separator, and not multiple separators I think we should add an overload with a separatorChar instead of a separators array of chars.

@wschin
Copy link
Member Author

wschin commented Feb 23, 2019

That I am not sure. This sounds a convenient feature because it's already doable. In addition, I feel we should have only one way to do one thing.


In reply to: 466604514 [](ancestors = 466604514)

@artidoro
Copy link
Contributor

There are quite a few instances in which we added an overload in the catalog extensions.
See for example DropColumns and KeepColumns (

public static ColumnSelectingEstimator DropColumns(this TransformsCatalog catalog, params string[] columnsToDrop)
=> ColumnSelectingEstimator.DropColumns(CatalogUtils.GetEnvironment(catalog), columnsToDrop);
).

The old version of TextLoader had a separatorchar for a reason: it is quite annoying to have to define an array of a single character every time you use TextLoader.

In fact it is quite common in our extensions to have a "simple" scenario and an advanced options scenario with two different extensions for the two scenarios.

@artidoro
Copy link
Contributor

artidoro commented Feb 23, 2019

Actually now that I take a deeper look at it, I would keep the single char separatorChar everywhere for the simple scenario where we specify the arguments directly in the extension, while we should make sure we have char[] separators in the Options class that is used as an argument in the advanced scenario.

@wschin
Copy link
Member Author

wschin commented Feb 23, 2019

I don't quite agree. We have only one separator because all our text files are cleaned. Is this assumption applied to other places?


In reply to: 466606912 [](ancestors = 466606912)

@TomFinley
Copy link
Contributor

TomFinley commented Feb 24, 2019

Hi @wschin thanks for working on this. I will say that my initial thinking was that "multi-separator" files were fairly rare. (People know what TSV files are, people know what CSV files are, but I don't think too many people know or would be sympathetic to the idea of the "T or C SV" file.) For that reason, I think while the advanced options should continue to support multiple characters (I do know it does happen), the convenience API should keep the separator as char, not char[]. What you think?

Also including @sfilipi and @rogancarr. If they think multi-separator files are common I will desist, but otherwise I would prefer to keep the simple (and common) thing simple.

@TomFinley
Copy link
Contributor

TomFinley commented Feb 25, 2019

I don't quite agree. We have only one separator because all our text files are cleaned. Is this assumption applied to other places?

In reply to: 466606912 [](ancestors = 466606912)

I'm pretty sure. When I look back on DRI responses internally, I don't see the multi-separate capability used too often, even for user data. Sometimes it is relevant -- it was added for a reason -- but I wouldn't say commonly. Also its presence here significantly complicates the simpler scenario, since writing ',', or whatever is a lot simpler than writing new[] { ',' }.

@wschin
Copy link
Member Author

wschin commented Feb 25, 2019

Ok, so I will close this PR.


In reply to: 467074220 [](ancestors = 467074220)

@wschin wschin closed this Feb 25, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants