Skip to content

Adding Shuffle to the catalog #2427

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Feb 6, 2019
Merged

Conversation

rogancarr
Copy link
Contributor

This PR adds Shuffle to the catalog, along with a sample. It also marks RowShufflingTransform internal.

Fixes #342

@codecov
Copy link

codecov bot commented Feb 6, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@d1a2962). Click here to learn what that means.
The diff coverage is 0%.

@@            Coverage Diff            @@
##             master    #2427   +/-   ##
=========================================
  Coverage          ?   71.26%           
=========================================
  Files             ?      785           
  Lines             ?   140960           
  Branches          ?    16110           
=========================================
  Hits              ?   100462           
  Misses            ?    36026           
  Partials          ?     4472
Flag Coverage Δ
#Debug 71.26% <0%> (?)
#production 67.62% <0%> (?)
#test 85.31% <ø> (?)

namespace Microsoft.ML.Samples.Dynamic
{
/// <summary>
/// Sample class showing how to use <see cref="DataOperationsCatalog.Shuffle"/>.
Copy link
Member

@sfilipi sfilipi Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[](start = 40, length = 43)

this won't create a link, since it will be embeeded in the code section of the site; if you want a link you can paste it as a link.

not a big deal, but In general, i don;t think i see XML style comments in the API reference samples; it is just simple comments.
Example: https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/break #Resolved

/// A pool of <paramref name="poolRows"/> rows will be constructed from the first <paramref name="poolRows"/> rows
/// in the dataset. Rows will then be randomly yielded from the pool and replaced with the next row in the dataset
/// until all the rows have been yielded.
/// If the <paramref name="input"/> is shufflable, then it will read into the pool with a shuffled cursor.
Copy link
Member

@sfilipi sfilipi Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cursor [](start = 107, length = 6)

maybe cref as well? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I've seen in the docs, we use paramref for parameters.


In reply to: 254149833 [](ancestors = 254149833)

{
PoolRows = poolRows,
PoolOnly = poolOnly,
ForceShuffle = true,
Copy link
Member

@sfilipi sfilipi Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true [](start = 31, length = 4)

shall this be a argument of Shuffle? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are generally trying to remove parameters from the input, and this is a confusing one to explain, as it really gets into the internals of RowShufflingTransform. I can get all the same behavior out of the RowShufflingTransform by setting this to true and tuning the others, so I'd prefer to leave it out.


In reply to: 254150113 [](ancestors = 254150113)

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

/// Shuffle the rows of <paramref name="input"/>.
/// </summary>
/// <remarks>
/// <see cref="Shuffle"/> will shuffle the rows of any input <see cref="IDataView"/> by using a streaming pool.
Copy link
Contributor

@artidoro artidoro Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pool [](start = 114, length = 4)

Maybe it's just my ignorance of English, but would it be possible to further explain what we mean by pool? I had opened the dictionary to see that it meant group. Maybe changing to "A group of <..." #Resolved

/// </example>
public IDataView Shuffle(IDataView input,
uint? seed = null,
int poolRows = RowShufflingTransformer.Defaults.PoolRows,
Copy link
Contributor

@artidoro artidoro Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

poolRows [](start = 16, length = 8)

Could we rename this parameter to numPoolRows? I think it would help make the doc above more easy to read #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about shufflePoolSize?


In reply to: 254187809 [](ancestors = 254187809)

/// A pool of <paramref name="poolRows"/> rows will be constructed from the first <paramref name="poolRows"/> rows
/// in the dataset. Rows will then be randomly yielded from the pool and replaced with the next row in the dataset
/// until all the rows have been yielded.
/// If the <paramref name="input"/> is shufflable, then it will read into the pool with a shuffled cursor.
Copy link
Contributor

@artidoro artidoro Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// If the is shufflable, then it will read into the pool with a shuffled cursor. [](start = 8, length = 106)

I am sorry but what does it mean for the input to be shufflable? When would it not be? Also it might be best not to talk about cursors unless it is necessary I am afraid it might lead to quite a bit of confusion for people who are used to thinking of numpy arrays. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this more explicit by calling out the IDV CanShuffle property. The docs will now link to IDV and CanShuffle.


In reply to: 254189367 [](ancestors = 254189367)


/// <summary>
/// Shuffle the rows of <paramref name="input"/>.
/// </summary>
Copy link
Contributor

@artidoro artidoro Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the below comment I don't understand if the output IDataView will be of size poolRows or if it will be of the same size as the input IDataView. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about now? I rewrote the remarks to be a bit more explicit on what was happening.


In reply to: 254190313 [](ancestors = 254190313)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very clear now thanks for explaining further!


In reply to: 254337002 [](ancestors = 254337002,254190313)

bool shuffleSource = !RowShufflingTransformer.Defaults.PoolOnly)
{
Environment.CheckValue(input, nameof(input));
Environment.CheckUserArg(shufflePoolSize > 0, nameof(shufflePoolSize), "Pool size must be positive");
Copy link
Contributor

@TomFinley TomFinley Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Pool size must be positive" [](start = 83, length = 28)

Prefer the simpler message "Must be positive." That you're talking about the parameter shufflePoolSize is already established by the prior nameof. #Resolved

PoolRows = shufflePoolSize,
PoolOnly = !shuffleSource,
ForceShuffle = true,
ForceShuffleSeed = (int)seed
Copy link
Contributor

@TomFinley TomFinley Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(int)seed [](start = 35, length = 9)

If this is left null, which is the default value, this cast will throw an exception. I am pretty sure that you meant to do something like seed ?? ...get random seed from Environment... or somesuch. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also changed the parameter to BootstrapSample to be a int? and cast to uint?.


In reply to: 254346714 [](ancestors = 254346714)

/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[Shuffle](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/DataOperations/Shuffle.cs)]
/// ]]>
Copy link
Contributor

@TomFinley TomFinley Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually works? That's fantastic if so. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it pulls the examples into the docs.microsoft.com pages. Pretty cool.


In reply to: 254346894 [](ancestors = 254346894)

/// <param name="seed">The random seed. If unspecified, the random state will be instead derived from the <see cref="MLContext"/>.</param>
/// <param name="shufflePoolSize">The number of rows to hold in the pool. Setting this to 1 will turn off pool shuffling and
/// <see cref="Shuffle"/> will only perform a shuffle by reading <paramref name="input"/> in a random order.</param>
/// <param name="shuffleSource">If false, the transform will not attempt to read <paramref name="input"/> in a random order and only use
Copy link
Contributor

@TomFinley TomFinley Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false [](start = 43, length = 5)

I think we've received feedback from .NET docs team that things like this ought to be <see langword="false" />. #Resolved

{
public const int PoolRows = 1000;
public const bool PoolOnly = false;
public const bool ForceShuffle = false;
}

public sealed class Options
[BestFriend]
internal sealed class Options
Copy link
Contributor

@TomFinley TomFinley Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we revert this to public? This is already in an internal class, so this provides no additional information. #Resolved

@@ -95,7 +97,8 @@ private static VersionInfo GetVersionInfo()
/// <param name="poolRows">The pool will have this many rows</param>
/// <param name="poolOnly">If true, the transform will not attempt to shuffle the input cursor but only shuffle based on the pool. This parameter has no effect if the input data was not itself shufflable.</param>
/// <param name="forceShuffle">If true, the transform will always provide a shuffled view.</param>
public RowShufflingTransformer(IHostEnvironment env,
[BestFriend]
internal RowShufflingTransformer(IHostEnvironment env,
Copy link
Contributor

@TomFinley TomFinley Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal [](start = 8, length = 8)

Similar notes. there is no point in internalizing these. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we had a drive to internalize constructors of ITransformer implementations, not IDataTransform implementations. (The goal there is to just internalize them altogether, as we see in this case has already been done.)


In reply to: 254349247 [](ancestors = 254349247)

@@ -149,7 +153,8 @@ private RowShufflingTransformer(IHost host, ModelLoadContext ctx, IDataView inpu
_subsetInput = SelectCachableColumns(input, host);
}

public static RowShufflingTransformer Create(IHostEnvironment env, ModelLoadContext ctx, IDataView input)
[BestFriend]
internal static RowShufflingTransformer Create(IHostEnvironment env, ModelLoadContext ctx, IDataView input)
Copy link
Contributor

@TomFinley TomFinley Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal [](start = 8, length = 8)

similar #Resolved

@@ -222,7 +227,8 @@ internal static bool CanShuffleAll(Schema schema)
/// <summary>
/// Utility to take a cursor, and get a shuffled version of this cursor.
/// </summary>
public static RowCursor GetShuffledCursor(IChannelProvider provider, int poolRows, RowCursor cursor, Random rand)
[BestFriend]
internal static RowCursor GetShuffledCursor(IChannelProvider provider, int poolRows, RowCursor cursor, Random rand)
Copy link
Contributor

@TomFinley TomFinley Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal [](start = 8, length = 8)

similar #Resolved

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@rogancarr rogancarr merged commit 29348f9 into dotnet:master Feb 6, 2019
@rogancarr rogancarr deleted the 342_shuffle branch February 6, 2019 20:02
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants