-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Adding Shuffle to the catalog #2427
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… adding a sample.
Codecov Report
@@ Coverage Diff @@
## master #2427 +/- ##
=========================================
Coverage ? 71.26%
=========================================
Files ? 785
Lines ? 140960
Branches ? 16110
=========================================
Hits ? 100462
Misses ? 36026
Partials ? 4472
|
namespace Microsoft.ML.Samples.Dynamic | ||
{ | ||
/// <summary> | ||
/// Sample class showing how to use <see cref="DataOperationsCatalog.Shuffle"/>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 40, length = 43)
this won't create a link, since it will be embeeded in the code section of the site; if you want a link you can paste it as a link.
not a big deal, but In general, i don;t think i see XML style comments in the API reference samples; it is just simple comments.
Example: https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/break #Resolved
/// A pool of <paramref name="poolRows"/> rows will be constructed from the first <paramref name="poolRows"/> rows | ||
/// in the dataset. Rows will then be randomly yielded from the pool and replaced with the next row in the dataset | ||
/// until all the rows have been yielded. | ||
/// If the <paramref name="input"/> is shufflable, then it will read into the pool with a shuffled cursor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cursor [](start = 107, length = 6)
maybe cref as well? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I've seen in the docs, we use paramref for parameters.
In reply to: 254149833 [](ancestors = 254149833)
{ | ||
PoolRows = poolRows, | ||
PoolOnly = poolOnly, | ||
ForceShuffle = true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true [](start = 31, length = 4)
shall this be a argument of Shuffle? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are generally trying to remove parameters from the input, and this is a confusing one to explain, as it really gets into the internals of RowShufflingTransform. I can get all the same behavior out of the RowShufflingTransform by setting this to true and tuning the others, so I'd prefer to leave it out.
In reply to: 254150113 [](ancestors = 254150113)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Shuffle the rows of <paramref name="input"/>. | ||
/// </summary> | ||
/// <remarks> | ||
/// <see cref="Shuffle"/> will shuffle the rows of any input <see cref="IDataView"/> by using a streaming pool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pool [](start = 114, length = 4)
Maybe it's just my ignorance of English, but would it be possible to further explain what we mean by pool? I had opened the dictionary to see that it meant group. Maybe changing to "A group of <..." #Resolved
/// </example> | ||
public IDataView Shuffle(IDataView input, | ||
uint? seed = null, | ||
int poolRows = RowShufflingTransformer.Defaults.PoolRows, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poolRows [](start = 16, length = 8)
Could we rename this parameter to numPoolRows
? I think it would help make the doc above more easy to read #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// A pool of <paramref name="poolRows"/> rows will be constructed from the first <paramref name="poolRows"/> rows | ||
/// in the dataset. Rows will then be randomly yielded from the pool and replaced with the next row in the dataset | ||
/// until all the rows have been yielded. | ||
/// If the <paramref name="input"/> is shufflable, then it will read into the pool with a shuffled cursor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// If the is shufflable, then it will read into the pool with a shuffled cursor. [](start = 8, length = 106)
I am sorry but what does it mean for the input to be shufflable? When would it not be? Also it might be best not to talk about cursors unless it is necessary I am afraid it might lead to quite a bit of confusion for people who are used to thinking of numpy arrays. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this more explicit by calling out the IDV CanShuffle property. The docs will now link to IDV and CanShuffle.
In reply to: 254189367 [](ancestors = 254189367)
|
||
/// <summary> | ||
/// Shuffle the rows of <paramref name="input"/>. | ||
/// </summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the below comment I don't understand if the output IDataView
will be of size poolRows
or if it will be of the same size as the input IDataView
. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about now? I rewrote the remarks to be a bit more explicit on what was happening.
In reply to: 254190313 [](ancestors = 254190313)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's very clear now thanks for explaining further!
In reply to: 254337002 [](ancestors = 254337002,254190313)
bool shuffleSource = !RowShufflingTransformer.Defaults.PoolOnly) | ||
{ | ||
Environment.CheckValue(input, nameof(input)); | ||
Environment.CheckUserArg(shufflePoolSize > 0, nameof(shufflePoolSize), "Pool size must be positive"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Pool size must be positive" [](start = 83, length = 28)
Prefer the simpler message "Must be positive."
That you're talking about the parameter shufflePoolSize
is already established by the prior nameof
. #Resolved
PoolRows = shufflePoolSize, | ||
PoolOnly = !shuffleSource, | ||
ForceShuffle = true, | ||
ForceShuffleSeed = (int)seed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(int)seed [](start = 35, length = 9)
If this is left null
, which is the default value, this cast will throw an exception. I am pretty sure that you meant to do something like seed ?? ...get random seed from Environment...
or somesuch. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also changed the parameter to BootstrapSample to be a int?
and cast to uint?
.
In reply to: 254346714 [](ancestors = 254346714)
/// <format type="text/markdown"> | ||
/// <] | ||
/// ]]> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually works? That's fantastic if so. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it pulls the examples into the docs.microsoft.com pages. Pretty cool.
In reply to: 254346894 [](ancestors = 254346894)
/// <param name="seed">The random seed. If unspecified, the random state will be instead derived from the <see cref="MLContext"/>.</param> | ||
/// <param name="shufflePoolSize">The number of rows to hold in the pool. Setting this to 1 will turn off pool shuffling and | ||
/// <see cref="Shuffle"/> will only perform a shuffle by reading <paramref name="input"/> in a random order.</param> | ||
/// <param name="shuffleSource">If false, the transform will not attempt to read <paramref name="input"/> in a random order and only use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
false [](start = 43, length = 5)
I think we've received feedback from .NET docs team that things like this ought to be <see langword="false" />
. #Resolved
{ | ||
public const int PoolRows = 1000; | ||
public const bool PoolOnly = false; | ||
public const bool ForceShuffle = false; | ||
} | ||
|
||
public sealed class Options | ||
[BestFriend] | ||
internal sealed class Options |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we revert this to public? This is already in an internal class, so this provides no additional information. #Resolved
@@ -95,7 +97,8 @@ private static VersionInfo GetVersionInfo() | |||
/// <param name="poolRows">The pool will have this many rows</param> | |||
/// <param name="poolOnly">If true, the transform will not attempt to shuffle the input cursor but only shuffle based on the pool. This parameter has no effect if the input data was not itself shufflable.</param> | |||
/// <param name="forceShuffle">If true, the transform will always provide a shuffled view.</param> | |||
public RowShufflingTransformer(IHostEnvironment env, | |||
[BestFriend] | |||
internal RowShufflingTransformer(IHostEnvironment env, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal [](start = 8, length = 8)
Similar notes. there is no point in internalizing these. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we had a drive to internalize constructors of ITransformer
implementations, not IDataTransform
implementations. (The goal there is to just internalize them altogether, as we see in this case has already been done.)
In reply to: 254349247 [](ancestors = 254349247)
@@ -149,7 +153,8 @@ private RowShufflingTransformer(IHost host, ModelLoadContext ctx, IDataView inpu | |||
_subsetInput = SelectCachableColumns(input, host); | |||
} | |||
|
|||
public static RowShufflingTransformer Create(IHostEnvironment env, ModelLoadContext ctx, IDataView input) | |||
[BestFriend] | |||
internal static RowShufflingTransformer Create(IHostEnvironment env, ModelLoadContext ctx, IDataView input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal [](start = 8, length = 8)
similar #Resolved
@@ -222,7 +227,8 @@ internal static bool CanShuffleAll(Schema schema) | |||
/// <summary> | |||
/// Utility to take a cursor, and get a shuffled version of this cursor. | |||
/// </summary> | |||
public static RowCursor GetShuffledCursor(IChannelProvider provider, int poolRows, RowCursor cursor, Random rand) | |||
[BestFriend] | |||
internal static RowCursor GetShuffledCursor(IChannelProvider provider, int poolRows, RowCursor cursor, Random rand) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal [](start = 8, length = 8)
similar #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR adds
Shuffle
to the catalog, along with a sample. It also marksRowShufflingTransform
internal
.Fixes #342