Towards #3204 - Documentation for MLContext.Transforms.Categorical #3388

Merged
merged 6 commits into from
Apr 21, 2019
Changes from 5 commits
2 changes: 1 addition & 1 deletion src/Microsoft.ML.Data/Transforms/ColumnCopying.cs
Expand Up @@ -86,7 +86,7 @@ public override SchemaShape GetOutputSchema(SchemaShape inputSchema)
}

/// <summary>
/// <see cref="ITransformer"/> resulting from fitting an <see cref="ColumnCopyingEstimator"/>.
/// <see cref="ITransformer"/> resulting from fitting a <see cref="ColumnCopyingEstimator"/>.
/// </summary>
public sealed class ColumnCopyingTransformer : OneToOneTransformerBase
{
Expand Up @@ -24,7 +24,7 @@ public static class ConversionsExtensionsCatalog
/// </summary>
/// <param name="catalog">The conversion transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be a vector of <see cref="System.UInt32"/>, or a scalar <see cref="System.UInt32"/> based on whether the input column data types
/// This column's data type will be a vector of keys, or a scalar key, based on whether the input column data types
/// are vectors or scalars.</param>
/// <param name="inputColumnName">Name of the column whose data will be hashed.
/// If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
2 changes: 1 addition & 1 deletion src/Microsoft.ML.Data/Transforms/Hashing.cs
Expand Up @@ -1113,7 +1113,7 @@ public override void Process()
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes, if the mapping of the hashes to the values is required. |
/// | Input column data type | Vector or scalars of numeric, boolean, [text](xref:Microsoft.ML.Data.TextDataViewType), [DateTime](xref: System.DateTime) and [key](xref:Microsoft.ML.Data.KeyDataViewType) data types.|
/// | Output column data type | Vector or scalar [System.Int32](xref:System.Int32).|
/// | Output column data type | Vector or scalar [key](xref:Microsoft.ML.Data.KeyDataViewType)|
///
/// ]]></format>
/// </remarks>
Expand Up @@ -21,8 +21,8 @@ namespace Microsoft.ML.Transforms
/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | Scalar numeric, boolean, [text](xref:Microsoft.ML.Data.TextDataViewType), [System.DateTime](xref:System.DateTime) or [key](xref:Microsoft.ML.Data.KeyDataViewType) data types.|
/// | Output column data type | [key](xref:Microsoft.ML.Data.KeyDataViewType)|
/// | Input column data type | Scalar or vector of numeric, boolean, [text](xref:Microsoft.ML.Data.TextDataViewType), [System.DateTime](xref:System.DateTime) and [key](xref:Microsoft.ML.Data.KeyDataViewType) data types.|
/// | Output column data type | Scalar or vector of [key](xref:Microsoft.ML.Data.KeyDataViewType)|
///
/// The ValueToKeyMappingEstimator builds up term vocabularies (dictionaries) mapping the input values to the keys in the dictionary.
/// If multiple columns are used, each column builds/uses exactly one vocabulary.
61 changes: 42 additions & 19 deletions src/Microsoft.ML.Transforms/CategoricalCatalog.cs
Expand Up @@ -15,15 +15,22 @@ namespace Microsoft.ML
public static class CategoricalCatalog
{
/// <summary>
/// Convert text columns into one-hot encoded vectors.
/// Create a <see cref="OneHotEncodingEstimator"/>, which converts the input column specified by <paramref name="inputColumnName"/>
/// into a column of one-hot encoded vectors named <paramref name="outputColumnName"/>.
/// </summary>
/// <param name="catalog">The transform catalog</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="outputKind">Output kind: Bag (multi-set vector), Ind (indicator vector), Key (index), or Binary encoded indicator vector.</param>
/// <param name="catalog">The transform catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be a vector of <see cref="System.Single"/> if <paramref name="outputKind"/> is
/// <see cref="OneHotEncodingEstimator.OutputKind.Bag"/>, <see cref="OneHotEncodingEstimator.OutputKind.Indicator"/>, or <see cref="OneHotEncodingEstimator.OutputKind.Binary"/>.
/// If <paramref name="outputKind"/> is <see cref="OneHotEncodingEstimator.OutputKind.Key"/>, this column's data type will be a key in the case of a scalar input column
/// or a vector of keys in the case of a vector input column.</param>
/// <param name="inputColumnName">Name of column to convert to one-hot vectors. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/>
/// will be used as source. This column's data type can be scalar or vector of numeric, text, boolean, <see cref="System.DateTime"/> or <see cref="System.DateTimeOffset"/>.</param>
/// <param name="outputKind">Output kind: Bag (multi-set vector), Indicator (indicator vector), Key (index), or Binary encoded indicator vector.</param>
/// <param name="maximumNumberOfKeys">Maximum number of terms to keep per column when auto-training.</param>
/// <param name="keyOrdinality">How items should be ordered when vectorized. If <see cref="ValueToKeyMappingEstimator.KeyOrdinality.ByOccurrence"/> choosen they will be in the order encountered.
/// If <see cref="ValueToKeyMappingEstimator.KeyOrdinality.ByValue"/>, items are sorted according to their default comparison, for example, text sorting will be case sensitive (for example, 'A' then 'Z' then 'a').</param>
/// <param name="keyOrdinality">How items should be ordered when vectorized. If <see cref="ValueToKeyMappingEstimator.KeyOrdinality.ByOccurrence"/>
/// is chosen, they will be in the order encountered. If <see cref="ValueToKeyMappingEstimator.KeyOrdinality.ByValue"/>,
/// items are sorted according to their default comparison, for example, text sorting will be case sensitive (for example, 'A' then 'Z' then 'a').</param>
/// <param name="keyData">Specifies an ordering for the encoding. If specified, this should be a single column data view,
/// and the key-values will be taken from that column. If unspecified, the ordering will be determined from the input data upon fitting.</param>
/// <example>
Expand All @@ -43,14 +50,19 @@ public static OneHotEncodingEstimator OneHotEncoding(this TransformsCatalog.Cate
new[] { new OneHotEncodingEstimator.ColumnOptions(outputColumnName, inputColumnName, outputKind, maximumNumberOfKeys, keyOrdinality) }, keyData);

/// <summary>
/// Convert text columns into one-hot encoded vectors.
/// Create a <see cref="OneHotEncodingEstimator"/>, which converts one or more input text columns specified in <paramref name="columns"/>
/// into as many columns of one-hot encoded vectors.
/// </summary>
/// <param name="catalog">The transform catalog</param>
/// <param name="columns">Specifies the names of the columns on which to apply the transformation.</param>
/// <param name="catalog">The transform catalog.</param>
/// <param name="columns">The pairs of input and output columns. The output columns' data type will be a vector of <see cref="System.Single"/> if <paramref name="outputKind"/> is
/// <see cref="OneHotEncodingEstimator.OutputKind.Bag"/>, <see cref="OneHotEncodingEstimator.OutputKind.Indicator"/>, or <see cref="OneHotEncodingEstimator.OutputKind.Binary"/>.
/// If <paramref name="outputKind"/> is <see cref="OneHotEncodingEstimator.OutputKind.Key"/>, the output columns' data type will be a key in the case of a scalar input column
/// or a vector of keys in the case of a vector input column.</param>
/// <param name="outputKind">Output kind: Bag (multi-set vector), Ind (indicator vector), Key (index), or Binary encoded indicator vector.</param>
/// <param name="maximumNumberOfKeys">Maximum number of terms to keep per column when auto-training.</param>
/// <param name="keyOrdinality">How items should be ordered when vectorized. If <see cref="ValueToKeyMappingEstimator.KeyOrdinality.ByOccurrence"/> choosen they will be in the order encountered.
/// If <see cref="ValueToKeyMappingEstimator.KeyOrdinality.ByValue"/>, items are sorted according to their default comparison, for example, text sorting will be case sensitive (for example, 'A' then 'Z' then 'a').</param>
/// <param name="keyOrdinality">How items should be ordered when vectorized. If <see cref="ValueToKeyMappingEstimator.KeyOrdinality.ByOccurrence"/>
/// is chosen, they will be in the order encountered. If <see cref="ValueToKeyMappingEstimator.KeyOrdinality.ByValue"/>,
/// items are sorted according to their default comparison, for example, text sorting will be case sensitive (for example, 'A' then 'Z' then 'a').</param>
/// <param name="keyData">Specifies an ordering for the encoding. If specified, this should be a single column data view,
/// and the key-values will be taken from that column. If unspecified, the ordering will be determined from the input data upon fitting.</param>
/// <example>
Expand Down Expand Up @@ -96,17 +108,24 @@ internal static OneHotEncodingEstimator OneHotEncoding(this TransformsCatalog.Ca
=> new OneHotEncodingEstimator(CatalogUtils.GetEnvironment(catalog), columns, keyData);

/// <summary>
/// Convert a text column into hash-based one-hot encoded vector.
/// Create a <see cref="OneHotHashEncodingEstimator"/>, which converts a text column specified by <paramref name="inputColumnName"/>
/// into a hash-based one-hot encoded vector column named <paramref name="outputColumnName"/>.
/// </summary>
/// <param name="catalog">The transform catalog</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be a vector of <see cref="System.Single"/> if <paramref name="outputKind"/> is
/// <see cref="OneHotEncodingEstimator.OutputKind.Bag"/>, <see cref="OneHotEncodingEstimator.OutputKind.Indicator"/>, or <see cref="OneHotEncodingEstimator.OutputKind.Binary"/>.
/// If <paramref name="outputKind"/> is <see cref="OneHotEncodingEstimator.OutputKind.Key"/>, this column's data type will be a key in the case of a scalar input column
/// or a vector of keys in the case of a vector input column.</param>
/// <param name="inputColumnName">Name of column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
/// This column's data type can be scalar or vector of numeric, text, boolean, <see cref="System.DateTime"/> or <see cref="System.DateTimeOffset"/>.</param>
/// <param name="outputKind">The conversion mode.</param>
/// <param name="numberOfBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param>
/// <param name="seed">Hashing seed.</param>
/// <param name="useOrderedHashing">Whether the position of each term should be included in the hash.</param>
/// <param name="maximumNumberOfInverts">During hashing we construct mappings between original values and the produced hash values.
/// Text representation of original values are stored in the slot names of the metadata for the new column.Hashing, as such, can map many initial values to one.
/// Text representations of the original values are stored in the slot names of the metadata for the new column. Hashing,
/// as such, can map many initial values to one.
/// <paramref name="maximumNumberOfInverts"/> specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
/// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
/// <example>
Expand All @@ -127,16 +146,20 @@ public static OneHotHashEncodingEstimator OneHotHashEncoding(this TransformsCata
new[] { new OneHotHashEncodingEstimator.ColumnOptions(outputColumnName, inputColumnName, outputKind, numberOfBits, seed, useOrderedHashing, maximumNumberOfInverts) });

/// <summary>
/// Convert text columns into hash-based one-hot encoded vector columns.
/// Create a <see cref="OneHotHashEncodingEstimator"/>, which converts one or more input text columns specified by <paramref name="columns"/>
/// into as many columns of hash-based one-hot encoded vectors.
/// </summary>
/// <param name="catalog">The transform catalog</param>
/// <param name="columns">Specifies the names of the columns on which to apply the transformation.</param>
/// <param name="columns">The pairs of input and output columns. The output columns' data type will be a vector of <see cref="System.Single"/> if <paramref name="outputKind"/> is
/// <see cref="OneHotEncodingEstimator.OutputKind.Bag"/>, <see cref="OneHotEncodingEstimator.OutputKind.Indicator"/>, or <see cref="OneHotEncodingEstimator.OutputKind.Binary"/>.
/// If <paramref name="outputKind"/> is <see cref="OneHotEncodingEstimator.OutputKind.Key"/>, the output columns' data type will be a key in the case of a scalar input column
/// or a vector of keys in the case of a vector input column.</param>
/// <param name="outputKind">The conversion mode.</param>
/// <param name="numberOfBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param>
/// <param name="seed">Hashing seed.</param>
/// <param name="useOrderedHashing">Whether the position of each term should be included in the hash.</param>
/// <param name="maximumNumberOfInverts">During hashing we construct mappings between original values and the produced hash values.
/// Text representation of original values are stored in the slot names of the metadata for the new column.Hashing, as such, can map many initial values to one.
/// Text representations of the original values are stored in the slot names of the metadata for the new column. Hashing, as such, can map many initial values to one.
/// <paramref name="maximumNumberOfInverts"/> specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
/// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
/// <example>
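The hashing parameters documented above (`numberOfBits`, `seed`, `maximumNumberOfInverts`) can be illustrated with a minimal sketch. It is not the ML.NET implementation: the class `HashEncodingSketch`, its method names, and the FNV-1a hash are stand-ins chosen for illustration, and it assumes the hash is simply reduced to its low `numberOfBits` bits. It shows why several input values can land in one slot and how an invert limit would bound what is retained per slot.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; none of this is ML.NET code.
public static class HashEncodingSketch
{
    // FNV-1a over the UTF-8 bytes, used here as a stand-in hash function.
    public static uint Fnv1a(string text, uint seed)
    {
        unchecked
        {
            uint hash = 2166136261u ^ seed;
            foreach (byte b in Encoding.UTF8.GetBytes(text))
            {
                hash ^= b;
                hash *= 16777619u;
            }
            return hash;
        }
    }

    // Reduce the hash to numberOfBits bits, giving a slot in [0, 2^numberOfBits).
    public static int Bucket(string value, int numberOfBits, uint seed = 0)
    {
        if (numberOfBits < 1 || numberOfBits > 30)
            throw new ArgumentOutOfRangeException(nameof(numberOfBits));
        uint mask = (1u << numberOfBits) - 1;
        return (int)(Fnv1a(value, seed) & mask);
    }

    // Record which original values map to each slot.
    // 0 retains nothing, -1 retains everything, n > 0 retains at most n values per slot.
    public static Dictionary<int, List<string>> BuildInverts(
        IEnumerable<string> values, int numberOfBits, int maximumNumberOfInverts)
    {
        var inverts = new Dictionary<int, List<string>>();
        if (maximumNumberOfInverts == 0)
            return inverts;

        foreach (var v in values)
        {
            int bucket = Bucket(v, numberOfBits);
            if (!inverts.TryGetValue(bucket, out var list))
                inverts[bucket] = list = new List<string>();

            bool hasRoom = maximumNumberOfInverts == -1 || list.Count < maximumNumberOfInverts;
            if (hasRoom && !list.Contains(v))
                list.Add(v);
        }
        return inverts;
    }
}
```

With `numberOfBits` set to 1 there are only two slots, so any three distinct values must collide somewhere; an invert limit of 1 would then keep at most one original value per slot.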
53 changes: 51 additions & 2 deletions src/Microsoft.ML.Transforms/OneHotEncoding.cs
Expand Up @@ -22,7 +22,9 @@

namespace Microsoft.ML.Transforms
{
/// <include file='doc.xml' path='doc/members/member[@name="CategoricalOneHotVectorizer"]/*' />
/// <summary>
/// <see cref="ITransformer"/> resulting from fitting a <see cref="OneHotEncodingEstimator"/>.
/// </summary>
public sealed class OneHotEncodingTransformer : ITransformer
{
internal sealed class Column : ValueToKeyMappingTransformer.ColumnBase
Expand Down Expand Up @@ -141,8 +143,55 @@ internal OneHotEncodingTransformer(ValueToKeyMappingEstimator term, IEstimator<I
IRowToRowMapper ITransformer.GetRowToRowMapper(DataViewSchema inputSchema) => ((ITransformer)_transformer).GetRowToRowMapper(inputSchema);
}
/// <summary>
/// Estimator which takes set of columns and produce for each column indicator array.
/// Converts one or more input columns of categorical values into as many output columns of one-hot encoded vectors.
/// </summary>
/// <remarks>
/// <format type="text/markdown"><![CDATA[
///
/// ### Estimator Characteristics
/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | Vector or scalar of numeric, boolean, [text](xref:Microsoft.ML.Data.TextDataViewType), <xref:System.DateTime> or [key](xref:Microsoft.ML.Data.KeyDataViewType) |
/// | Output column data type | Scalar or vector of [key](xref:Microsoft.ML.Data.KeyDataViewType), or vector of <xref:System.Single> |
///
/// The <xref:Microsoft.ML.Transforms.OneHotEncodingEstimator> builds a dictionary of unique values appearing in the input column.
/// The resulting <xref:Microsoft.ML.Transforms.OneHotEncodingTransformer> converts one or more input columns into as many output
/// columns of one-hot encoded vectors.
///
/// The <xref:Microsoft.ML.Transforms.OneHotEncodingEstimator> is often used to convert categorical data into a form that can be
/// provided to a machine learning algorithm.
///
/// The output of this transform is specified by <xref:Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind>:
///
/// - <xref:Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind.Indicator> produces an [indicator vector](https://en.wikipedia.org/wiki/Indicator_vector).
Review comment (Member): just checking: is this valid for markdown lists? I've always used *

Reply (Member Author): Both produce the same bullet

/// Each slot in this vector corresponds to a category in the dictionary, so its length is the size of the built dictionary.
/// If a value is not found in the dictionary, the output is the zero vector.
///
/// - <xref:Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind.Bag> produces one vector such that each slot stores the number
/// of occurrences of the corresponding value in the input vector.
/// Each slot in this vector corresponds to a value in the dictionary, so its length is the size of the built dictionary.
/// <xref:Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind.Indicator> and <xref:Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind.Bag>
/// differ simply in how the bit-vectors generated from individual slots in the input column are aggregated:
/// for Indicator they are concatenated and for Bag they are added. When the source column is a Scalar, the Indicator and Bag options are identical.
///
/// - <xref:Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind.Key> produces keys in a <xref:Microsoft.ML.Data.KeyDataViewType> column.
/// If the input column is a vector, the output contains a vector of [keys](xref:Microsoft.ML.Data.KeyDataViewType), where each slot of the
/// vector corresponds to the respective slot of the input vector.
/// If a category is not found in the built dictionary, it is assigned the value zero.
Review comment (Member, @sfilipi, Apr 21, 2019), on "zero": suggest adding: which represents the key missing from the dictionary.

///
/// - <xref:Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind.Binary> produces a binary encoded vector to represent the values found in the dictionary
/// that are present in the input column. If a value in the input column is not found in the dictionary, the output is the zero vector.
///
/// The OneHotEncodingTransformer can be applied to one or more columns, in which case it builds and uses a separate dictionary
/// for each column that it is applied to.
///
/// See the See Also section for links to examples of the usage.
/// ]]>
/// </format>
/// </remarks>
/// <seealso cref="CategoricalCatalog.OneHotEncoding(TransformsCatalog.CategoricalTransforms, InputOutputColumnPair[], OneHotEncodingEstimator.OutputKind, int, ValueToKeyMappingEstimator.KeyOrdinality, IDataView)"/>
/// <seealso cref="CategoricalCatalog.OneHotEncoding(TransformsCatalog.CategoricalTransforms, string, string, OneHotEncodingEstimator.OutputKind, int, ValueToKeyMappingEstimator.KeyOrdinality, IDataView)"/>
public sealed class OneHotEncodingEstimator : IEstimator<OneHotEncodingTransformer>
{
[BestFriend]
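The `OutputKind` descriptions in the remarks above can be made concrete with a small sketch. It is not the ML.NET implementation: the class `OneHotSketch` and its methods are hypothetical, it handles only text values, and it assumes keys are 1-based in order of first occurrence with 0 reserved for values missing from the dictionary. It illustrates how Key, Indicator, and Bag differ for a vector input, following the description that Indicator concatenates the per-slot one-hot vectors while Bag adds them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; none of this is ML.NET code.
public static class OneHotSketch
{
    // Map each distinct value to a 1-based key in order of first occurrence.
    // Key 0 is reserved for values that are not in the dictionary.
    public static Dictionary<string, uint> BuildDictionary(IEnumerable<string> values)
    {
        var dict = new Dictionary<string, uint>();
        foreach (var v in values)
            if (!dict.ContainsKey(v))
                dict[v] = (uint)dict.Count + 1;
        return dict;
    }

    // Key: one key per input slot, 0 for unseen values.
    public static uint[] ToKeys(Dictionary<string, uint> dict, string[] input)
        => input.Select(v => dict.TryGetValue(v, out var k) ? k : 0u).ToArray();

    // Indicator: one one-hot block per input slot, concatenated.
    public static float[] ToIndicator(Dictionary<string, uint> dict, string[] input)
    {
        var result = new float[input.Length * dict.Count];
        for (int slot = 0; slot < input.Length; slot++)
            if (dict.TryGetValue(input[slot], out var key))
                result[slot * dict.Count + (int)key - 1] = 1f;
        return result;
    }

    // Bag: the per-slot one-hot blocks added together, giving occurrence counts.
    public static float[] ToBag(Dictionary<string, uint> dict, string[] input)
    {
        var result = new float[dict.Count];
        foreach (var v in input)
            if (dict.TryGetValue(v, out var key))
                result[(int)key - 1] += 1f;
        return result;
    }
}
```

For a dictionary built from `red, green, red, blue`, the input `{ "green", "pink" }` yields keys `{ 2, 0 }`, the input `{ "blue", "red" }` yields the indicator vector `{ 0, 0, 1, 1, 0, 0 }`, and the input `{ "red", "red", "blue" }` yields the bag `{ 2, 0, 1 }`. For a single-slot input the Indicator and Bag outputs coincide, which matches the remark about scalar source columns.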