Schema comprehension doc #572
@dotnet-bot test Linux Release
@Zruty0 - we are in the middle of moving our CI system from Jenkins to VSTS. You can ignore those 2 failed runs. (Plus you are just modifying .md files anyway.) #Resolved
Thanks for this helpful document, @Zruty0. It looks really good.
docs/code/SchemaComprehension.md
Outdated
### Streaming data views

What if the original data doesn't support seeking, kile if it's some form of `IEnumerable<IrisData>` instead of `IList<IrisData>`? Well, we can simply use another helper function:
(typo) "kile" #Resolved
docs/code/SchemaComprehension.md
Outdated
Let's see how we can create a new `IDataView` out of an in-memory array, run some operations on it, and then read it back into the array.

`` ```(csharp) ``
I'm not sure what exactly about your string isn't working, but I don't get syntax highlighting when viewing the document.
Typically, I use the format `` ```C# `` instead. #Resolved
docs/code/SchemaComprehension.md
Outdated
Below are the most notable examples of the differences:

* `IDataView` vector columns may have a fixed (and known) size, C# arrays can not. You can use `[VectorType(N)]` attribute to an array field to specify that the column is a vector of fixed size N. This is often necessary: most ML components don't work with variable-size vectors, they require fixed-size ones.
* `IDataView`'s **key types** don't have an underlying C# type either. To declare a key-type column, you need to make your field an `uint`, and decorate it with `[KeyType(Min=A, Count=B)]` to denote that the field is a key with the specified range of values. |
This makes it sound like `Min` and `Count` are required values, but they do not appear to be. (And I'm assuming there are plenty of scenarios where a user doesn't know up front what all the possible values are.) #Resolved
docs/code/SchemaComprehension.md
Outdated
```csharp
var predictionEngine = env.CreatePredictionEngine<IrisData, IrisVectorData>(dv, outputSchemaDefinition: schemaDef);
```

In addition to the above, you can use `SchemaDefinition` to add per-column metadata, or even a 'value generator' (so that the column value is not read from the field, but computed using a delegate). |
It might be interesting to have a code snippet example for this scenario? #Resolved
In fact, I think the 'generator' bit didn't make it into ML.NET.
In reply to: 204536819
docs/code/SchemaComprehension.md
Outdated
* Reading a different subset of columns on every row: the cursor always populates the entire row object.
* Reading column metadata from the data view.
* Accessing the 'hidden' data view columns by index.
* Creating 'cursor sets'. |
A link or definition of `cursor sets` may be helpful here. #Resolved
I hope they will still go away, because they block the merging.
In reply to: 407177294
@@ -0,0 +1,210 @@
# Schema comprehension in ML.NET

This document describes in detail the under-the-hood mechanism that ML.NET uses to automate the creation of `IDataView` schema, with the goal to make it as convenient to the end user as possible, while not incurring extra computational costs. |
Re "`IDataView` schema": Might be useful to link to the IDV doc. #Closed
docs/code/SchemaComprehension.md
Outdated
## Introduction

Every dataset in ML.NET is an `IDataView`, which is, for the purposes of this document, a collection of rows that share the same columns. The set of columns, their names, types and other metadata is known as the *schema* of the `IDataView`, and it's represented as an `ISchema` object. |
Re "is an": Would it be clearer if it said "gets loaded into an IDV"? #Closed
Hmm, I think it's more correct to say 'is represented as', because you don't necessarily LOAD a dataset.
In reply to: 204819663
Re "schema": link to the schema section of the IDV Design Principles #Closed
These items above are very similar to the definition of fields in a C# class: names and types of columns correspond to names and types of fields, and metadata can correspond to field attributes.
Because of this similarity, ML.NET offers a common convenient mechanism for creating a schema: it is done via defining a C# class.

For example, the below class definition can be used to define a data view with 5 float columns: |
Re "data view": wondering if it would help to state at the beginning that `IDataView` and 'data view' are interchangeable, because you give the definition of one, and use the other term for it. #Closed
docs/code/SchemaComprehension.md
Outdated
```csharp
.ToArray();
}
```
After this code runs, `arr` will contain two `IrisVectorData` objects, each having `Features` filled with the actual values of the features. |
Re "features": I'd add "(the 4 concatenated columns)" after "features", to make it more explicit. #Closed
docs/code/SchemaComprehension.md
Outdated
```(csharp)
var streamingDv = env.CreateStreamingDataView<IrisData>(dataEnumerable);
```
The only subtle difference is, the resulting `streamingDv` will not support shuffling (a property that's useful to some ML application). |
Re "shuffling": Maybe link to what data shuffling is. #Closed
docs/code/SchemaComprehension.md
Outdated
`IDataView` [type system](IDataViewTypeSystem.md) differs slightly from the C# type system, so a 1-1 mapping between column types and C# types is not always feasible.
Below are the most notable examples of the differences:

* `IDataView` vector columns may have a fixed (and known) size, C# arrays can not. You can use `[VectorType(N)]` attribute to an array field to specify that the column is a vector of fixed size N. This is often necessary: most ML components don't work with variable-size vectors, they require fixed-size ones. |
Re "C# arrays can not": this might get confusing if you think about initialized arrays. #Closed
docs/code/SchemaComprehension.md
Outdated
Below are the most notable examples of the differences:

* `IDataView` vector columns may have a fixed (and known) size, C# arrays can not. You can use `[VectorType(N)]` attribute to an array field to specify that the column is a vector of fixed size N. This is often necessary: most ML components don't work with variable-size vectors, they require fixed-size ones.
* `IDataView`'s **key types** don't have an underlying C# type either. To declare a key-type column, you need to make your field an `uint`, and decorate it with `[KeyType(Min=A, Count=B)]` to denote that the field is a key with the specified range of values. |
Re "**key types**": link maybe #Closed
| `BL` | `DvBool` | `bool`, `bool?` |
| `TS` | `DvTimeSpan` | |
| `DT` | `DvDateTime` | |
| `DZ` | `DvDateTimeZone` | |
They don't map to `TimeSpan`, `DateTime` and `DateTimeZone`? #Resolved
It was our design decision to not allow these scenarios, thus simplifying the other, more common scenarios.

Here is the list of things that are only possible via the low-level interface:
* Creating or reading a data view, where even column *types* are not known at compile time (so you cannot create a C# class to define the schema) |
Re "Creating or reading a data view, where even column types are not known at compile time (so you cannot create a C# class to define the schema)": an example of a scenario when this might occur would help. #Closed
docs/code/SchemaComprehension.md
Outdated
* Creating or reading a data view, where even column *types* are not known at compile time (so you cannot create a C# class to define the schema)
* Reading a different subset of columns on every row: the cursor always populates the entire row object.
* Reading column metadata from the data view.
* Accessing the 'hidden' data view columns by index. |
Re "'hidden' data view column": link or define "hidden" #Closed
docs/code/SchemaComprehension.md
Outdated
Here is the list of things that are only possible via the low-level interface:
* Creating or reading a data view, where even column *types* are not known at compile time (so you cannot create a C# class to define the schema)
* Reading a different subset of columns on every row: the cursor always populates the entire row object. |
Re "different subset of columns on every row": what does 'different' mean here? #Closed
* Added a doc for schema comprehension
Added a document that describes typed schema comprehension.
Fixes #554