Adding documentation about entry points, and entry points graphs: EntryPoints.md and GraphRunner.md #295
Changes from 3 commits
5da49a3
63b9fe8
73cb7c8
a962fe9
12d3537
cca4f43
55174f3
14a727c
e9b3a11
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,295 @@ | ||
# Entry Points And Helper Classes | ||
|
||
## Overview | ||
|
||
An 'entry point' is a representation of an ML.NET type in JSON format. Entry points are used to serialize and deserialize an ML.NET type in JSON. | ||
They are also the recommended way to interface with other languages. | ||
Experiments are defined in terms of entry points: an experiment is a directed acyclic graph (DAG) of entry points, and each entry point is a node of that graph. | ||
Could this be rephrased? I'm not quite sure what it is meant to express. Experiments #Closed |
||
That's why, throughout the documentation, we also refer to them as 'entry point nodes'. | ||
The graph 'variables', which appear as values of the experiment graph's JSON properties, describe the relationships between the entry point nodes. | ||
The 'variables' are therefore the edges of the DAG. | ||
Introduce the acronym "directed acyclic graph" #Resolved |
||
|
||
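To make the graph picture concrete, here is a minimal sketch in JavaScript, assuming nodes shaped like entry point JSON (`Name`, `Inputs`, `Outputs`) and variables written as `$name` strings. The helpers `collectVariables` and `findEdges`, and the second node `Example.Consumer`, are hypothetical and exist only for illustration; they are not part of ML.NET.

```javascript
// Hypothetical two-node experiment. "Example.Consumer" is a made-up
// entry point name used only to show how nodes connect.
const nodes = [
  {
    Name: "Transforms.ColumnTypeConverter",
    Inputs: { Data: "$data0", ResultType: "R4" },
    Outputs: { OutputData: "$Convert_Output" },
  },
  {
    Name: "Example.Consumer",
    Inputs: { Data: "$Convert_Output" },
    Outputs: { OutputData: "$final" },
  },
];

// Collect every "$variable" string nested anywhere in a JSON value.
function collectVariables(value, found = []) {
  if (typeof value === "string") {
    if (value.startsWith("$")) found.push(value.slice(1));
  } else if (Array.isArray(value)) {
    value.forEach((item) => collectVariables(item, found));
  } else if (value !== null && typeof value === "object") {
    Object.values(value).forEach((item) => collectVariables(item, found));
  }
  return found;
}

// An edge exists when a variable written by one node is read by another.
function findEdges(graphNodes) {
  const producerOf = new Map();
  for (const node of graphNodes) {
    for (const name of collectVariables(node.Outputs)) {
      producerOf.set(name, node.Name);
    }
  }
  const edges = [];
  for (const node of graphNodes) {
    for (const name of collectVariables(node.Inputs)) {
      if (producerOf.has(name)) {
        edges.push({ from: producerOf.get(name), to: node.Name, variable: name });
      }
    }
  }
  return edges;
}

console.log(findEdges(nodes));
```

In this sketch, `$data0` has no producing node, so it is treated as an input to the whole graph rather than as an edge between two nodes.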
Every ML.NET entry point is described by its manifest. The manifest is another JSON object that documents and describes the structure of an entry point. | ||
Manifests are consulted to understand what an entry point does and how it should be constructed in a graph. | ||
|
||
This document briefly describes the structure of the entry points, the structure of an entry point manifest, and mentions the ML.NET classes that help construct an entry point | ||
graph. | ||
|
||
## EntryPoint manifest - the definition of an entry point | ||
|
||
An example of an entry point manifest object, specifically for the `ColumnTypeConverter` transform, is: | ||
Consider using code formatting for class names. #Resolved |
||
|
||
```javascript | ||
This is how it is actually written out, but I wonder if we could just format it a bit to make it more tolerable. The document is dominated by this ~180 line monstrosity. I think it could be improved significantly by just deleting a bunch of whitespace. For example, for the stuff on lines 40 through 65, we could make it look more like this to save a bunch of lines:
"Values": ["I1", "U1", "I2", "U2", "I4", "U4", "I8", "U8",
"R4", "Num", "R8", "TX", "Text", "TXT", "BL", "Bool",
"TimeSpan", "TS", "DT", "DateTime", "DZ", "DateTimeZone",
"UG", "U16"]
Basically, I suppose I'd say that if it looked more like someone actually wrote it rather than code-generated it, it would be a lot easier to appreciate and comprehend. I think we can get it all to fit on one page. Sometimes more length cannot be helped, but in general, and especially for the first example, I think it's important that it fit on one page. #Closed |
||
{ | ||
"Name": "Transforms.ColumnTypeConverter", | ||
"Desc": "Converts a column to a different type, using standard conversions.", | ||
"FriendlyName": "Convert Transform", | ||
"ShortName": "Convert", | ||
"Inputs": [ | ||
{ | ||
"Name": "Column", | ||
"Type": { | ||
"Kind": "Array", | ||
"ItemType": { | ||
"Kind": "Struct", | ||
"Fields": [ | ||
{ | ||
"Name": "ResultType", | ||
"Type": { | ||
"Kind": "Enum", | ||
"Values": [ | ||
"I1", | ||
"U1", | ||
"I2", | ||
"U2", | ||
"I4", | ||
"U4", | ||
"I8", | ||
"U8", | ||
"R4", | ||
"Num", | ||
"R8", | ||
"TX", | ||
"Text", | ||
"TXT", | ||
"BL", | ||
"Bool", | ||
"TimeSpan", | ||
"TS", | ||
"DT", | ||
"DateTime", | ||
"DZ", | ||
"DateTimeZone", | ||
"UG", | ||
"U16" | ||
] | ||
}, | ||
"Desc": "The result type", | ||
"Aliases": [ | ||
"type" | ||
], | ||
"Required": false, | ||
"SortOrder": 150, | ||
"IsNullable": true, | ||
"Default": null | ||
}, | ||
{ | ||
"Name": "Range", | ||
"Type": "String", | ||
"Desc": "For a key column, this defines the range of values", | ||
"Aliases": [ | ||
"key" | ||
], | ||
"Required": false, | ||
"SortOrder": 150, | ||
"IsNullable": false, | ||
"Default": null | ||
}, | ||
{ | ||
"Name": "Name", | ||
"Type": "String", | ||
"Desc": "Name of the new column", | ||
"Aliases": [ | ||
"name" | ||
], | ||
"Required": false, | ||
"SortOrder": 150, | ||
"IsNullable": false, | ||
"Default": null | ||
}, | ||
{ | ||
"Name": "Source", | ||
"Type": "String", | ||
"Desc": "Name of the source column", | ||
"Aliases": [ | ||
"src" | ||
], | ||
"Required": false, | ||
"SortOrder": 150, | ||
"IsNullable": false, | ||
"Default": null | ||
} | ||
] | ||
} | ||
}, | ||
"Desc": "New column definition(s) (optional form: name:type:src)", | ||
"Aliases": [ | ||
"col" | ||
], | ||
"Required": true, | ||
"SortOrder": 1, | ||
"IsNullable": false | ||
}, | ||
{ | ||
"Name": "Data", | ||
"Type": "DataView", | ||
"Desc": "Input dataset", | ||
"Required": true, | ||
"SortOrder": 2, | ||
"IsNullable": false | ||
}, | ||
{ | ||
"Name": "ResultType", | ||
"Type": { | ||
"Kind": "Enum", | ||
"Values": [ | ||
"I1", | ||
"U1", | ||
"I2", | ||
"U2", | ||
"I4", | ||
"U4", | ||
"I8", | ||
"U8", | ||
"R4", | ||
"Num", | ||
"R8", | ||
"TX", | ||
"Text", | ||
"TXT", | ||
"BL", | ||
"Bool", | ||
"TimeSpan", | ||
"TS", | ||
"DT", | ||
"DateTime", | ||
"DZ", | ||
"DateTimeZone", | ||
"UG", | ||
"U16" | ||
] | ||
}, | ||
"Desc": "The result type", | ||
"Aliases": [ | ||
"type" | ||
], | ||
"Required": false, | ||
"SortOrder": 2, | ||
"IsNullable": true, | ||
"Default": null | ||
}, | ||
{ | ||
"Name": "Range", | ||
"Type": "String", | ||
"Desc": "For a key column, this defines the range of values", | ||
"Aliases": [ | ||
"key" | ||
], | ||
"Required": false, | ||
"SortOrder": 150, | ||
"IsNullable": false, | ||
"Default": null | ||
} | ||
], | ||
"Outputs": [ | ||
{ | ||
"Name": "OutputData", | ||
"Type": "DataView", | ||
"Desc": "Transformed dataset" | ||
}, | ||
{ | ||
"Name": "Model", | ||
"Type": "TransformModel", | ||
"Desc": "Transform model" | ||
} | ||
], | ||
"InputKind": [ | ||
"ITransformInput" | ||
], | ||
"OutputKind": [ | ||
"ITransformOutput" | ||
] | ||
} | ||
``` | ||
|
||
The corresponding entry point, constructed based on this manifest, would be: | ||
|
||
```javascript | ||
{ | ||
"Name": "Transforms.ColumnTypeConverter", | ||
"Inputs": { | ||
"Column": [ | ||
{ | ||
"Name": "Features", | ||
"Source": "Features" | ||
} | ||
], | ||
"Data": "$data0", | ||
"ResultType": "R4" | ||
}, | ||
"Outputs": { | ||
"OutputData": "$Convert_Output", | ||
"Model": "$Convert_TransformModel" | ||
} | ||
} | ||
``` | ||
|
||
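As a rough illustration of how a manifest can guide the construction of an entry point, the sketch below checks an entry point object against an abbreviated copy of the manifest above. The function `checkAgainstManifest` is hypothetical and not an ML.NET API. It only looks at `Name`, `Required`, and the declared input and output names, and it ignores `Aliases` and type checking entirely.

```javascript
// Abbreviated manifest: only the fields this check needs.
const converterManifest = {
  Name: "Transforms.ColumnTypeConverter",
  Inputs: [
    { Name: "Column", Required: true },
    { Name: "Data", Required: true },
    { Name: "ResultType", Required: false },
    { Name: "Range", Required: false },
  ],
  Outputs: [{ Name: "OutputData" }, { Name: "Model" }],
};

// The entry point from the example above.
const converterNode = {
  Name: "Transforms.ColumnTypeConverter",
  Inputs: {
    Column: [{ Name: "Features", Source: "Features" }],
    Data: "$data0",
    ResultType: "R4",
  },
  Outputs: {
    OutputData: "$Convert_Output",
    Model: "$Convert_TransformModel",
  },
};

// Return a list of human-readable problems; an empty list means the
// node is consistent with the manifest as far as this sketch can tell.
function checkAgainstManifest(manifest, node) {
  const problems = [];
  if (node.Name !== manifest.Name) {
    problems.push(`Name mismatch: ${node.Name} vs ${manifest.Name}`);
  }
  const inputs = node.Inputs || {};
  const outputs = node.Outputs || {};
  const knownInputs = new Set(manifest.Inputs.map((input) => input.Name));
  const knownOutputs = new Set(manifest.Outputs.map((output) => output.Name));
  for (const input of manifest.Inputs) {
    if (input.Required && !(input.Name in inputs)) {
      problems.push(`Missing required input: ${input.Name}`);
    }
  }
  for (const name of Object.keys(inputs)) {
    if (!knownInputs.has(name)) problems.push(`Unknown input: ${name}`);
  }
  for (const name of Object.keys(outputs)) {
    if (!knownOutputs.has(name)) problems.push(`Unknown output: ${name}`);
  }
  return problems;
}

console.log(checkAgainstManifest(converterManifest, converterNode));
```

Returning a list of problems rather than throwing on the first one makes it easier to report every mismatch in a hand-written graph at once.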
## `EntryPointGraph` | ||
|
||
This class encapsulates the list of nodes (`EntryPointNode`) and edges | ||
(`EntryPointVariable` inside a `RunContext`) of the graph. | ||
|
||
## `EntryPointNode` | ||
|
||
This class represents a node in the graph, and wraps an entry point call. It | ||
has methods for creating and running entry points. It also has a reference to | ||
the `RunContext` to allow it to get and set values from `EntryPointVariable`s. | ||
|
||
To express the inputs that are set through variables, a set of dictionaries | ||
is used. The `InputBindingMap` maps an input parameter name to a list of | ||
`ParameterBinding`s. The `InputMap` maps a `ParameterBinding` to a | ||
`VariableBinding`. For example, if the JSON looks like this: | ||
|
||
```javascript | ||
'foo': '$bar' | ||
``` | ||
|
||
the `InputBindingMap` will have one entry that maps the string "foo" to a list | ||
that has only one element, a `SimpleParameterBinding` with the name "foo" and | ||
the `InputMap` will map the `SimpleParameterBinding` to a | ||
`SimpleVariableBinding` with the name "bar". For a more complicated example, | ||
let's say we have this JSON: | ||
|
||
```javascript | ||
'foo': [ '$bar[3]', '$baz'] | ||
``` | ||
|
||
the `InputBindingMap` will have one entry that maps the string "foo" to a list | ||
that has two elements, an `ArrayIndexParameterBinding` with the name "foo" and | ||
index 0 and another one with index 1. The `InputMap` will map the first | ||
`ArrayIndexParameterBinding` to an `ArrayIndexVariableBinding` with name "bar" | ||
and index 3 and the second `ArrayIndexParameterBinding` to a | ||
`SimpleVariableBinding` with the name "baz". | ||
|
||
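The two examples above can be sketched in JavaScript as follows. This is a simplified model, not the ML.NET implementation: bindings are plain objects tagged with a `kind` string that reuses the class names from the text, and `parseVariableBinding` and `bindInput` are hypothetical helpers. Only the `$name` and `$name[index]` forms shown above are handled.

```javascript
// Parse "$name" or "$name[3]" into a plain-object variable binding.
function parseVariableBinding(text) {
  const match = /^\$(\w+)(?:\[(\d+)\])?$/.exec(text);
  if (!match) throw new Error(`Not a variable reference: ${text}`);
  const [, name, index] = match;
  if (index === undefined) {
    return { kind: "SimpleVariableBinding", name };
  }
  return { kind: "ArrayIndexVariableBinding", name, index: Number(index) };
}

// Build both maps for a single input property of an entry point.
// inputBindingMap: parameter name -> list of parameter bindings
// inputMap:        parameter binding (object identity) -> variable binding
function bindInput(parameterName, value) {
  const inputBindingMap = new Map();
  const inputMap = new Map();
  const isArray = Array.isArray(value);
  const items = isArray ? value : [value];
  const parameterBindings = items.map((text, index) => {
    const parameterBinding = isArray
      ? { kind: "ArrayIndexParameterBinding", name: parameterName, index }
      : { kind: "SimpleParameterBinding", name: parameterName };
    inputMap.set(parameterBinding, parseVariableBinding(text));
    return parameterBinding;
  });
  inputBindingMap.set(parameterName, parameterBindings);
  return { inputBindingMap, inputMap };
}

// 'foo': [ '$bar[3]', '$baz' ]
const { inputBindingMap, inputMap } = bindInput("foo", ["$bar[3]", "$baz"]);
for (const parameterBinding of inputBindingMap.get("foo")) {
  console.log(parameterBinding, "->", inputMap.get(parameterBinding));
}
```

A JavaScript `Map` keyed by object identity stands in for a dictionary keyed by `ParameterBinding` here; a real implementation would define equality on the binding itself.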
For outputs, a node assumes that an output is mapped to a variable, so the | ||
`OutputMap` is a simple dictionary from string to string. | ||
|
||
## `EntryPointVariable` | ||
|
||
This class represents an edge in the entry point graph. It has a name, a type | ||
and a value. Variables can be simple, arrays and/or dictionaries. Currently, | ||
only data views, file handles, predictor models and transform models are | ||
allowed as element types for a variable. | ||
|
||
## `RunContext` | ||
|
||
This class is just a container for all the variables in a graph. | ||
|
||
## `VariableBinding` and Derived Classes | ||
|
||
The abstract base class represents a "pointer to a (part of a) variable". It | ||
is used in conjunction with `ParameterBinding`s to specify inputs to an entry | ||
point node. The `SimpleVariableBinding` is a pointer to an entire variable, | ||
the `ArrayIndexVariableBinding` is a pointer to a specific index in an array | ||
variable, and the `DictionaryKeyVariableBinding` is a pointer to a specific | ||
key in a dictionary variable. | ||
|
||
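One way to picture the three kinds of variable binding is as three ways of reading from the run context. The sketch below is a simplified model, assuming the run context is a plain `Map` from variable name to a `{ type, value }` record. The `resolveBinding` function, the variable names, and the string placeholder values are all hypothetical.

```javascript
// Hypothetical run context: variable name -> { type, value }.
// Strings stand in for real data views and models.
const runContext = new Map([
  ["data0", { type: "DataView", value: "view-0" }],
  ["bar", { type: "DataView[]", value: ["view-a", "view-b", "view-c", "view-d"] }],
  ["models", { type: "Dictionary", value: { best: "model-1" } }],
]);

// Follow a variable binding to the value (or part of a value) it points at.
function resolveBinding(binding, context) {
  const variable = context.get(binding.name);
  if (variable === undefined) {
    throw new Error(`Unknown variable: ${binding.name}`);
  }
  switch (binding.kind) {
    case "SimpleVariableBinding":
      return variable.value; // the entire variable
    case "ArrayIndexVariableBinding":
      return variable.value[binding.index]; // one element of an array variable
    case "DictionaryKeyVariableBinding":
      return variable.value[binding.key]; // one entry of a dictionary variable
    default:
      throw new Error(`Unsupported binding kind: ${binding.kind}`);
  }
}

console.log(resolveBinding({ kind: "SimpleVariableBinding", name: "data0" }, runContext));
console.log(resolveBinding({ kind: "ArrayIndexVariableBinding", name: "bar", index: 3 }, runContext));
console.log(resolveBinding({ kind: "DictionaryKeyVariableBinding", name: "models", key: "best" }, runContext));
```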
## `ParameterBinding` and Derived Classes | ||
|
||
The abstract base class represents a "pointer to a (part of a) parameter". It | ||
parallels the `VariableBinding` hierarchy and it is used to specify the inputs | ||
to an entry point node. The `SimpleParameterBinding` is a pointer to a | ||
non-array, non-dictionary parameter, the `ArrayIndexParameterBinding` is a | ||
pointer to a specific index of an array parameter and the | ||
`DictionaryKeyParameterBinding` is a pointer to a specific key of a dictionary | ||
parameter. |
I am not entirely enthusiastic about that description. I think the primary reason why I don't like it is, I think the phrase ML.NET type is misleading, or at least vague. If I were asked what an ML.NET type is I might say something like `VBuffer` or `IDataView`, and to me a representation as JSON makes me think that thing is being serialized, which is not the point of entry-points at all.
So maybe, we could replace a lot of this language with something like this (I don't insist on this exact wording):