Adding EntryPoints.md and GraphRunner.md

sfilipi · sfilipi · commit 5da49a3ed75a · 2018-06-04T11:45:45.000-07:00
diff --git a/docs/code/EntryPoints.md b/docs/code/EntryPoints.md
@@ -0,0 +1,188 @@
+# Overview
+
+An 'entry point' is a JSON representation of an ML.NET type; it is used to serialize and deserialize that type.
+Entry points are also one of the ways ML.NET deserializes experiments, and they are the recommended way to interface with other languages.
+In these terms, an experiment is a DAG of entry points, and each entry point is a node of the experiment graph.
+That's why, throughout the documentation, we also refer to them as 'entry point nodes'.
+The graph 'variables', that is, the values of the JSON properties of the experiment graph, describe the relationships between entry point nodes.
+The 'variables' are therefore the edges of the DAG.
+
+Every ML.NET entry point is described by its manifest. The manifest is another JSON object that documents the structure of an entry point.
+Manifests are consulted to understand what an entry point does and how it should be constructed in a graph.
+
+This document briefly describes the structure of entry points and of entry point manifests, and mentions the ML.NET classes that help construct an entry point
+graph.
+
+## EntryPoint manifest - the definition of an entry point
+
+An example of an entry point manifest object, specifically for the MissingValueIndicator transform, is:
+
+```javascript
+    {
+      "Name": "Transforms.MissingValueIndicator",
+      "Desc": "Create a boolean output column with the same number of slots as the input column, where the output value is true if the value in the input column is missing.",
+      "FriendlyName": "NA Indicator Transform",
+      "ShortName": "NAInd",
+      "Inputs": [
+        {
+          "Name": "Column",
+          "Type": {
+            "Kind": "Array",
+            "ItemType": {
+              "Kind": "Struct",
+              "Fields": [
+                {
+                  "Name": "Name",
+                  "Type": "String",
+                  "Desc": "Name of the new column",
+                  "Aliases": [
+                    "name"
+                  ],
+                  "Required": false,
+                  "SortOrder": 150.0,
+                  "IsNullable": false,
+                  "Default": null
+                },
+                {
+                  "Name": "Source",
+                  "Type": "String",
+                  "Desc": "Name of the source column",
+                  "Aliases": [
+                    "src"
+                  ],
+                  "Required": false,
+                  "SortOrder": 150.0,
+                  "IsNullable": false,
+                  "Default": null
+                }
+              ]
+            }
+          },
+          "Desc": "New column definition(s) (optional form: name:src)",
+          "Aliases": [
+            "col"
+          ],
+          "Required": true,
+          "SortOrder": 1.0,
+          "IsNullable": false
+        },
+        {
+          "Name": "Data",
+          "Type": "DataView",
+          "Desc": "Input dataset",
+          "Required": true,
+          "SortOrder": 1.0,
+          "IsNullable": false
+        }
+      ],
+      "Outputs": [
+        {
+          "Name": "OutputData",
+          "Type": "DataView",
+          "Desc": "Transformed dataset"
+        },
+        {
+          "Name": "Model",
+          "Type": "TransformModel",
+          "Desc": "Transform model"
+        }
+      ],
+      "InputKind": [
+        "ITransformInput"
+      ],
+      "OutputKind": [
+        "ITransformOutput"
+      ]
+    }
+```
+
+The corresponding entry point, constructed from this manifest, would be:
+
+```javascript
+	{
+		"Name": "Transforms.MissingValueIndicator",
+		"Inputs": {
+			"Column": [
+				{
+					"Name": "Features",
+					"Source": "Features"
+				}
+			],
+			"Data": "$data0"
+		},
+		"Outputs": {
+			"OutputData": "$Output_1528136517433",
+			"Model": "$TransformModel_1528136517433"
+		}
+	}
+```
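One way to read the manifest is as a checklist for a node. The following is a minimal Python sketch, not ML.NET code: the helper name `missing_required_inputs` is hypothetical, and the manifest is trimmed to the `Name` and `Required` fields of the example above.

```python
import json

# Trimmed copy of the MissingValueIndicator manifest above (only the fields used here).
MANIFEST = json.loads("""
{
  "Name": "Transforms.MissingValueIndicator",
  "Inputs": [
    {"Name": "Column", "Required": true},
    {"Name": "Data", "Required": true}
  ]
}
""")

# The entry point node from the example above.
NODE = json.loads("""
{
  "Name": "Transforms.MissingValueIndicator",
  "Inputs": {
    "Column": [{"Name": "Features", "Source": "Features"}],
    "Data": "$data0"
  },
  "Outputs": {
    "OutputData": "$Output_1528136517433",
    "Model": "$TransformModel_1528136517433"
  }
}
""")


def missing_required_inputs(manifest, node):
    """Return names of required manifest inputs that the node does not set.

    Hypothetical helper for illustration; not part of ML.NET.
    """
    if manifest["Name"] != node["Name"]:
        raise ValueError("node and manifest describe different entry points")
    provided = node.get("Inputs", {})
    return [
        spec["Name"]
        for spec in manifest["Inputs"]
        if spec.get("Required", False) and spec["Name"] not in provided
    ]


print(missing_required_inputs(MANIFEST, NODE))  # prints []
```

A node that omitted `Column` would be reported as incomplete, because the manifest marks that input as required.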
+
+## `EntryPointGraph`
+
+This class encapsulates the list of nodes (`EntryPointNode`) and edges
+(`EntryPointVariable` inside a `RunContext`) of the graph.
+
+## `EntryPointNode`
+
+This class represents a node in the graph, and wraps an entry point call. It
+has methods for creating and running entry points. It also has a reference to
+the `RunContext` to allow it to get and set values from `EntryPointVariable`s.
+
+To express the inputs that are set through variables, a set of dictionaries
+is used. The `InputBindingMap` maps an input parameter name to a list of
+`ParameterBinding`s. The `InputMap` maps a `ParameterBinding` to a
+`VariableBinding`. For example, if the JSON looks like this:
+
+```javascript
+"foo": "$bar"
+```
+
+the `InputBindingMap` will have one entry that maps the string "foo" to a list
+that has only one element, a `SimpleParameterBinding` with the name "foo" and
+the `InputMap` will map the `SimpleParameterBinding` to a
+`SimpleVariableBinding` with the name "bar". For a more complicated example,
+let's say we have this JSON:
+
+```javascript
+"foo": [ "$bar[3]", "$baz" ]
+```
+
+the `InputBindingMap` will have one entry that maps the string "foo" to a list
+that has two elements, an `ArrayIndexParameterBinding` with the name "foo" and
+index 0 and another one with index 1. The `InputMap` will map the first
+`ArrayIndexParameterBinding` to an `ArrayIndexVariableBinding` with name "bar"
+and index 3 and the second `ArrayIndexParameterBinding` to a
+`SimpleVariableBinding` with the name "baz".
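The two examples above can be sketched in Python. The dataclasses below are illustrative stand-ins that borrow the class names from this section; they are not the ML.NET types, and `parse_variable` and `bind_inputs` are hypothetical helpers. Dictionary bindings are left out to keep the sketch short.

```python
import re
from dataclasses import dataclass


# Illustrative stand-ins named after the classes described above.
@dataclass(frozen=True)
class SimpleParameterBinding:
    name: str


@dataclass(frozen=True)
class ArrayIndexParameterBinding:
    name: str
    index: int


@dataclass(frozen=True)
class SimpleVariableBinding:
    name: str


@dataclass(frozen=True)
class ArrayIndexVariableBinding:
    name: str
    index: int


def is_variable(value):
    """A variable is a JSON string that begins with '$'."""
    return isinstance(value, str) and value.startswith("$")


def parse_variable(text):
    """Turn '$bar' or '$bar[3]' into a variable binding."""
    match = re.fullmatch(r"\$(\w+)(?:\[(\d+)\])?", text)
    if match is None:
        raise ValueError(f"not a variable reference: {text!r}")
    name, index = match.groups()
    if index is None:
        return SimpleVariableBinding(name)
    return ArrayIndexVariableBinding(name, int(index))


def bind_inputs(inputs):
    """Build (input_binding_map, input_map) for the variable-valued inputs of a node."""
    input_binding_map = {}  # parameter name -> list of parameter bindings
    input_map = {}          # parameter binding -> variable binding
    for name, value in inputs.items():
        if is_variable(value):
            pairs = [(SimpleParameterBinding(name), value)]
        elif isinstance(value, list):
            pairs = [(ArrayIndexParameterBinding(name, i), item)
                     for i, item in enumerate(value) if is_variable(item)]
        else:
            pairs = []  # literal values need no binding
        if pairs:
            input_binding_map[name] = [param for param, _ in pairs]
            for param, text in pairs:
                input_map[param] = parse_variable(text)
    return input_binding_map, input_map


binding_map, input_map = bind_inputs({"foo": ["$bar[3]", "$baz"]})
for param, variable in input_map.items():
    print(param, "->", variable)
```

The frozen dataclasses are hashable, which is what lets a parameter binding serve as a dictionary key in `input_map`.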
+
+For outputs, a node assumes that an output is mapped to a variable, so the
+`OutputMap` is a simple dictionary from string to string.
+
+## `EntryPointVariable`
+
+This class represents an edge in the entry point graph. It has a name, a type
+and a value. Variables can be simple, arrays and/or dictionaries. Currently,
+only data views, file handles, predictor models and transform models are
+allowed as element types for a variable.
+
+## `RunContext`
+
+This class is just a container for all the variables in a graph.
+
+## VariableBinding and Derived Classes
+
+The abstract base class represents a "pointer to a (part of a) variable". It
+is used in conjunction with `ParameterBinding`s to specify inputs to an entry
+point node. The `SimpleVariableBinding` is a pointer to an entire variable,
+the `ArrayIndexVariableBinding` is a pointer to a specific index in an array
+variable, and the `DictionaryKeyVariableBinding` is a pointer to a specific
+key in a dictionary variable.
+
+## ParameterBinding and Derived Classes
+
+The abstract base class represents a "pointer to a (part of a) parameter". It
+parallels the `VariableBinding` hierarchy and it is used to specify the inputs
+to an entry point node. The `SimpleParameterBinding` is a pointer to a
+non-array, non-dictionary parameter, the `ArrayIndexParameterBinding` is a
+pointer to a specific index of an array parameter and the
+`DictionaryKeyParameterBinding` is a pointer to a specific key of a dictionary
+parameter.
diff --git a/docs/code/GraphRunner.md b/docs/code/GraphRunner.md
@@ -0,0 +1,123 @@
+# JSON Graph format
+
+The entry point graph in TLC is an array of _nodes_. Each node is an object with the following fields:
+
+- _name_: string. Required. Name of the entry point.
+- _inputs_: object. Optional. Specifies non-default inputs to the entry point.
+Note that if the entry point has required inputs (which is very common), the _inputs_ field is required.
+- _outputs_: object. Optional. Specifies the variables that will hold the node's outputs.
+
+## Input and output types
+The following types are supported in JSON graphs:
+
+- _string_. Represented as a JSON string, maps to a C# string.
+- _float_. Represented as a JSON float, maps to a C# float or double.
+- _bool_. Represented as a JSON bool, maps to a C# bool.
+- _enum_. Represented as a JSON string, maps to a C# enum. The allowed values are those of the C# enum (they are also listed in the manifest).
+- _int_. Currently not implemented. Represented as a JSON integer, maps to a C# int or long.
+- _array_ of the above. Represented as a JSON array, maps to a C# array.
+- _dictionary_. Currently not implemented. Represented as a JSON object, maps to a C# `Dictionary<string,T>`.
+- _component_. Currently not implemented. Represented as a JSON object with 2 fields: _name_:string and _settings_:object.
+
+## Variables
+The following input/output types cannot be represented as a JSON value:
+- _DataView_
+- _FileHandle_
+- _TransformModel_
+- _PredictorModel_
+
+These must be passed as _variables_. The variable is represented as a JSON string that begins with "$". 
+Note the following rules:
+
+- A variable can appear in the _outputs_ only once per graph. That is, the variable can be 'assigned' only once. 
+- If the variable is present in _inputs_ of one node and in the _outputs_ of another node, this signifies the graph 'edge'. 
+The same variable can participate in many edges.
+- If the variable is present only in _inputs_, but never in _outputs_, it is a _graph input_. All graph inputs must be provided before
+a graph can be run.
+- The variable has a type, which is the type of the inputs (and, optionally, the output) in which it appears. If the type of the variable is
+ambiguous, TLC throws an exception.
+- Circular references. The experiment graph is expected to be a DAG. If a circular dependency is detected, TLC throws an exception.
+_Currently, this is done lazily: if we couldn't ever run a node because it's waiting for inputs, we throw._
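The rules above can be sketched as a small planner. This is a Python illustration under stated assumptions, not the TLC implementation: `variables_in` and `plan` are hypothetical helpers, the node names `A` and `B` are made up, and type checking is omitted.

```python
def variables_in(value):
    """Yield variable names (without '$') found in a JSON inputs/outputs value."""
    if isinstance(value, str) and value.startswith("$"):
        yield value[1:].split("[", 1)[0]  # '$var[5]' refers to variable 'var'
    elif isinstance(value, list):
        for item in value:
            yield from variables_in(item)
    elif isinstance(value, dict):
        for item in value.values():
            yield from variables_in(item)


def plan(nodes, graph_inputs):
    """Return node names in a runnable order, or raise if a rule is violated."""
    # Rule: a variable may appear in outputs only once per graph.
    assigned = set()
    for node in nodes:
        for var in variables_in(node.get("outputs", {})):
            if var in assigned:
                raise ValueError(f"variable ${var} is assigned more than once")
            assigned.add(var)

    # Rule: variables that are only read are graph inputs and must be provided.
    used = {v for node in nodes for v in variables_in(node.get("inputs", {}))}
    missing = used - assigned - set(graph_inputs)
    if missing:
        raise ValueError(f"graph inputs not provided: {sorted(missing)}")

    available = set(graph_inputs)
    pending = list(nodes)
    order = []
    while pending:
        ready = [n for n in pending
                 if set(variables_in(n.get("inputs", {}))) <= available]
        if not ready:
            # Lazy cycle detection: no node can run, so some node waits forever.
            raise ValueError("circular dependency between nodes")
        for node in ready:
            order.append(node["name"])
            available.update(variables_in(node.get("outputs", {})))
            pending.remove(node)
    return order


# Hypothetical two-node graph: B consumes the variable that A produces.
NODES = [
    {"name": "B", "inputs": {"Data": "$mid"}, "outputs": {"OutputData": "$out"}},
    {"name": "A", "inputs": {"Data": "$data1"}, "outputs": {"OutputData": "$mid"}},
]
print(plan(NODES, {"data1"}))  # prints ['A', 'B']
```

Here `$mid` forms the edge from `A` to `B`, and `$data1` is a graph input because no node assigns it.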
+
+### Variables for arrays and dictionaries
+It is allowed to define variables for arrays and dictionaries, as long as the item types are valid variable types (the four types listed above).
+They are treated the same way as regular 'scalar' variables.
+
+If we want to reference an item of the collection, we can use the `[]` syntax:
+- `$var[5]` denotes the element at index 5 of an array variable.
+- `$var[foo]` and `$var['foo']` both denote the element with key 'foo' of a dictionary variable.
+_This is not yet implemented._
+
+Conversely, if we want to build a collection (array or dictionary) of variables, we can do it using JSON arrays and objects:
+- `["$v1", "$v2", "$v3"]` denotes an array containing 3 variables.
+- `{"foo": "$v1", "bar": "$v2"}` denotes a dictionary containing 2 key-value pairs.
+_This is also not yet implemented._
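As a rough Python illustration of how this syntax could be interpreted (not the TLC behavior, which the notes above mark as not yet implemented), the sketch below resolves references against a plain dictionary of variable values. The helpers `resolve` and `build` are hypothetical, array indices are assumed to be zero-based, and strings and numbers stand in for real variable values such as data views.

```python
import re

# '$name', optionally followed by '[key]'.
REFERENCE = re.compile(r"\$(?P<name>\w+)(?:\[(?P<key>[^\]]+)\])?")


def resolve(reference, values):
    """Look up '$var', '$var[5]', '$var[foo]' or "$var['foo']" in `values`."""
    match = REFERENCE.fullmatch(reference)
    if match is None:
        raise ValueError(f"not a variable reference: {reference!r}")
    value = values[match["name"]]
    key = match["key"]
    if key is None:
        return value                # the whole variable
    if isinstance(value, list):
        return value[int(key)]      # assumption: zero-based array index
    return value[key.strip("'")]    # dictionary key, quoted or not


def build(value, values):
    """Replace every variable reference inside a JSON array or object."""
    if isinstance(value, str) and value.startswith("$"):
        return resolve(value, values)
    if isinstance(value, list):
        return [build(item, values) for item in value]
    if isinstance(value, dict):
        return {key: build(item, values) for key, item in value.items()}
    return value


# Placeholder values for illustration only.
VALUES = {
    "var": ["a", "b", "c", "d", "e", "f"],
    "d": {"foo": "x"},
    "v1": 1, "v2": 2, "v3": 3,
}
print(resolve("$var[5]", VALUES))
print(build({"foo": "$v1", "bar": "$v2"}, VALUES))
```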
+
+## Example of a JSON entry point manifest object, and the respective entry point graph node
+Let's consider the following manifest snippet, describing an entry point _'CVSplit.Split'_:
+```
+    {
+      "name": "CVSplit.Split",
+      "desc": "Split the dataset into the specified number of cross-validation folds (train and test sets)",
+      "inputs": [
+        {
+          "name": "Data",
+          "type": "DataView",
+          "desc": "Input dataset",
+          "required": true
+        },
+        {
+          "name": "NumFolds",
+          "type": "Int",
+          "desc": "Number of folds to split into",
+          "required": false,
+          "default": 2
+        },
+        {
+          "name": "StratificationColumn",
+          "type": "String",
+          "desc": "Stratification column",
+          "aliases": [
+            "strat"
+          ],
+          "required": false,
+          "default": null
+        }
+      ],
+      "outputs": [
+        {
+          "name": "TrainData",
+          "type": {
+            "kind": "Array",
+            "itemType": "DataView"
+          },
+          "desc": "Training data (one dataset per fold)"
+        },
+        {
+          "name": "TestData",
+          "type": {
+            "kind": "Array",
+            "itemType": "DataView"
+          },
+          "desc": "Testing data (one dataset per fold)"
+        }
+      ]
+    }
+```
+
+As we can see, the entry point has 3 inputs (one of them required), and 2 outputs.
+The following is a correct graph containing a call to this entry point:
+```
+{
+  "nodes": [
+    {
+      "name": "CVSplit.Split",
+      "inputs": {
+        "Data": "$data1"
+      },
+      "outputs": {
+        "TrainData": "$cv"
+      }
+    }]
+}
+```
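The graph above is valid even though it sets only one of the three inputs, because the other two are optional and have defaults in the manifest. A minimal Python sketch of that reasoning follows; `effective_inputs` is a hypothetical helper, not part of TLC, and the manifest is trimmed to the fields it needs.

```python
import json

# Inputs of the 'CVSplit.Split' manifest above, trimmed to the fields used here.
MANIFEST_INPUTS = [
    {"name": "Data", "required": True},
    {"name": "NumFolds", "required": False, "default": 2},
    {"name": "StratificationColumn", "required": False, "default": None},
]

# The graph from the example above.
GRAPH = json.loads("""
{
  "nodes": [
    {
      "name": "CVSplit.Split",
      "inputs": {
        "Data": "$data1"
      },
      "outputs": {
        "TrainData": "$cv"
      }
    }]
}
""")


def effective_inputs(manifest_inputs, node):
    """Combine the inputs a node sets with the manifest defaults for the rest."""
    given = node.get("inputs", {})
    result = {}
    for spec in manifest_inputs:
        name = spec["name"]
        if name in given:
            result[name] = given[name]
        elif spec["required"]:
            raise ValueError(f"required input {name!r} is not set")
        else:
            result[name] = spec.get("default")
    return result


print(effective_inputs(MANIFEST_INPUTS, GRAPH["nodes"][0]))
# prints {'Data': '$data1', 'NumFolds': 2, 'StratificationColumn': None}
```

The node also leaves `TestData` unassigned, which is allowed because the _outputs_ field is optional.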