# ML.NET high-level concepts

In this document, we give a brief overview of the ML.NET high-level concepts. This document is mainly intended to describe the *model training* scenarios in ML.NET, since not all these concepts are relevant for the simpler scenario of *prediction with an existing model*.

## List of high-level concepts

This document is going to cover the following ML.NET concepts:

- *Data*, represented as an `IDataView` interface.
  - In ML.NET, data is very similar to a SQL view: it's a lazily evaluated, immutable, cursorable, heterogeneous, schematized dataset.
  - An excellent document about the data interface is [IDataView Design Principles](IDataViewDesignPrinciples.md).
- *Transformer*, represented as an `ITransformer` interface.
  - In one sentence, a transformer is a component that takes data, does some work on it, and returns new 'transformed' data.
  - For example, you can think of a machine learning model as a transformer that takes features and returns predictions.
  - As another example, a 'text tokenizer' would take a single text column and output a vector column with the individual 'words' extracted out of the texts.
- *Data reader*, represented as an `IDataReader<T>` interface.
  - The data reader is an ML.NET component that 'creates' data: it takes an instance of `T` and returns data out of it.
  - For example, a *TextLoader* is an `IDataReader<FileSource>`: it takes the file source and produces data.
- *Estimator*, represented as an `IEstimator<T>` interface.
  - This is an object that learns from data. The result of the learning is a *transformer*.
  - You can think of a machine learning *algorithm* as an estimator that learns on data and produces a machine learning *model* (which is a transformer).
- *Prediction function*, represented as a `PredictionFunction<TSrc, TDst>` class.
  - The prediction function can be seen as a machine that applies a transformer to one 'row', such as at prediction time.

## Data

In ML.NET, data is very similar to a SQL view: it's a lazily evaluated, immutable, cursorable, heterogeneous, schematized dataset.

- It has a *Schema* (an instance of an `ISchema` interface) that contains the information about the data view's columns.
  - Each column has a *Name*, a *Type*, and an arbitrary set of *metadata* associated with it.
  - It is important to note that one of the types is the `vector<T, N>` type, which means that the column's values are *vectors of items of type T, of size N*. This is the recommended way to represent multi-dimensional data associated with every row, like the pixels in an image, or the tokens in a text.
  - The column's *metadata* contains information like the 'slot names' of a vector column and suchlike. The metadata itself is actually represented as another one-row *data view* that is unique to each column.
- The data view is a source of *cursors*. Think SQL cursors: a cursor is an object that iterates through the data, one row at a time, and presents the available data.
  - Naturally, data can have as many active cursors over it as needed: since the data itself is immutable, cursors are truly independent.
  - Note that cursors typically access only a subset of columns: for efficiency, we do not compute the values of columns that are not 'needed' by the cursor.

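To make the cursor model concrete, here is a rough analogy in plain C# (this is *not* the `IDataView` API, just the familiar .NET enumeration pattern it resembles): a data view behaves like an immutable `IEnumerable<T>`, and each cursor like an independent `IEnumerator<T>`.

```c#
using System;
using System.Collections.Generic;
using System.Linq;

class CursorAnalogy
{
    static void Main()
    {
        // The 'data view': an immutable, lazily enumerable sequence.
        IEnumerable<int> dataView = Enumerable.Range(1, 5);

        // Two 'cursors' over the same data: each iterates independently.
        using IEnumerator<int> cursor1 = dataView.GetEnumerator();
        using IEnumerator<int> cursor2 = dataView.GetEnumerator();

        cursor1.MoveNext(); // cursor1 is now at row 1
        cursor1.MoveNext(); // cursor1 is now at row 2
        cursor2.MoveNext(); // cursor2 is still at row 1

        Console.WriteLine($"{cursor1.Current} {cursor2.Current}"); // prints "2 1"
    }
}
```

Just as with enumerators, advancing one cursor has no effect on any other cursor over the same data.
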
## Transformer

A transformer is a component that takes data, does some work on it, and returns new 'transformed' data.

Here's the interface of `ITransformer`:
```c#
public interface ITransformer
{
    IDataView Transform(IDataView input);
    ISchema GetOutputSchema(ISchema inputSchema);
}
```

As you can see, the transformer can `Transform` input data to produce output data. The other method, `GetOutputSchema`, is a mechanism of *schema propagation*: it allows you to see what the output data will look like for a given shape of the input data, without actually performing the transformation.

Most transformers in ML.NET tend to operate on one *input column* at a time and produce one *output column*. For example, a `new HashTransformer("foo", "bar")` would take the values from column "foo", hash them, and put them into column "bar".

It is also common for the input and output column names to be the same. In this case, the old column is 'replaced' with the new one. For example, a `new HashTransformer("foo")` would take the values from column "foo", hash them, and 'put them back' into "foo".

Any transformer will, of course, produce a new data view when `Transform` is called: remember, data views are immutable.

Another important consideration is that, because data is lazily evaluated, *transformers are lazy too*. Essentially, after you call
```c#
var newData = transformer.Transform(oldData);
```
no actual computation happens: only after you get a cursor from `newData` and start consuming values will `newData` invoke the `transformer`'s transformation logic (and even then, only if the `transformer` in question is actually needed to produce the requested columns).

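This deferred execution is the same pattern as LINQ's lazy sequences. A minimal sketch in plain C# (standing in for the real `ITransformer`, not using it):

```c#
using System;
using System.Collections.Generic;
using System.Linq;

class LazyTransformAnalogy
{
    static void Main()
    {
        int callCount = 0;
        IEnumerable<int> oldData = new[] { 1, 2, 3 };

        // 'Transform' merely describes the computation; nothing runs yet.
        IEnumerable<int> newData = oldData.Select(x => { callCount++; return x * 10; });
        Console.WriteLine(callCount); // prints "0": no work has been done

        // Only consuming the data (getting a 'cursor') triggers the work.
        int first = newData.First();
        Console.WriteLine($"{first} {callCount}"); // prints "10 1": only one row was computed
    }
}
```
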
### Transformer chains

A useful property of a transformer is that *you can phrase a sequential application of transformers as yet another transformer*:

```c#
var fullTransformer = transformer1.Append(transformer2).Append(transformer3);
```

We utilize this property a lot in ML.NET: typically, the trained ML.NET model is a 'chain of transformers', which is, for all intents and purposes, a *transformer*.

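The composition property can be illustrated with plain delegates: if a transformer is, in essence, a function from data to data, then chaining two of them yields another function of the same shape. A sketch (this `Append` helper is hypothetical, not ML.NET's actual extension method):

```c#
using System;

class TransformerChainAnalogy
{
    // A 'transformer' reduced to its essence: a function from data to data.
    // Appending two of them produces another function of the same shape.
    static Func<T, T> Append<T>(Func<T, T> first, Func<T, T> second)
        => input => second(first(input));

    static void Main()
    {
        Func<string, string> lowercase = s => s.ToLowerInvariant();
        Func<string, string> trim = s => s.Trim();

        // The chain of two transformers is itself a transformer.
        Func<string, string> fullTransformer = Append(lowercase, trim);

        Console.WriteLine(fullTransformer("  Hello ML.NET  ")); // prints "hello ml.net"
    }
}
```
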
## Data reader

The data reader is an ML.NET component that 'creates' data: it takes an instance of `T` and returns data out of it.

Here's the exact interface of `IDataReader<T>`:
```c#
public interface IDataReader<in TSource>
{
    IDataView Read(TSource input);
    ISchema GetOutputSchema();
}
```
As you can see, the reader is capable of reading data (potentially multiple times, and from different 'inputs'), but the resulting data will always have the same schema, denoted by `GetOutputSchema`.

An interesting property to note is that you can create a new data reader by 'attaching' a transformer to an existing data reader. This way you can have a 'reader' with transformation behavior baked in:
```c#
var newReader = reader.Append(transformer1).Append(transformer2);
```

Another similarity to transformers is that, since data is lazily evaluated, *readers are lazy*: no (or minimal) actual 'reading' happens when you call `dataReader.Read()`; only when a cursor is requested on the resulting data does the reader begin to work.

## Estimator

The *estimator* is an object that learns from data. The result of the learning is a *transformer*.
Here is the interface of `IEstimator<T>`:
```c#
public interface IEstimator<out TTransformer>
    where TTransformer : ITransformer
{
    TTransformer Fit(IDataView input);
    SchemaShape GetOutputSchema(SchemaShape inputSchema);
}
```

You can easily imagine how *a sequence of estimators can be phrased as an estimator* of its own. In ML.NET, we rely on this property to create 'learning pipelines' that chain together different estimators:

```c#
var env = new LocalEnvironment(); // Initialize the ML.NET environment.
var estimator = new ConcatEstimator(env, "Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
    .Append(new ToKeyEstimator(env, "Label"))
    .Append(new SdcaMultiClassTrainer(env, "Features", "Label")) // This is the actual 'machine learning algorithm'.
    .Append(new ToValueEstimator(env, "PredictedLabel"));

var endToEndModel = estimator.Fit(data); // This now contains all the transformers that were used at training.
```

One important property of estimators is that *estimators are eager, not lazy*: every call to `Fit` causes 'learning' to happen, which is potentially a time-consuming operation.

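The estimator/transformer split can be sketched in plain C# with a toy mean-centering 'estimator' (not the real `IEstimator` API): fitting does eager work over the data, and the result is a transformer that carries the learned state.

```c#
using System;
using System.Linq;

class EstimatorAnalogy
{
    // A toy 'Fit': eagerly scans the training data and returns a 'transformer'.
    static Func<double, double> Fit(double[] trainingData)
    {
        double mean = trainingData.Average(); // eager: computed right now, once
        return x => x - mean;                 // the learned state travels with the transformer
    }

    static void Main()
    {
        Func<double, double> centerTransformer = Fit(new[] { 1.0, 2.0, 3.0 });
        Console.WriteLine(centerTransformer(5.0)); // prints "3"
    }
}
```
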
## Prediction function

The prediction function can be seen as a machine that applies a transformer to one 'row', such as at prediction time.

Once we obtain the model (which is a *transformer* that we either trained via `Fit()` or loaded from somewhere), we can use it to make 'predictions' using the normal calls to `model.Transform(data)`. However, when we use this model in a real-life scenario, we often don't have a whole 'batch' of examples to predict on. Instead, we have one example at a time, and we need to make a timely prediction for it immediately.

Of course, we can reduce this to batch prediction:
- Create a data view with exactly one row.
- Call `model.Transform(data)` to obtain the 'predicted data view'.
- Get a cursor over the resulting data.
- Advance the cursor one step to get to the first (and only) row.
- Extract the predicted values out of it.

The above algorithm can be implemented using [schema comprehension](SchemaComprehension.md), with two user-defined objects `InputExample` and `OutputPrediction`, as follows:

```c#
var inputData = env.CreateDataView(new InputExample[] { example });
var outputData = model.Transform(inputData);
var output = outputData.AsEnumerable<OutputPrediction>(env, reuseRowObject: false).Single();
```

But this would be cumbersome, and would incur performance costs.
Instead, we have a 'prediction function' object that performs the same work, but faster and more conveniently, via an extension method `MakePredictionFunction`:

```c#
var predictionFunc = model.MakePredictionFunction<InputExample, OutputPrediction>(env);
var output = predictionFunc.Predict(example);
```

The same `predictionFunc` can (and should!) be used multiple times, thus amortizing the initial cost of the `MakePredictionFunction` call.

The prediction function is *not re-entrant / thread-safe*: if you want to conduct predictions simultaneously with multiple threads, you need to have a prediction function per thread.
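One common way to arrange a per-thread instance is `ThreadLocal<T>` from the base class library. In the sketch below, a plain `object` stands in for the prediction function (creating the real one requires a model and environment); the point is only the per-thread lifetime:

```c#
using System;
using System.Threading;

class PerThreadPrediction
{
    // Stand-in for a prediction function: each thread that touches
    // .Value gets its own instance, created lazily on first use.
    static readonly ThreadLocal<object> PredictionFunc =
        new ThreadLocal<object>(() => new object());

    static void Main()
    {
        object fromThread1 = null, fromThread2 = null;

        var t1 = new Thread(() => fromThread1 = PredictionFunc.Value);
        var t2 = new Thread(() => fromThread2 = PredictionFunc.Value);
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();

        // Two threads, two independent instances: never shared, never contended.
        Console.WriteLine(ReferenceEquals(fromThread1, fromThread2)); // prints "False"
    }
}
```
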