| 1 | +# Schema comprehension in ML.NET |
| 2 | + |
| 3 | +This document describes in detail the under-the-hood mechanism that ML.NET uses to automate the creation of an `IDataView` schema. The goal is to make schema creation as convenient as possible for the end user without incurring extra computational cost. |
| 4 | + |
| 5 | +For a better understanding of the `IDataView` principles and type system, please refer to: |
| 6 | +* [IDataView Design Principles](IDataViewDesignPrinciples.md) |
| 7 | +* [IDataView Type System](IDataViewTypeSystem.md) |
| 8 | + |
| 9 | +## Introduction |
| 10 | + |
| 11 | +Every dataset in ML.NET is represented as an `IDataView`, which is, for the purposes of this document, a collection of rows that share the same columns. The set of columns, their names, types and other metadata is known as the *schema* of the `IDataView`, and it's represented as an `ISchema` object. |
| 12 | + |
| 13 | +In this document, we use the terms *data view* and `IDataView` interchangeably, and likewise *schema* and `ISchema`. |
| 14 | + |
| 15 | +Before any new data enters ML.NET, the user needs to define what the schema of the data will look like. |
| 16 | +To do this, the following questions need to be answered: |
| 17 | +- What are the column names? |
| 18 | +- What are their types? |
| 19 | +- What other metadata is associated with the columns? |
| 20 | + |
| 21 | +These items are very similar to the definition of fields in a C# class: the names and types of columns correspond to the names and types of fields, and metadata can correspond to field attributes. |
| 22 | +Because of this similarity, ML.NET offers a convenient common mechanism for creating a schema: defining a C# class. |
| 23 | + |
| 24 | +For example, the class definition below can be used to define a data view with five `float` columns: |
| 25 | +```C# |
| 26 | +public class IrisData |
| 27 | +{ |
| 28 | + public float Label; |
| 29 | + public float SepalLength; |
| 30 | + public float SepalWidth; |
| 31 | + public float PetalLength; |
| 32 | + public float PetalWidth; |
| 33 | +} |
| 34 | +``` |
| 35 | + |
| 36 | +## Using schema comprehension to make a data view and to read a data view |
| 37 | + |
| 38 | +The first obvious benefit of schema comprehension is that we can create `IDataView`s out of in-memory enumerables of user-defined 'data types' without having to define the schema explicitly. |
| 39 | +It works in the other direction too: you can take an `IDataView` and read it as an `IEnumerable` of a user-defined 'data type' (this fails if the user-provided schema does not match the real schema). |
| 40 | + |
| 41 | +Let's see how we can create a new `IDataView` out of an in-memory array, run some operations on it, and then read it back into an array. |
| 42 | + |
| 43 | +```C# |
| 44 | +public class IrisData |
| 45 | +{ |
| 46 | + public float Label; |
| 47 | + public float SepalLength; |
| 48 | + public float SepalWidth; |
| 49 | + public float PetalLength; |
| 50 | + public float PetalWidth; |
| 51 | +} |
| 52 | + |
| 53 | +public class IrisVectorData |
| 54 | +{ |
| 55 | + public float Label; |
| 56 | + public float[] Features; |
| 57 | +} |
| 58 | + |
| 59 | +static void Main(string[] args) |
| 60 | +{ |
| 61 | + // Here's a data array that we want to work on. |
| 62 | + var dataArray = new[] { |
| 63 | + new IrisData{Label=1, PetalLength=1, SepalLength=1, PetalWidth=1, SepalWidth=1}, |
| 64 | + new IrisData{Label=0, PetalLength=2, SepalLength=2, PetalWidth=2, SepalWidth=2} |
| 65 | + }; |
| 66 | + |
| 67 | + // Create the ML.NET environment. |
| 68 | + var env = new Microsoft.ML.Runtime.Data.TlcEnvironment(); |
| 69 | + |
| 70 | + // Create the data view. |
| 71 | + // This method will use the definition of IrisData to understand what columns there are in the |
| 72 | + // data view. |
| 73 | + var dv = env.CreateDataView<IrisData>(dataArray); |
| 74 | + |
| 75 | + // Now let's do something to the data view. For example, concatenate all four non-label columns |
| 76 | + // into 'Features' column. |
| 77 | + dv = new Microsoft.ML.Runtime.Data.ConcatTransform(env, dv, "Features", |
| 78 | + "SepalLength", "SepalWidth", "PetalLength", "PetalWidth"); |
| 79 | + |
| 80 | + // Read the data into another array. This time we read only the 'Features' and 'Label' columns |
| 81 | + // of the data, and ignore the rest. |
| 82 | + // This method will use the definition of IrisVectorData to understand which columns and of which types |
| 83 | + // are expected to be present in the input data. Note that ToArray() requires 'using System.Linq;'. |
| 84 | + var arr = dv.AsEnumerable<IrisVectorData>(env, reuseRowObject: false) |
| 85 | + .ToArray(); |
| 86 | +} |
| 87 | +``` |
| 88 | +After this code runs, `arr` will contain two `IrisVectorData` objects, each having `Features` filled with the actual values of the features (the 4 concatenated columns). |
| 89 | + |
| 90 | +### Streaming data views |
| 91 | + |
| 92 | +What if the original data doesn't support seeking, for example because it is some form of `IEnumerable<IrisData>` rather than `IList<IrisData>`? In that case, we can use another helper function: |
| 93 | +```C# |
| 94 | +var streamingDv = env.CreateStreamingDataView<IrisData>(dataEnumerable); |
| 95 | +``` |
| 96 | +The only subtle difference is that the resulting `streamingDv` will not support shuffling (a property that is useful to some ML applications). |
| 97 | + |
| 98 | +### AsCursorable and reuseRowObject parameter |
| 99 | + |
| 100 | +When you read a data view with `AsEnumerable<OutType>`, ML.NET creates and populates one object per row. If you do not need multiple row objects to exist in memory (for example, you are writing them to disk one by one as you scan through the `IEnumerable`), you may want to set `reuseRowObject` to `true`. This makes ML.NET create *only one row object for the entire data view* when you enumerate it, and re-populate its values for every row. |
| 101 | + |
| 102 | +In the example above this would lead to incorrect behavior, because the `arr` variable would hold two references to the same `IrisVectorData` object. Please consider carefully whether you want to reuse the row object: it is more efficient, but it can lead to hard-to-find issues. |
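
The aliasing pitfall described above can be reproduced without ML.NET at all. The following is a minimal sketch in plain C#; `Row`, `EnumerateReusing` and `EnumerateFresh` are hypothetical stand-ins for the row class and the two enumeration modes, not part of the ML.NET API.

```C#
using System;
using System.Collections.Generic;
using System.Linq;

public class Row
{
    public float Value;
}

public static class ReuseDemo
{
    // Hypothetical stand-in for enumeration with a reused row object:
    // the same instance is yielded every time, with its field overwritten.
    public static IEnumerable<Row> EnumerateReusing(float[] values)
    {
        var row = new Row();
        foreach (var v in values)
        {
            row.Value = v;
            yield return row;
        }
    }

    // Hypothetical stand-in for the default behavior: a fresh object per row.
    public static IEnumerable<Row> EnumerateFresh(float[] values)
    {
        foreach (var v in values)
            yield return new Row { Value = v };
    }

    public static void Main()
    {
        var values = new[] { 1f, 2f };

        // Both array slots reference the same object, which holds the last row.
        var reused = EnumerateReusing(values).ToArray();
        Console.WriteLine(ReferenceEquals(reused[0], reused[1])); // True
        Console.WriteLine(reused[0].Value);                       // 2

        // With a fresh object per row, each slot keeps its own values.
        var fresh = EnumerateFresh(values).ToArray();
        Console.WriteLine(fresh[0].Value);                        // 1

        // Reuse is safe if you copy what you need before advancing.
        var copied = EnumerateReusing(values).Select(r => r.Value).ToArray();
        Console.WriteLine(string.Join(",", copied));              // 1,2
    }
}
```

The last case shows the pattern that reuse is meant for: each row is consumed (here, copied out) before the enumerator moves on.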
| 103 | + |
| 104 | +Sometimes we don't even want to *populate* the row object for every row. For example, if we only want to see every 100th row of the data, there is no need to populate row objects for the other 99% of rows. In this case, you can use the `AsCursorable<OutType>` method: |
| 105 | + |
| 106 | +```C# |
| 107 | +var cursorable = dv.AsCursorable<IrisVectorData>(env); |
| 108 | +// You can create as many simultaneous cursors as you like, they are independent. |
| 109 | +using (var cursor = cursorable.GetCursor()) |
| 110 | +{ |
| 111 | + // We are now in charge of creating the row object. |
| 112 | + var myRow = new IrisVectorData(); |
| 113 | + while (cursor.MoveNext()) |
| 114 | + { |
| 115 | + if (cursor.Position % 100 == 99) |
| 116 | + { |
| 117 | + // Populate the values of the row object. |
| 118 | + cursor.FillValues(myRow); |
| 119 | + // Do something to the row. |
| 120 | + } |
| 121 | + } |
| 122 | +} |
| 123 | +``` |
| 124 | +Please note that **cursors are not thread-safe**: each cursor holds mutable state and is meant to be used by a single thread. If you want to read the data in parallel, use multiple cursors, one per thread. |
| 125 | + |
| 126 | +## PredictionEngine and PredictorModel |
| 127 | + |
| 128 | +ML.NET's `PredictionEngine` turns a sequence of data transforms (possibly, but not necessarily, capped by a predictor) into a 'black box' that takes strongly typed inputs and returns strongly typed outputs. The name is a little misleading: the `PredictionEngine` object doesn't require a predictor to be present in the pipeline; it can be just a sequence of transforms, as in the example below: |
| 129 | + |
| 130 | +```C# |
| 131 | +var engine = env.CreatePredictionEngine<IrisData, IrisVectorData>(dv); |
| 132 | +var output = engine.Predict(new IrisData { Label = 1, PetalLength = 1, SepalLength = 1, PetalWidth = 1, SepalWidth = 1 }); |
| 133 | +``` |
| 134 | +It is important to note that the `PredictionEngine` actually *validates* that the 'pipeline' conforms to the input and output schema requirements when it is created. |
| 135 | + |
| 136 | +The same can be said about the `PredictorModel<InputType, OutputType>`. This is a somewhat more restricted version of `PredictionEngine` that is created by `LearningPipeline.Train`. |
| 137 | + |
| 138 | +Please note that **`PredictionEngine` and `PredictorModel` are not thread-safe**: they hold an internal cursor object, and therefore cannot be used in a re-entrant fashion. |
| 139 | +If you ever see the error message `An attempt was made to keep iterating after the pipe has been reset`, it most likely means that ML.NET has detected a race condition on the `PredictionEngine`. |
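
One common way to share a non-thread-safe object between threads is to serialize all calls through a lock. The sketch below illustrates that idea in plain C#; `SynchronizedPredictor` is a hypothetical helper, not part of ML.NET, and the lambda in `Main` merely stands in for an engine with unsynchronized internal state.

```C#
using System;
using System.Threading.Tasks;

// Hypothetical helper, not part of ML.NET: funnels every call through one lock.
public sealed class SynchronizedPredictor<TIn, TOut>
{
    private readonly Func<TIn, TOut> _predict;
    private readonly object _gate = new object();

    public SynchronizedPredictor(Func<TIn, TOut> predict)
    {
        if (predict == null)
            throw new ArgumentNullException(nameof(predict));
        _predict = predict;
    }

    public TOut Predict(TIn input)
    {
        // Only one thread at a time may enter the wrapped delegate.
        lock (_gate)
        {
            return _predict(input);
        }
    }
}

public static class SynchronizedDemo
{
    public static void Main()
    {
        // Stand-in for a non-thread-safe engine: it mutates shared state.
        int calls = 0;
        var safe = new SynchronizedPredictor<int, int>(x => { calls++; return x * 2; });

        Parallel.For(0, 1000, i => { safe.Predict(i); });
        Console.WriteLine(calls); // 1000
    }
}
```

Assuming `engine.Predict` has the shape used in the example above, the wrapper could be constructed as `new SynchronizedPredictor<IrisData, IrisVectorData>(engine.Predict)`. Locking trades throughput for safety; creating one engine per thread is the alternative when contention matters.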
| 140 | + |
| 141 | +## Type system mapping |
| 142 | + |
| 143 | +The `IDataView` [type system](IDataViewTypeSystem.md) differs slightly from the C# type system, so a one-to-one mapping between column types and C# types is not always feasible. |
| 144 | +Below are the most notable differences: |
| 145 | + |
| 146 | +* `IDataView` vector columns often have a fixed (and known) size. The C# array type best corresponds to a 'variable-size' vector: one that can have a different number of slots on every row. You can apply the `[VectorType(N)]` attribute to an array field to specify that the column is a vector of fixed size N. This is often necessary: most ML components don't work with variable-size vectors and require fixed-size ones. |
| 147 | +* `IDataView`'s [key types](IDataViewTypeSystem.md#key-types) don't have a natural underlying C# type either. To declare a key-type column, make your field a `uint` and decorate it with `[KeyType]` to denote that the field is a key, and not a regular unsigned integer. |
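
As an illustration of the two attributes just described, a row class might look like the sketch below. The class name `IrisKeyedData` is hypothetical, and the `Microsoft.ML.Runtime.Api` namespace is an assumption about where the attributes live in the ML.NET version this document targets.

```C#
// Assumption: the mapping attributes live in this namespace.
using Microsoft.ML.Runtime.Api;

// Hypothetical row class, for illustration only.
public class IrisKeyedData
{
    // Key-type column: stored as a uint, but marked as a key
    // rather than a regular unsigned integer.
    [KeyType]
    public uint Label;

    // Fixed-size vector column of 4 floats. Without the attribute,
    // this field would map to a variable-size vector.
    [VectorType(4)]
    public float[] Features;
}
```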
| 148 | + |
| 149 | +### Full list of type mappings |
| 150 | +The table below shows which C# types map to which `IDataView` types: |
| 151 | + |
| 152 | +| `IDataView` type | C# type | C# type with extra conversion | |
| 153 | +| ---------------- | ----------- | ------------------------------ | |
| 154 | +| `I1` | `DvInt1` | `sbyte`, `sbyte?` | |
| 155 | +| `I2` | `DvInt2` | `short`, `short?` | |
| 156 | +| `I4` | `DvInt4` | `int`, `int?` | |
| 157 | +| `I8` | `DvInt8` | `long`, `long?` | |
| 158 | +| `U1` | `byte` | `byte?` | |
| 159 | +| `U2` | `ushort` | `ushort?` | |
| 160 | +| `U4` | `uint` | `uint?` | |
| 161 | +| `U8` | `ulong` | `ulong?` | |
| 162 | +| `UG` | `UInt128` | | |
| 163 | +| `R4` | `float` | `float?` | |
| 164 | +| `R8` | `double` | `double?` | |
| 165 | +| `TX` | `DvText`, `string` | | |
| 166 | +| `BL` | `DvBool` | `bool`, `bool?` | |
| 167 | +| `TS` | `DvTimeSpan` | | |
| 168 | +| `DT` | `DvDateTime` | | |
| 169 | +| `DZ` | `DvDateTimeZone` | | |
| 170 | +| Variable-size vector | `VBuffer<T>` | `T[]`, and the vector is always dense | |
| 171 | +| Fixed-size vector | `VBuffer<T>` with `[VectorType(N)]` | `T[]` with `[VectorType(N)]`, and the vector is always dense | |
| 172 | +| Key type | `uint` with `[KeyType]` | | |
| 173 | + |
| 174 | +### Additional attributes to affect type mapping |
| 175 | + |
| 176 | +There are two more attributes that can affect the way ML.NET conducts schema comprehension: |
| 177 | +* `[ColumnName]` lets you choose a different name for the `IDataView` column. By default, it is the same as the field name. |
| 178 | + * This is a way to create or read back an `IDataView` column with a name containing 'invalid' characters (like whitespace). |
| 179 | +* `[NoColumn]` denotes that the field it decorates should not be mapped to a column. |
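
A short sketch of both attributes in use follows. The class `IrisRenamedData`, the `Comment` field and the column name `"Sepal Length"` are made up for illustration, and the `Microsoft.ML.Runtime.Api` namespace is an assumption.

```C#
// Assumption: the mapping attributes live in this namespace.
using Microsoft.ML.Runtime.Api;

// Hypothetical row class, for illustration only.
public class IrisRenamedData
{
    // Mapped to a column named "Label" (the default: same as the field name).
    public float Label;

    // Mapped to a column whose name contains a space,
    // which is not a valid C# identifier.
    [ColumnName("Sepal Length")]
    public float SepalLength;

    // Not mapped to any column.
    [NoColumn]
    public string Comment;
}
```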
| 180 | + |
| 181 | +### Using SchemaDefinition for run-time type mapping hints |
| 182 | + |
| 183 | +As you can see from the table and notes above, certain `IDataView` types can only be denoted with an additional field attribute. If the type parameters are not known at compile time (like the size of the fixed-size vector), attributes alone are not enough, because attribute arguments must be compile-time constants. |
| 184 | + |
| 185 | +You can use a `SchemaDefinition` object to re-map a type to an `IDataView` schema programmatically. It gives you the same capabilities as the attributes, but at run time. |
| 186 | +See the example below. |
| 187 | +```C# |
| 188 | +// Vector size is only known at runtime. |
| 189 | +int numberOfFeatures = 4; |
| 190 | + |
| 191 | +// Create the default schema definition. |
| 192 | +var schemaDef = SchemaDefinition.Create(typeof(IrisVectorData)); |
| 193 | + |
| 194 | +// Specify the right vector size. |
| 195 | +schemaDef["Features"].ColumnType = new VectorType(NumberType.R4, numberOfFeatures); |
| 196 | + |
| 197 | +// Create a data view. |
| 198 | +var dataView = env.CreateDataView<IrisVectorData>(arr, schemaDef); |
| 199 | + |
| 200 | +// Create a prediction engine. You can add custom input and output schema definitions there. |
| 201 | +var predictionEngine = env.CreatePredictionEngine<IrisData, IrisVectorData>(dv, outputSchemaDefinition: schemaDef); |
| 202 | +``` |
| 203 | + |
| 204 | +In addition to the above, you can use `SchemaDefinition` to add per-column metadata: |
| 205 | +```C# |
| 206 | +// Add column metadata. |
| 207 | +schemaDef["Label"].AddMetadata(MetadataUtils.Kinds.HasMissingValues, false); |
| 208 | +``` |
| 209 | + |
| 210 | +## Limitations |
| 211 | + |
| 212 | +Certain things cannot be done at all using schema comprehension, but are possible via the native `IDataView` programmatic interface. |
| 213 | +We made a deliberate design decision not to support these scenarios, in order to simplify the other, more common ones. |
| 214 | + |
| 215 | +Here is the list of things that are only possible via the low-level interface: |
| 216 | +* Creating or reading a data view where even the column *types* are not known at compile time (so you cannot create a C# class to define the schema). |
| 217 | + * This can happen if you write a general-purpose machine learning tool that can ingest different kinds of datasets. |
| 218 | +* Reading a subset of columns that differs from one row to another: the cursor always populates the entire row object. |
| 219 | +* Reading column metadata from the data view. |
| 220 | +* Accessing the 'hidden' data view columns by index. |
| 221 | + * Hidden columns are those that have the same name as other columns and a smaller index. They are not accessible by name. |
| 222 | +* Creating 'cursor sets': this is a feature that lets you iterate over data in multiple parallel threads by splitting the data between multiple 'sibling' cursors. |