
Commit 9c03a1c

Update IDataView principles, type system documentation. (#3288)
1 parent c601b77 commit 9c03a1c

5 files changed: +442, -437 lines
docs/code/IDataViewDesignPrinciples.md

Lines changed: 106 additions & 76 deletions
@@ -12,9 +12,9 @@ directly address distributed data and computation, but is suitable for single
 node processing of data partitions belonging to larger distributed data sets.
 
 IDataView is the data pipeline machinery for ML.NET. Microsoft teams consuming
-this library have implemented libraries of IDataView related components
-(loaders, transforms, savers, trainers, predictors, etc.) and have validated
-the performance, scalability and task flexibility benefits.
+this library have implemented libraries of IDataView related components (data
+loaders, transformers, estimators, trainers, etc.) and have validated the
+performance, scalability and task flexibility benefits.
 
 The name IDataView was inspired from the database world, where the term table
 typically indicates a mutable body of data, while a view is the result of a
@@ -128,10 +128,9 @@ The IDataView system design does *not* include the following:
   view.
 
 * **Efficient indexed row access**: There is no standard way in the IDataView
-  system to request the values for a specific row number. While the
-  `IRowCursor` interface has a `MoveMany(long count)` method, it only supports
-  moving forward `(count > 0)`, and is not necessarily more efficient than
-  calling `MoveNext()` repeatedly. See [here](#row-cursor).
+  system to request the values for a specific row number. Rather, like
+  enumerators, one can only move forward with `MoveNext()`. See
+  [here](#row-cursor) for more information.
 
 * **Data file formats**: The IDataView system does not dictate storage or
   transport formats. It *does* include interfaces for loader and saver
@@ -167,15 +166,15 @@ there is a precisely defined set of standard types including:
 * Single and Double precision floating point
 * Signed integer values using 1, 2, 4, or 8 bytes
 * Unsigned integer values using 1, 2, 4, or 8 bytes
-* Unsigned 16 byte values for ids and probabilistically unique hashes
+* Values for ids and probabilistically unique hashes, using 16 bytes
 * Date time, date time zone, and timespan
 * Key types
 * Vector types
 
 The set of standard types will likely be expanded over time.
 
-The IDataView type system is specified in a separate document, *IDataView Type
-System Specification*.
+The IDataView type system is specified in a separate document, [IDataView Type
+System Specification](IDataViewTypeSystem.md).
 
 IDataView provides a general mechanism for associating semantic annotations with
 columns, such as designating sets of score columns, names associated with the
@@ -235,56 +234,85 @@ with a key type.
 ## Components
 
 The IDataView system includes several standard kinds of components and the
-ability to compose them to produce efficient data pipelines. A loader
-represents a data source as an `IDataView`. A transform is applied to an
-`IDataView` to produce a derived `IDataView`. A saver serializes the data
-produced by an `IDataView` to a stream, in some cases in a format that can be
-read by a loader. There are other more specific kinds of components defined
-and used by the ML.NET code base, for example, scorers, evaluators, joins, and
-caches. While there are several standard kinds of components, the set of
-component kinds is open.
-
-### Transforms
-
-Transforms are a foundational kind of IDataView component. Transforms take an
-IDataView as input and produce an IDataView as output. Many transforms simply
-"add" one or more computed columns to their input schema. More precisely,
-their output schema includes all the columns of the input schema, plus some
-additional columns, whose values are computed from some of the input column
-values. It is common for an added column to have the same name as an input
-column, in which case, the added column hides the input column. Both the
-original column and new column are present in the output schema and available
-for downstream components (in particular, savers and diagnostic tools) to
-inspect. For example, a normalization transform may, for each slot of a
-vector-valued column named Features, apply an offset and scale factor and
-bundle the results in a new vector-valued column, also named Features. From
-the user's perspective (which is entirely based on column names), the Features
-column was "modified" by the transform, but the original values are available
-downstream via the hidden column.
-
-Some transforms require training, meaning that their precise behavior is
-determined automatically from some training data. For example, normalizers and
-dictionary-based mappers, such as the TermTransform, build their state from
-training data. Training occurs when the transform is instantiated from user-
-provided parameters. Typically, the transform behavior is later serialized.
-When deserialized, the transform is not retrained; its behavior is entirely
-determined by the serialized information.
+ability to compose them to produce efficient data pipelines, centered on
+estimators and transformers. This language is derived from a similar idiom in
+[Spark](https://spark.apache.org/).
+
+A data loader allows data sources to be read as an `IDataView`. A transformer
+is applied via its `Transform` method to an `IDataView` to produce a derived
+`IDataView`. A saver serializes the data produced by an `IDataView` to a
+stream, in some cases in a format that can be read by a loader. There are
+other more specific kinds of components defined and used by the ML.NET code
+base, for example, scorers, evaluators, joins, and caches, but most of these
+are internal. While there are several standard kinds of components, the set
+of component kinds is open. In the following sections we discuss the most
+important kinds of components in the public API of ML.NET.
+
+### Transformers
+
+Transformers are a foundational kind of IDataView component. They have two
+primary responsibilities, from a user's point of view.
+
+First, as the name suggests, the primary method is `ITransformer.Transform`,
+which takes an `IDataView` as input and produces an `IDataView` as output.
+Many transformers simply "add" one or more computed columns to their input
+schema. More precisely, their output schema includes all the columns of the
+input schema, plus some additional columns, whose values are computed from
+some of the input column values. It is common for an added column to have the
+same name as an input column, in which case, the added column hides the input
+column. Both the original column and new column are present in the output
+schema and available for downstream components (in particular, savers and
+diagnostic tools) to inspect. For example, a data view that comes from a
+normalization transformer may, for each slot of a vector-valued column named
+`"Features"`, apply an offset and scale factor and bundle the results in a
+new vector-valued column, also named `"Features"`. From the user's
+perspective (which is entirely based on column names), the `"Features"`
+column was "modified" by the transformer, but the original values are
+available downstream via the hidden column.
+
+Second, transformers, being identified as central to our concept of a
+"model," are serializable. When deserialized, a transformer should behave
+identically to the transformer that was serialized.
+
+### Estimators
+
+Many transformers require training, meaning that their precise behavior is
+determined from some training data. For example, normalizers and
+dictionary-based mappers, such as the `ValueToKeyMappingTransformer`, build
+their state from training data.
+
+This training occurs through another structure, generally parallel to
+`ITransformer`, called `IEstimator`. An estimator is configured with user
+parameters, and its `Fit` method returns a trained `ITransformer`. For
+example, `NormalizingEstimator` implements
+`IEstimator<NormalizingTransformer>`, so the return value of `Fit` is a
+`NormalizingTransformer`.
 
 ### Composition Examples
 
 Multiple primitive transforms may be applied to achieve higher-level
-semantics. For example, ML.NET's `CategoricalTransform` is the composition of
-two more primitive transforms, `TermTransform`, which maps each term to a key
-value via a dictionary, and `KeyToVectorTransform`, which maps from key value
-to indicator vector. Similarly, `CategoricalHashTransform` is the composition
-of `HashTransform`, which maps each term to a key value via hashing, and
-`KeyToVectorTransform`.
-
-Similarly, `WordBagTransform` and `WordHashBagTransform` are each the
-composition of three transforms. `WordBagTransform` consists of
-`WordTokenizeTransform`, `TermTransform`, and `NgramTransform`, while
-`WordHashBagTransform` consists of `WordTokenizeTransform`, `HashTransform`,
-and `NgramHashTransform`.
+semantics. For example, ML.NET's `OneHotEncodingTransformer` is the
+composition of two more primitive transforms, `ValueToKeyMappingTransformer`,
+which maps each term to a key value via a dictionary, and
+`KeyToVectorMappingTransformer`, which maps from key value to indicator
+vector. Similarly, `OneHotHashEncodingTransformer` is the composition of
+`HashingTransformer`, which maps each term to a key value via hashing, and
+`KeyToVectorMappingTransformer`.
+
+### Schema Propagation
+
+Because fitting an estimator or transforming data is often an extremely
+expensive proposition, it is useful for a pipeline to know *what* it will
+produce, and to perform at least some preliminary validation that the
+pipeline can actually work, before doing that expensive work.
+
+For this reason, both `ITransformer` and `IEstimator` have a
+`GetOutputSchema` method, operating on `DataViewSchema` and `SchemaShape`,
+respectively. In this way, a pipeline can be checked programmatically, at
+least to some extent, before considerable time is wasted in an ultimately
+futile call to `Fit` because "downstream" components were configured with the
+wrong types, wrong names, or some other issue.
 
 ## Cursoring

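The estimator/transformer idiom in the hunk above can be sketched as follows. This is an illustrative Python analogue under assumed, hypothetical names; ML.NET's actual API is the C# `IEstimator<TTransformer>`/`ITransformer` pair.

```python
# Illustrative sketch of the estimator/transformer idiom (hypothetical
# Python names, not the ML.NET C# API).

class NormalizingTransformer:
    """A fitted, immutable transformer: maps a data view to a data view."""

    def __init__(self, offset, scale):
        self.offset = offset
        self.scale = scale

    def get_output_schema(self, input_schema):
        # Schema propagation: the output schema is computable without
        # touching any data (same columns here; only values are rescaled).
        return input_schema

    def transform(self, rows):
        # Lazily produce derived rows, normalizing the "Features" column.
        for row in rows:
            row = dict(row)
            row["Features"] = [(x - self.offset) * self.scale
                               for x in row["Features"]]
            yield row


class NormalizingEstimator:
    """An unfitted estimator: fit scans training data and returns a
    transformer whose behavior is fully determined by that training."""

    def fit(self, rows):
        values = [x for row in rows for x in row["Features"]]
        lo, hi = min(values), max(values)
        scale = 1.0 / (hi - lo) if hi != lo else 1.0
        return NormalizingTransformer(offset=lo, scale=scale)


data = [{"Features": [0.0, 2.0]}, {"Features": [4.0, 8.0]}]
normalizer = NormalizingEstimator().fit(data)   # training happens in fit
out = list(normalizer.transform(data))          # pure, repeatable mapping
```

Once fitted, the transformer carries all of its state, so serializing and reloading it (as the section above describes) would not require retraining.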
@@ -294,8 +322,7 @@ To access the data in a view, one gets a row cursor from the view by calling
 the `GetRowCursor` method. The row cursor is a movable window onto a single
 row of the view, known as the current row. The row cursor provides the column
 values of the current row. The `MoveNext()` method of the cursor advances to
-the next row. There is also a `MoveMany(long count)` method, which is
-semantically equivalent to calling `MoveNext()` repeatedly, `count` times.
+the next row.
 
 Note that a row cursor is not thread safe; it should be used in a single
 execution thread. However, multiple cursors can be active simultaneously on
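The forward-only cursoring described above can be sketched with a hypothetical Python analogue of `GetRowCursor`/`MoveNext()` (the real interfaces are C#; names here are illustrative):

```python
# Hypothetical Python analogue of the row cursor pattern: a movable,
# forward-only window onto the rows of a view.

class RowCursor:
    def __init__(self, rows):
        self._it = iter(rows)
        self.current = None   # the "current row"
        self.position = -1    # positioned before the first row initially

    def move_next(self):
        # Advance to the next row. There is no way to seek backwards or
        # to jump to an arbitrary row number.
        try:
            self.current = next(self._it)
            self.position += 1
            return True
        except StopIteration:
            return False

rows = [{"Label": 0}, {"Label": 1}, {"Label": 1}]
cursor = RowCursor(rows)
labels = []
while cursor.move_next():
    labels.append(cursor.current["Label"])
```

Per the thread-safety note above, such a cursor would be confined to one thread, while several independent cursors may be active on the same view at once.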
@@ -318,10 +345,11 @@ column and row directions.
 A row cursor has a set of active columns, determined by arguments passed to
 `GetRowCursor`. Generally, the cursor, and any upstream components, will only
 perform computation or data movement necessary to provide values of the active
-columns. For example, when `TermTransform` builds its term dictionary from its
-input `IDataView`, it gets a row cursor from the input view with only the term
-column active. Any data loading or computation not required to materialize the
-term column is avoided. This is lazy computation in the column direction.
+columns. For example, when `ValueToKeyMappingTransformer` builds its term
+dictionary from its input `IDataView`, it gets a row cursor from the input
+view with only the term column active. Any data loading or computation not
+required to materialize the term column is avoided. This is lazy computation
+in the column direction.
 
 Generally, creating a row cursor is a very cheap operation. The expense is in
 the data movement and computation required to iterate over the rows. If a
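The column-direction laziness described in the hunk above can be sketched as follows (hypothetical code, not ML.NET's implementation): a derived view only evaluates a computed column when it is in the active set.

```python
# Sketch of lazy computation in the column direction: inactive computed
# columns are never evaluated. Hypothetical names throughout.

evaluated = []   # records which computed columns actually ran

def scan(rows, computed_columns, active):
    """Yield rows restricted to the active columns; computed columns that
    are not active are never evaluated."""
    for row in rows:
        out = {}
        for name in active:
            if name in computed_columns:
                evaluated.append(name)
                out[name] = computed_columns[name](row)
            else:
                out[name] = row[name]
        yield out

source = [{"Text": "red fish"}, {"Text": "blue fish"}]
computed = {
    "Tokens": lambda r: r["Text"].split(),
    "Length": lambda r: len(r["Text"]),   # never runs while inactive
}
# Only the "Tokens" column is active, so "Length" is never computed.
tokens = [r["Tokens"] for r in scan(source, computed, active={"Tokens"})]
```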
@@ -360,17 +388,19 @@ encourage parallel execution. If the view is a transform that can benefit from
 parallelism, it requests from its input view, not just a cursor, but a cursor
 set. If that view is a transform, it typically requests from its input view a
 cursor set, etc., on up the transformation chain. At some point in the chain
-(perhaps at a loader), a component, called the splitter, determines how many
-cursors should be active, creates those cursors, and returns them together
-with a consolidator object. At the other end, the consolidator is invoked to
-marshal the multiple cursors back into a single cursor. Intervening levels
-simply create a cursor on each input cursor, return that set of cursors as
-well as the consolidator.
-
-The ML.NET code base includes transform base classes that implement the
-minimal amount of code required to support this batch parallel cursoring
-design. Consequently, most transform implementations do not have any special
-code to support batch parallel cursoring.
+(perhaps at a loader), a component determines how many cursors should be
+active, creates those cursors, and returns them. These cursors can either be
+processed independently on different threads, or an internal utility method
+can be invoked to marshal the multiple cursors back into a single cursor.
+Intervening levels simply create a cursor on each input cursor and return
+that set of cursors.
+
+The ML.NET code base includes internal `IDataView` implementations that
+implement the minimal amount of code required to support this batch parallel
+cursoring design, most notably the `IDataView` implementations returned from
+`ITransformer` implementations that are also one-to-one mappers.
+Consequently, most transformer implementations do not have any special code
+to support batch parallel cursoring.
 
 ### Memory Efficiency
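The batch-parallel cursoring scheme in the hunk above can be sketched as follows. This is a hypothetical illustration of the split/consolidate idea, not ML.NET's internal utilities:

```python
# Sketch of batch-parallel cursoring: a source deals rows across several
# "cursors" (modeled as lists), each of which can be processed
# independently, and a consolidating step marshals the results back into a
# single ordered stream. Hypothetical code throughout.

def split(rows, n):
    """Deal rows round-robin into n independent cursors."""
    buckets = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        buckets[i % n].append(row)
    return buckets

def consolidate(cursor_set):
    """Round-robin merge that restores the original dealing order."""
    result = []
    iters = [iter(c) for c in cursor_set]
    alive = True
    while alive:
        alive = False
        for it in iters:
            try:
                result.append(next(it))
                alive = True
            except StopIteration:
                pass
    return result

rows = list(range(5))
cursors = split(rows, 2)
# Each cursor could be consumed on its own thread; here, a simple map
# stands in for per-cursor transform work.
mapped = [[x * 10 for x in cursor] for cursor in cursors]
merged = consolidate(mapped)
```

Because each intervening level only wraps each input cursor one-to-one, the parallelism decided at the bottom of the chain propagates through the whole pipeline for free, which is why most transformers need no special code.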

@@ -415,9 +445,9 @@ the random number generator. Serving rows from disk in a random order is quite
 difficult to do efficiently (without seeking for each row). The binary .idv
 loader has some shuffling support, favoring performance over attempting to
 provide a uniform distribution over the permutation space. This level of
-support has been validated to be sufficient for machine learning goals (for example,
-in recent work on SA-SDCA algorithm). When the data is all in memory, as it is
-when cached, randomizing is trivial.
+support has been validated to be sufficient for machine learning goals (for
+example, in recent work on SA-SDCA algorithm). When the data is all in memory,
+as it is when cached, randomizing is trivial.
 
 ## Appendix: Comparison with LINQ
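A bounded shuffle pool is one common way to trade permutation uniformity for seek-free streaming reads, as the shuffling discussion above describes. The sketch below is an assumed illustration of that general technique, not the actual .idv loader implementation:

```python
# Illustrative pool-based approximate shuffle (an assumption for
# illustration, not ML.NET's .idv loader): fill a bounded pool, then for
# each incoming row emit a randomly chosen pool element and replace it.
# Rows are read strictly in order, so no seeking is required.

import random

def pool_shuffle(rows, pool_size, seed=0):
    rng = random.Random(seed)
    pool = []
    for row in rows:
        if len(pool) < pool_size:
            pool.append(row)       # warm-up: fill the pool
        else:
            i = rng.randrange(pool_size)
            pool[i], row = row, pool[i]   # swap in, emit the displaced row
            yield row
    rng.shuffle(pool)              # drain the remaining pool at the end
    yield from pool

shuffled = list(pool_shuffle(range(10), pool_size=4, seed=42))
```

A larger pool gets closer to a uniform permutation at the cost of memory; with the data fully cached in memory, a true shuffle is trivial instead.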
