# Debugging X10 issues

The X10 accelerator backend can provide significantly higher throughput for graph-based parallel
computation, but its deferred tracing and just-in-time compilation can sometimes lead to non-obvious
behavior. This might include frequent recompilation of traces due to graph or tensor shape changes,
or huge graphs that lead to memory issues during compilation.
| 7 | + |
| 8 | +One way to diagnose issues is to use the execution metrics and counters provided by |
| 9 | +X10. The first thing to check when a model is slow is to generate a metrics |
| 10 | +report. |

# Metrics

To print a metrics report, add a `PrintX10Metrics()` call to your program:

```swift
import TensorFlow

...
PrintX10Metrics()
...
```

This will log various metrics and counters at the `INFO` level.
## Understanding the metrics report

The report includes things like:

- How many times we trigger XLA compilations and the total time spent on
  compilation.
- How many times we launch an XLA computation and the total time spent on
  execution.
- How many device data handles we create / destroy, etc.

This information is reported in terms of percentiles of the samples. An example
is:
```
Metric: CompileTime
  TotalSamples: 202
  Counter: 06m09s401ms746.001us
  ValueRate: 778ms572.062us / second
  Rate: 0.425201 / second
  Percentiles: 1%=001ms32.778us; 5%=001ms61.283us; 10%=001ms79.236us; 20%=001ms110.973us; 50%=001ms228.773us; 80%=001ms339.183us; 90%=001ms434.305us; 95%=002ms921.063us; 99%=21s102ms853.173us
```
| 47 | + |
| 48 | +We also provide counters, which are named integer variables which track internal |
| 49 | +software status. For example: |
| 50 | + |
| 51 | +``` |
| 52 | +Counter: CachedSyncTensors |
| 53 | + Value: 395 |
| 54 | +``` |

## Known caveats

`Tensor`s backed by X10 behave semantically like default eager mode `Tensor`s. However, there are
some performance and completeness caveats:

1. Degraded performance because of too many recompilations.

    XLA compilation is expensive. X10 automatically recompiles the graph every
    time new shapes are encountered, with no user intervention. Models need to
    see stabilized shapes within a few training steps, and from that point on
    no recompilation is needed. Additionally, the execution paths must
    stabilize quickly for the same reason: X10 recompiles when a new execution
    path is encountered. To sum up, in order to avoid recompilations:

    * Avoid highly variable dynamic shapes. However, a low number of different
      shapes could be fine. Pad tensors to fixed sizes when possible.
    * Avoid loops with a different number of iterations between training steps.
      X10 currently unrolls loops, so a different number of loop iterations
      translates into a different (unrolled) execution path.

2. A small number of operations aren't supported by X10 yet.

    We currently have a handful of operations which aren't supported, either
    because there isn't a good way to express them via XLA and static shapes
    (currently just `nonZeroIndices`) or because of a lack of known use cases
    (several linear algebra operations and multinomial initialization). While
    the second category is easy to address as needed, the first category can
    only be addressed through interoperability with the CPU-based, non-XLA
    implementation. Using interoperability too often has significant
    performance implications because of host round-trips and because it
    fragments a fully fused model into multiple traces. Users are therefore
    advised to avoid using such operations in their models.

    On Linux, use `XLA_SAVE_TENSORS_FILE` (documented in the next section) to
    get the Swift stack trace which called the unsupported operation. Function
    names can be manually demangled using `swift-demangle`.
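
The padding advice in caveat 1 can be sketched in plain Swift. This hypothetical helper rounds a sequence length up to the nearest of a few fixed buckets, so that only a small, fixed set of tensor shapes (and therefore compilations) ever reaches X10. The bucket sizes and the helper name are illustrative, not part of the X10 API:

```swift
// Hypothetical helper: round a length up to the nearest bucket so that
// only a handful of distinct shapes (and thus XLA compilations) occur.
func bucketedLength(_ length: Int, buckets: [Int]) -> Int {
    // Pick the smallest bucket that fits; fall back to the largest.
    return buckets.first { $0 >= length } ?? buckets.last!
}

let buckets = [32, 64, 128]
print(bucketedLength(45, buckets: buckets))  // prints 64
```

Sequences would then be padded to the bucketed length before being turned into `Tensor`s, trading some wasted computation on padding for far fewer recompilations.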


# Obtaining and graphing traces

If you suspect there are problems with the way graphs are being traced, or want to understand the
tracing process, tools are provided to log out and visualize traces. You can have X10 log out the
traces it finds by setting the `XLA_SAVE_TENSORS_FILE` environment variable:

```sh
export XLA_SAVE_TENSORS_FILE=/home/person/TraceLog.txt
```

These trace logs come in three formats: `text`, `hlo`, and `dot`, with the format settable through
the `XLA_SAVE_TENSORS_FMT` environment variable:

```sh
export XLA_SAVE_TENSORS_FMT=text
```

When you run your application, the `text` representation that is logged out will show each
individual trace in a high-level text notation used by X10. The `hlo` representation shows the
intermediate representation that is passed to the XLA compiler. You may want to restrict the number
of iterations within your training or calculation loops to prevent these logs from becoming too
large. Also, each run of your application will append to this file, so you may wish to delete it
between runs.

Setting the variable `XLA_LOG_GRAPH_CHANGES` to 1 will also indicate within the trace log where
changes in the graph have occurred. This is extremely helpful in finding places where recompilation
will occur.

For a visual representation of a trace, the `dot` option will log out Graphviz-compatible graphs. If
you extract the portion of a trace that looks like

```
digraph G {
  ...
}
```

into its own file, Graphviz (assuming it is installed) can generate a visual diagram via

```sh
dot -Tpng trace.dot -o trace.png
```

Note that setting the `XLA_SAVE_TENSORS_FILE` environment variable, especially when used in
combination with `XLA_LOG_GRAPH_CHANGES`, will have a substantial negative impact on performance.
Only use these when debugging, and not for regular operation.
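
As a convenience, the tracing variables above can be combined into a small shell fragment for a single debugging session. The log path and the binary name below are illustrative, not required values:

```sh
# Illustrative debugging setup; the log path is an example, not a required location.
export XLA_SAVE_TENSORS_FILE=/tmp/TraceLog.txt
export XLA_SAVE_TENSORS_FMT=dot        # or: text, hlo
export XLA_LOG_GRAPH_CHANGES=1

# ./MyTrainingBinary                   # hypothetical program, run with tracing enabled

# Unset afterwards so regular runs are not slowed down.
unset XLA_SAVE_TENSORS_FILE XLA_SAVE_TENSORS_FMT XLA_LOG_GRAPH_CHANGES
```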

# Additional environment variables

Additional environment variables for debugging include:

* `XLA_USE_BF16`: If set to 1, transforms all the `Float` values to BF16.
  Should only be used for debugging, since we offer automatic mixed precision.

* `XLA_USE_32BIT_LONG`: If set to 1, maps the S4TF `Long` type to the XLA
  32-bit integer type. On TPU, 64-bit integer computations are expensive, so
  setting this flag might help. Of course, the user needs to be certain that
  the values still fit in a 32-bit integer.