# Debugging X10 issues

The X10 accelerator backend can provide significantly higher throughput for graph-based parallel
computation, but its deferred tracing and just-in-time compilation can sometimes lead to non-obvious
behavior. This might include frequent recompilation of traces due to graph or tensor shape changes,
or huge graphs that lead to memory issues during compilation.
| 7 | + |
| 8 | +One way to diagnose issues is to use the execution metrics and counters provided by |
| 9 | +X10. The first thing to check when a model is slow is to generate a metrics |
| 10 | +report. |

# Metrics

To print a metrics report, add a `PrintX10Metrics()` call to your program:

```swift
import TensorFlow

...
PrintX10Metrics()
...
```

This will log various metrics and counters at the `INFO` level.
## Understanding the metrics report

The report includes things like:

- How many times we trigger XLA compilations and the total time spent on
  compilation.
- How many times we launch an XLA computation and the total time spent on
  execution.
- How many device data handles we create / destroy, etc.

This information is reported in terms of percentiles of the samples. An example
is:
```
Metric: CompileTime
  TotalSamples: 202
  Counter: 06m09s401ms746.001us
  ValueRate: 778ms572.062us / second
  Rate: 0.425201 / second
  Percentiles: 1%=001ms32.778us; 5%=001ms61.283us; 10%=001ms79.236us; 20%=001ms110.973us; 50%=001ms228.773us; 80%=001ms339.183us; 90%=001ms434.305us; 95%=002ms921.063us; 99%=21s102ms853.173us
```
| 47 | + |
| 48 | +We also provide counters, which are named integer variables which track internal |
| 49 | +software status. For example: |
| 50 | + |
| 51 | +``` |
| 52 | +Counter: CachedSyncTensors |
| 53 | + Value: 395 |
| 54 | +``` |

## Known caveats

`Tensor`s backed by X10 behave semantically like default eager mode `Tensor`s. However, there are
some performance and completeness caveats:

1. Degraded performance because of too many recompilations.

    XLA compilation is expensive. X10 automatically recompiles the graph every
    time new shapes are encountered, with no user intervention. Models need to
    see stabilized shapes within a few training steps, and from that point on
    no recompilation is needed. Additionally, the execution paths must
    stabilize quickly for the same reason: X10 recompiles when a new execution
    path is encountered. To sum up, in order to avoid recompilations:

    * Avoid highly variable dynamic shapes. However, a low number of different
      shapes could be fine. Pad tensors to fixed sizes when possible.
    * Avoid loops with a different number of iterations between training steps.
      X10 currently unrolls loops, so a different number of loop iterations
      translates into a different (unrolled) execution path.

2. A small number of operations aren't supported by X10 yet.

    We currently have a handful of operations which aren't supported, either
    because there isn't a good way to express them via XLA and static shapes
    (currently just `nonZeroIndices`) or because of a lack of known use cases
    (several linear algebra operations and multinomial initialization). While
    the second category is easy to address as needed, the first category can
    only be addressed through interoperability with the CPU-based, non-XLA
    implementation. Using interoperability too often has significant
    performance implications because of host round-trips and because it
    fragments a fully fused model into multiple traces. Users are therefore
    advised to avoid using such operations in their models.

    On Linux, use `XLA_SAVE_TENSORS_FILE` (documented in the next section) to
    get the Swift stack trace which called the unsupported operation. Function
    names can be manually demangled using `swift-demangle`.
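
The padding advice in caveat 1 can be sketched in plain Swift. This hypothetical helper rounds a sequence length up to the nearest of a few fixed buckets, so that only a small, fixed set of tensor shapes (and therefore compilations) ever reaches X10. The bucket sizes and the helper name are illustrative, not part of the X10 API:

```swift
// Hypothetical helper: round a length up to the nearest bucket so that
// only a handful of distinct shapes (and thus XLA compilations) occur.
func bucketedLength(_ length: Int, buckets: [Int]) -> Int {
    // Pick the smallest bucket that fits; fall back to the largest.
    return buckets.first { $0 >= length } ?? buckets.last!
}

let buckets = [32, 64, 128]
print(bucketedLength(45, buckets: buckets))  // prints 64
```

Sequences would then be padded to the bucketed length before being turned into `Tensor`s, trading some wasted computation on padding for far fewer recompilations.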


# Obtaining and graphing traces

If you suspect there are problems with the way graphs are being traced, or want to understand the
tracing process, tools are provided to log out and visualize traces. You can have X10 log out the
traces it finds by setting the `XLA_SAVE_TENSORS_FILE` environment variable:

```sh
export XLA_SAVE_TENSORS_FILE=/home/person/TraceLog.txt
```

These trace logs come in three formats: `text`, `hlo`, and `dot`, with the format settable through
the `XLA_SAVE_TENSORS_FMT` environment variable:

```sh
export XLA_SAVE_TENSORS_FMT=text
```

When you run your application, the `text` representation that is logged out will show each
individual trace in a high-level text notation used by X10. The `hlo` representation shows the
intermediate representation that is passed to the XLA compiler. You may want to restrict the number
of iterations within your training or calculation loops to prevent these logs from becoming too
large. Also, each run of your application will append to this file, so you may wish to delete it
between runs.

Setting the variable `XLA_LOG_GRAPH_CHANGES` to 1 will also indicate within the trace log where
changes in the graph have occurred. This is extremely helpful in finding places where recompilation
will occur.

For a visual representation of a trace, the `dot` option will log out Graphviz-compatible graphs. If
you extract the portion of a trace that looks like

```
digraph G {
  ...
}
```

into its own file, Graphviz (assuming it is installed) can generate a visual diagram via

```sh
dot -Tpng trace.dot -o trace.png
```

Note that setting the `XLA_SAVE_TENSORS_FILE` environment variable, especially when used in
combination with `XLA_LOG_GRAPH_CHANGES`, will have a substantial negative impact on performance.
Only use these when debugging, and not for regular operation.
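
As a convenience, the tracing variables above can be combined into a small shell fragment for a single debugging session. The log path and the binary name below are illustrative, not required values:

```sh
# Illustrative debugging setup; the log path is an example, not a required location.
export XLA_SAVE_TENSORS_FILE=/tmp/TraceLog.txt
export XLA_SAVE_TENSORS_FMT=dot        # or: text, hlo
export XLA_LOG_GRAPH_CHANGES=1

# ./MyTrainingBinary                   # hypothetical program, run with tracing enabled

# Unset afterwards so regular runs are not slowed down.
unset XLA_SAVE_TENSORS_FILE XLA_SAVE_TENSORS_FMT XLA_LOG_GRAPH_CHANGES
```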

# Additional environment variables

Additional environment variables for debugging include:

* `XLA_USE_BF16`: If set to 1, transforms all the `Float` values to BF16.
  Should only be used for debugging, since we offer automatic mixed precision.

* `XLA_USE_32BIT_LONG`: If set to 1, maps the S4TF `Long` type to the XLA
  32-bit integer type. On TPU, 64-bit integer computations are expensive, so
  setting this flag might help. Of course, the user needs to be certain that
  the values still fit in a 32-bit integer.