
Commit 350a3f5

Author: Mingsheng Hong (committed)
Part 4 of cross-device sends/recvs support: Added initial support for SIL (#17165)
Part 4 of cross-device sends/recvs support: Added initial support for SIL accelerator function partitioning based on TF devices {CPU, GPU}, including control flow.

Summary of changes:

1. Extended DeviceType with an ALL enum value, indicating that an associated instruction runs on all devices involved in the TF computation. For example, promoted scalars run on ALL devices. Also, for ease of control flow handling, BB args are present on ALL devices. The exception is the function input arguments, which are only present in the primary device function (recall the primary function is the partitioned function that runs on the target device given by TensorFlow.enableGPU(), TensorFlow.enableTPU(), or a default policy), while the helper functions take no input or output tensors.

2. Added a new pass, DevicePartitioner, that sits between the PartitionerCloner pass in TFPartition and the TFGraphLowering pass in TFLowerGraph. It has two phases:

   - In the analysis/mark phase, it inserts instructions for cross-device tensor sends/recvs, represented by "__tfop_tfc.TensorTransfer" builtins. For example, when tensor x is produced on device D1 and then consumed by tensor op foo() on device D2, it inserts a "__tfop_tfc.TensorTransfer" builtin right before foo() to send that tensor from D1 to D2. This builtin maintains the invariant that for any instruction I running on some device D, every operand OP of I must be present on D (either because OP is produced on D, or because it is transferred via this builtin). When the tf-dump-graph flag is on, the output SIL of this phase is dumped under a header like:

     --- TFDevicePartition Cross Device Tensor Transfer Annotation Result: $S3tmp10testScalar1fySf_tF.tf

   - In the partitioning phase (DevicePartitionCloner), it extracts all instructions related to a given target device D into a new SIL function, to be lowered by TFGraphLowering. For a "__tfop_tfc.TensorTransfer" builtin:
     - If D is its source/send device, the builtin gets lowered to a TF _Send op in the CPU/GPU device context, via a "__tfop_tfc.D2DTensorSend" builtin.
     - If D is its dest/recv device, the builtin gets lowered to a TF _Recv op in the CPU/GPU device context, via a "__tfop_tfc.D2DTensorRecv" builtin.

     For control flow support, each partitioned, device-specific SIL function produced by DevicePartitionCloner retains all basic blocks from the input accelerator SIL function, along with their BB args. When the tf-dump-graph flag is on, the output of this phase is dumped under a header like:

     --- TFDevicePartition Per-Device Function Extraction Result: $S3tmp10testScalar1fySf_tF.tf_CPU.device_partition

3. Extended the TFGraphLowering pass to turn D2DTensorSend/D2DTensorRecv into TF _Send and _Recv nodes. These nodes work on CPU and GPU. In the TPU device context, the above can be lowered to infeed/outfeed or HostCompute; this is to be explored later.

4. Also upgraded the "tensorflowSend" and "tensorflowReceive" built-ins to "tfc.SendToHost" and "tfc.RecvFromHost" builtins, with proper tfop attributes representing the tensor transfer id and the send/recv devices.
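The dump headers above reference the mangled name $S3tmp10testScalar1fySf_tF, i.e. a testScalar(f: Float) function in a module named tmp. Below is a minimal, hypothetical Swift sketch of that kind of test case, assuming the TensorFlow.enableGPU() entry point mentioned above; it is not the test from this commit, only an illustration of code that would exercise a GPU primary device function, a promoted scalar on ALL devices, and a tensor result sent back to the host:

import TensorFlow

public func testScalar(f: Float) {
  TensorFlow.enableGPU()      // GPU becomes the primary target device
  var x = Tensor<Float>(f)    // input tensor: present only in the primary device function
  x += Tensor<Float>(1)       // the constant is a promoted scalar, present on ALL devices
  print(x)                    // the result is transferred back to the host program
}

Compiling such a function with the tf-dump-graph flag enabled would print the two per-phase dumps named above: one for the cross-device transfer annotation phase, and one per extracted device function.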
1 parent 9d0bb15 commit 350a3f5

File tree

10 files changed: +1308 −167 lines


lib/SILOptimizer/Mandatory/CMakeLists.txt

+1
@@ -21,6 +21,7 @@ set(MANDATORY_SOURCES
   Mandatory/TFCanonicalizeCFG.cpp
   Mandatory/TFConstExpr.cpp
   Mandatory/TFDeabstraction.cpp
+  Mandatory/TFDevicePartition.cpp
   Mandatory/TFLowerGraph.cpp
   Mandatory/TFPartition.cpp
   Mandatory/TFUtilities.cpp
