
Commit f9059e0

chore: resolve merge conflicts
Signed-off-by: Dheeraj Peri <[email protected]>
2 parents d50498d + e07687d commit f9059e0

File tree: 266 files changed (+7462 -1752 lines)


.circleci/config.yml (+755 -53)

Large diffs are not rendered by default.

.github/code-owners.yml (+1 -1)

@@ -110,7 +110,7 @@
 - "peri044"
 - "bowang007"

-"component: docker":
+"channel: docker":
 - "andi4191"
 - "narendasan"


.gitignore (+3)

@@ -62,3 +62,6 @@ bazel-Torch-TensorRT-Preview
 docsrc/src/
 bazel-TensorRT
 bazel-tensorrt
+.pytest_cache
+*.cache
+*cifar-10-batches-py*

README.md (+8 -7)

@@ -2,13 +2,14 @@

 [![Documentation](https://img.shields.io/badge/docs-master-brightgreen)](https://nvidia.github.io/Torch-TensorRT/)

-> Ahead of Time (AOT) compiling for PyTorch JIT
+> Ahead of Time (AOT) compiling for PyTorch JIT and FX

-Torch-TensorRT is a compiler for PyTorch/TorchScript, targeting NVIDIA GPUs via NVIDIA's TensorRT Deep Learning Optimizer and Runtime. Unlike PyTorch's Just-In-Time (JIT) compiler, Torch-TensorRT is an Ahead-of-Time (AOT) compiler, meaning that before you deploy your TorchScript code, you go through an explicit compile step to convert a standard TorchScript program into an module targeting a TensorRT engine. Torch-TensorRT operates as a PyTorch extention and compiles modules that integrate into the JIT runtime seamlessly. After compilation using the optimized graph should feel no different than running a TorchScript module. You also have access to TensorRT's suite of configurations at compile time, so you are able to specify operating precision (FP32/FP16/INT8) and other settings for your module.
+Torch-TensorRT is a compiler for PyTorch/TorchScript/FX, targeting NVIDIA GPUs via NVIDIA's TensorRT Deep Learning Optimizer and Runtime. Unlike PyTorch's Just-In-Time (JIT) compiler, Torch-TensorRT is an Ahead-of-Time (AOT) compiler, meaning that before you deploy your TorchScript code, you go through an explicit compile step to convert a standard TorchScript or FX program into an module targeting a TensorRT engine. Torch-TensorRT operates as a PyTorch extention and compiles modules that integrate into the JIT runtime seamlessly. After compilation using the optimized graph should feel no different than running a TorchScript module. You also have access to TensorRT's suite of configurations at compile time, so you are able to specify operating precision (FP32/FP16/INT8) and other settings for your module.

 Resources:
 - [Documentation](https://nvidia.github.io/Torch-TensorRT/)
-- [Torch-TensorRT Explained in 2 minutes!](https://www.youtube.com/watch?v=TU5BMU6iYZ0&ab_channel=NVIDIADeveloper)
+- [FX path Documentation](https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst)
+- [Torch-TensorRT Explained in 2 minutes!](https://www.youtube.com/watch?v=TU5BMU6iYZ0&ab_channel=NVIDIADeveloper)
 - [Comprehensive Discusion (GTC Event)](https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31107/)
 - [Pre-built Docker Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). To use this container, make an NGC account and sign in to NVIDIA's registry with an API key. Refer to [this guide](https://docs.nvidia.com/ngc/ngc-catalog-user-guide/index.html#registering-activating-ngc-account) for the same.


@@ -111,10 +112,10 @@ torch.jit.save(trt_ts_module, "trt_torchscript_module.ts") # save the TRT embedd
 These are the following dependencies used to verify the testcases. Torch-TensorRT can work with other versions, but the tests are not guaranteed to pass.

 - Bazel 5.1.1
-- Libtorch 1.11.0 (built with CUDA 11.3)
+- Libtorch 1.12.0 (built with CUDA 11.3)
 - CUDA 11.3
-- cuDNN 8.2.1
-- TensorRT 8.2.4.2
+- cuDNN 8.4.1
+- TensorRT 8.4.1.5

 ## Prebuilt Binaries and Wheel files


@@ -213,7 +214,7 @@ bazel build //:libtorchtrt --compilation_mode opt
 ```

 ### FX path (Python only) installation
-If the user plan to try FX path (Python only) and would like to avoid bazel build. Please follow the steps below.
+If the user plans to try FX path (Python only) and would like to avoid bazel build. Please follow the steps below.
 ``` shell
 cd py && python3 setup.py install --fx-only
 ```
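The README hunk above describes the AOT workflow: the module is compiled once ahead of time, and the result behaves like any other TorchScript module at runtime. As a rough illustration, not part of this commit, the sketch below loads a module saved the way the README's torch.jit.save(...) snippet does and runs it from C++. It assumes the file name from that snippet, a CUDA-capable device, an illustrative input shape, and that the Torch-TensorRT runtime library is linked (or loaded) so the embedded TensorRT engine can execute.

```cpp
#include <torch/script.h>
#include <iostream>
#include <vector>

int main() {
  // Load the TorchScript module that Torch-TensorRT produced and torch.jit.save() wrote out.
  // The file name matches the README snippet; any path to a TRT-embedded module works.
  torch::jit::script::Module mod = torch::jit::load("trt_torchscript_module.ts");
  mod.to(torch::kCUDA);

  // Run it exactly like a regular TorchScript module; the TensorRT engine is invoked internally.
  // The input shape here is illustrative and must match what the module was compiled for.
  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::randn({1, 3, 224, 224}, torch::kCUDA));

  at::Tensor out = mod.forward(inputs).toTensor();
  std::cout << "output sizes: " << out.sizes() << std::endl;
  return 0;
}
```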

WORKSPACE (+10 -10)

@@ -56,17 +56,17 @@ new_local_repository(
 http_archive(
     name = "libtorch",
     build_file = "@//third_party/libtorch:BUILD",
-    sha256 = "8d9e829ce9478db4f35bdb7943308cf02e8a2f58cf9bb10f742462c1d57bf287",
+    sha256 = "80f089939de20e68e3fcad4dfa72a26c8bf91b5e77b11042f671f39ebac35865",
     strip_prefix = "libtorch",
-    urls = ["https://download.pytorch.org/libtorch/cu113/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu113.zip"],
+    urls = ["https://download.pytorch.org/libtorch/cu113/libtorch-cxx11-abi-shared-with-deps-1.12.0%2Bcu113.zip"],
 )

 http_archive(
     name = "libtorch_pre_cxx11_abi",
     build_file = "@//third_party/libtorch:BUILD",
-    sha256 = "90159ecce3ff451f3ef3f657493b6c7c96759c3b74bbd70c1695f2ea2f81e1ad",
+    sha256 = "8e35371403f7052d9e9b43bcff383980dbde4df028986dc1dab539953481d55f",
     strip_prefix = "libtorch",
-    urls = ["https://download.pytorch.org/libtorch/cu113/libtorch-shared-with-deps-1.11.0%2Bcu113.zip"],
+    urls = ["https://download.pytorch.org/libtorch/cu113/libtorch-shared-with-deps-1.12.0%2Bcu113.zip"],
 )

 # Download these tarballs manually from the NVIDIA website

@@ -76,20 +76,20 @@ http_archive(
 http_archive(
     name = "cudnn",
     build_file = "@//third_party/cudnn/archive:BUILD",
-    sha256 = "0e5d2df890b9967efa6619da421310d97323565a79f05a1a8cb9b7165baad0d7",
-    strip_prefix = "cuda",
+    sha256 = "ec96d2376d81fca42bdd3d4c3d705a99b29a065bab57f920561c763e29c67d01",
+    strip_prefix = "cudnn-linux-x86_64-8.4.1.50_cuda11.6-archive",
     urls = [
-        "https://developer.nvidia.com/compute/machine-learning/cudnn/secure/8.2.4/11.4_20210831/cudnn-11.4-linux-x64-v8.2.4.15.tgz",
+        "https://developer.nvidia.com/compute/cudnn/secure/8.4.1/local_installers/11.6/cudnn-linux-x86_64-8.4.1.50_cuda11.6-archive.tar.xz",
     ],
 )

 http_archive(
     name = "tensorrt",
     build_file = "@//third_party/tensorrt/archive:BUILD",
-    sha256 = "826180eaaecdf9a7e76116855b9f1f3400ea9b06e66b06a3f6a0747ba6f863ad",
-    strip_prefix = "TensorRT-8.2.4.2",
+    sha256 = "8107861af218694130f170e071f49814fa3e27f1386ce7cb6d807ac05a7fcf0e",
+    strip_prefix = "TensorRT-8.4.1.5",
     urls = [
-        "https://developer.nvidia.com/compute/machine-learning/tensorrt/secure/8.2.4/tars/tensorrt-8.2.4.2.linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz",
+        "https://developer.nvidia.com/compute/machine-learning/tensorrt/secure/8.4.1/tars/tensorrt-8.4.1.5.linux.x86_64-gnu.cuda-11.6.cudnn8.4.tar.gz",
     ],
 )


core/compiler.cpp (+19 -23)

@@ -359,14 +359,6 @@ void MapInputsAndDetermineDTypes(
   }
 }

-uint64_t GetRecommendedWorkspaceSize(const runtime::CudaDevice& device) {
-  if (device.major < 6) {
-    return 256 * (1 << 20);
-  } else {
-    return 1 << 30;
-  }
-}
-
 std::string ConvertGraphToTRTEngine(const torch::jit::script::Module& mod, std::string method_name, CompileSpec cfg) {
   // Go through Lowering to simplify graph and extract weight parameters
   auto graph_and_parameters = lowering::Lower(mod, method_name, cfg.lower_info);

@@ -380,14 +372,14 @@ std::string ConvertGraphToTRTEngine(const torch::jit::script::Module& mod, std::
   // Infer the type of an input from the weights of the calculation
   auto first_use_types = ir::get_block_first_calc_dtypes_opt(g->block());

-  // GPU default WS size : 1 GB
-  // Set WS = 256 Mb for Jetson nano/TX1 like platforms whose compute capability is 5.X.
-  auto workspace_size = cfg.convert_info.engine_settings.workspace_size;
-  auto device_spec = cfg.convert_info.engine_settings.device;
-  auto cuda_device = runtime::CudaDevice(device_spec.gpu_id, device_spec.device_type);
-  if (workspace_size == 0) {
-    cfg.convert_info.engine_settings.workspace_size = GetRecommendedWorkspaceSize(cuda_device);
-  }
+  // // GPU default WS size : 1 GB
+  // // Set WS = 256 Mb for Jetson nano/TX1 like platforms whose compute capability is 5.X.
+  // auto workspace_size = cfg.convert_info.engine_settings.workspace_size;
+  // auto device_spec = cfg.convert_info.engine_settings.device;
+  // auto cuda_device = runtime::CudaDevice(device_spec.gpu_id, device_spec.device_type);
+  // if (workspace_size == 0) {
+  //   cfg.convert_info.engine_settings.workspace_size = GetRecommendedWorkspaceSize(cuda_device);
+  // }

   MapInputsAndDetermineDTypes(cfg, g, static_params, first_use_types);


@@ -399,14 +391,14 @@ std::string ConvertGraphToTRTEngine(const torch::jit::script::Module& mod, std::
 torch::jit::Module CompileGraph(const torch::jit::Module& mod, CompileSpec cfg) {
   torch::jit::Module new_mod(mod._ivalue()->name() + "_trt");

-  // GPU default WS size : 1 GB
-  // Set WS = 256 Mb for Jetson nano/TX1 like platforms whose compute capability is 5.X.
-  auto workspace_size = cfg.convert_info.engine_settings.workspace_size;
+  // // GPU default WS size : 1 GB
+  // // Set WS = 256 Mb for Jetson nano/TX1 like platforms whose compute capability is 5.X.
+  // auto workspace_size = cfg.convert_info.engine_settings.workspace_size;
   auto device_spec = cfg.convert_info.engine_settings.device;
   auto cuda_device = runtime::CudaDevice(device_spec.gpu_id, device_spec.device_type);
-  if (workspace_size == 0) {
-    cfg.convert_info.engine_settings.workspace_size = GetRecommendedWorkspaceSize(cuda_device);
-  }
+  // if (workspace_size == 0) {
+  //   cfg.convert_info.engine_settings.workspace_size = GetRecommendedWorkspaceSize(cuda_device);
+  // }

   for (const torch::jit::Method& method : mod.get_methods()) {
     if (method.name().compare("forward") == 0) {

@@ -436,7 +428,11 @@ torch::jit::Module CompileGraph(const torch::jit::Module& mod, CompileSpec cfg)
     auto graph_and_mapping =
         ConstructFallbackGraph(new_mod, g->block(), input_ivalues_map, cfg, static_params, fallback_nodes);
     new_g = graph_and_mapping.first;
-    LOG_INFO("Graph after Fallback: " << *new_g);
+    // renaming the input name of graph after fallback to ensure pytorch deserialize it correctly
+    for (size_t i = 0; i < new_g->inputs().size(); ++i) {
+      new_g->inputs()[i]->setDebugName(std::string("input_") + std::to_string(i));
+    }
+    LOG_INFO(*new_g << "(GraphAfterFallback)");

     // if there is no tensorrt engine self in fallback graph, there is no conversion, we just return the initial
     // module
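For context on the input-renaming hunk above: after partitioning/fallback the graph's inputs can carry arbitrary debug names, and the commit normalizes them to input_0, input_1, ... so the re-serialized TorchScript module deserializes cleanly. The standalone sketch below, which is not part of the commit and uses illustrative input names, shows the same renaming pattern on a freshly built torch::jit::Graph.

```cpp
#include <torch/script.h>
#include <torch/csrc/jit/ir/ir.h>
#include <iostream>
#include <memory>
#include <string>

int main() {
  // Build a tiny graph whose inputs have arbitrary debug names.
  auto g = std::make_shared<torch::jit::Graph>();
  g->addInput("self_1");
  g->addInput("x.3");

  // Normalize the input names, mirroring the post-fallback renaming in the diff:
  // every graph input becomes input_<index>.
  for (size_t i = 0; i < g->inputs().size(); ++i) {
    g->inputs()[i]->setDebugName(std::string("input_") + std::to_string(i));
  }

  // Print the graph IR; the inputs now appear as %input_0, %input_1.
  std::cout << *g << std::endl;
  return 0;
}
```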

core/conversion/conversion.cpp (+1 -1)

@@ -188,7 +188,7 @@ void AddInputs(
       ctx->input_is_dynamic = true;
     }

-    ctx->value_tensor_map[in] = trt_in;
+    ctx->RecordNewITensor(in, trt_in);
     ctx->num_inputs += 1;
   }


core/conversion/conversionctx/ConversionCtx.cpp (+28 -6)

@@ -20,9 +20,11 @@ std::ostream& operator<<(std::ostream& os, const BuilderSettings& s) {
      << "\n Debuggable Engine: " << s.debug \
      << "\n GPU ID: " << s.device.gpu_id \
      << "\n Allow GPU Fallback (if running on DLA): " << s.device.allow_gpu_fallback \
-     << "\n Min Timing Iterations: " << s.num_min_timing_iters \
      << "\n Avg Timing Iterations: " << s.num_avg_timing_iters \
-     << "\n Max Workspace Size: " << s.workspace_size;
+     << "\n Max Workspace Size: " << s.workspace_size \
+     << "\n DLA SRAM Size: " << s.dla_sram_size \
+     << "\n DLA Local DRAM Size: " << s.dla_local_dram_size \
+     << "\n DLA Global DRAM Size: " << s.dla_global_dram_size;

   os << "\n Device Type: " << s.device.device_type \
      << "\n GPU ID: " << s.device.gpu_id;

@@ -104,9 +106,11 @@ ConversionCtx::ConversionCtx(BuilderSettings build_settings)
     cfg->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);
   }

-  cfg->setMinTimingIterations(settings.num_min_timing_iters);
   cfg->setAvgTimingIterations(settings.num_avg_timing_iters);
-  cfg->setMaxWorkspaceSize(settings.workspace_size);
+  if (settings.workspace_size != 0){
+    cfg->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, settings.workspace_size);
+  }
+
   cfg->setDefaultDeviceType(settings.device.device_type);
   cfg->setEngineCapability(settings.capability);


@@ -120,6 +124,15 @@ ConversionCtx::ConversionCtx(BuilderSettings build_settings)
         settings.enabled_precisions.find(nvinfer1::DataType::kFLOAT) == settings.enabled_precisions.end(),
         "DLA supports only fp16 or int8 precision");
     cfg->setDLACore(settings.device.dla_core);
+    if (settings.dla_sram_size != 1048576){
+      cfg->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_MANAGED_SRAM, settings.dla_sram_size);
+    }
+    if (settings.dla_local_dram_size != 1073741824){
+      cfg->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_LOCAL_DRAM, settings.dla_local_dram_size);
+    }
+    if (settings.dla_global_dram_size != 536870912){
+      cfg->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_GLOBAL_DRAM, settings.dla_global_dram_size);
+    }
   }
 }


@@ -130,8 +143,8 @@ ConversionCtx::~ConversionCtx() {
 }

 nvinfer1::ITensor* ConversionCtx::AssociateValueAndTensor(const torch::jit::Value* value, nvinfer1::ITensor* tensor) {
-  tensor->setName(value->debugName().c_str());
-  this->value_tensor_map[value] = tensor;
+  RecordNewITensor(value, tensor);
+
   return tensor;
 }


@@ -140,6 +153,15 @@ torch::jit::IValue* ConversionCtx::AssociateValueAndIValue(const torch::jit::Val
   return &this->evaluated_value_map[value];
 }

+void ConversionCtx::RecordNewITensor(const torch::jit::Value* value, nvinfer1::ITensor* tensor) {
+  value_tensor_map[value] = tensor;
+  auto ret = seen_itensors.insert(tensor);
+  if (!ret.second) {
+    LOG_WARNING(
+        "Trying to record the value " << value->debugName() << " with the ITensor " << tensor->getName() << " again.");
+  }
+}
+
 std::string ConversionCtx::SerializeEngine() {
 #if NV_TENSORRT_MAJOR > 7
   auto serialized_network = builder->buildSerializedNetwork(*net, *cfg);
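The ConversionCtx.cpp hunks above move from the deprecated setMaxWorkspaceSize() to TensorRT's memory-pool interface and only touch a pool when the user overrides the corresponding default. A minimal standalone sketch of that IBuilderConfig API is below; it is not from the commit, the logger class and pool sizes are illustrative, and the DLA pools only matter when a DLA core is actually targeted.

```cpp
#include <NvInfer.h>
#include <iostream>
#include <memory>

// Minimal logger required by the TensorRT builder; prints only warnings and errors.
class Logger : public nvinfer1::ILogger {
  void log(Severity severity, const char* msg) noexcept override {
    if (severity <= Severity::kWARNING) {
      std::cout << msg << std::endl;
    }
  }
};

int main() {
  Logger logger;
  auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
  auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());

  // TensorRT 8.4: workspace is one of several memory pools. Leaving it untouched keeps
  // TensorRT's default limit, which is why the commit only sets it for a non-zero
  // user-provided workspace_size.
  config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1ULL << 30);  // 1 GiB, illustrative

  // DLA-managed pools, analogous to the dla_*_size settings added in this commit.
  config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_MANAGED_SRAM, 1ULL << 20);
  config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_LOCAL_DRAM, 1ULL << 30);
  config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_GLOBAL_DRAM, 512ULL << 20);

  std::cout << "workspace pool limit: "
            << config->getMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE)
            << " bytes" << std::endl;
  return 0;
}
```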

core/conversion/conversionctx/ConversionCtx.h (+7 -1)

@@ -33,9 +33,11 @@ struct BuilderSettings {
   Device device;
   nvinfer1::EngineCapability capability = TRT_ENGINE_CAPABILITY_STANDARD;
   nvinfer1::IInt8Calibrator* calibrator = nullptr;
-  uint64_t num_min_timing_iters = 2;
   uint64_t num_avg_timing_iters = 1;
   uint64_t workspace_size = 0;
+  uint64_t dla_sram_size = 1048576;
+  uint64_t dla_local_dram_size = 1073741824;
+  uint64_t dla_global_dram_size = 536870912;

   BuilderSettings() = default;
   BuilderSettings(const BuilderSettings& other) = default;

@@ -46,6 +48,7 @@ struct ConversionCtx {
   ConversionCtx(BuilderSettings settings);
   std::string SerializeEngine();
   nvinfer1::ITensor* AssociateValueAndTensor(const torch::jit::Value* value, nvinfer1::ITensor* tensor);
+  void RecordNewITensor(const torch::jit::Value* value, nvinfer1::ITensor* tensor);
   torch::jit::IValue* AssociateValueAndIValue(const torch::jit::Value* value, torch::jit::IValue tensor);
   bool CheckLayerAddition(const torch::jit::Node* n);


@@ -69,6 +72,9 @@ struct ConversionCtx {

   std::unordered_map<const torch::jit::Value*, nvinfer1::ITensor*> value_tensor_map;
   std::unordered_map<const torch::jit::Value*, torch::jit::IValue> evaluated_value_map;
+
+  // record already named ITensors to prevent rewriting another name to the same tensor
+  std::unordered_set<nvinfer1::ITensor*> seen_itensors;
 };

 } // namespace conversion
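The seen_itensors member declared above backs RecordNewITensor(): std::unordered_set::insert reports through its returned pair whether the tensor pointer was already registered, which is what triggers the duplicate-recording warning. A generic sketch of that idiom, with illustrative stand-in types and no TensorRT dependency, is below.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Stand-ins for torch::jit::Value* / nvinfer1::ITensor* in the real code.
struct Value { std::string name; };
struct Tensor { std::string name; };

std::unordered_map<const Value*, Tensor*> value_tensor_map;
std::unordered_set<Tensor*> seen_tensors;

// Mirrors the RecordNewITensor pattern: always update the map, but warn if the
// same tensor object is being registered under a second value.
void RecordNewTensor(const Value* value, Tensor* tensor) {
  value_tensor_map[value] = tensor;
  auto ret = seen_tensors.insert(tensor);
  if (!ret.second) {
    std::cout << "warning: tensor " << tensor->name
              << " recorded again for value " << value->name << "\n";
  }
}

int main() {
  Value a{"a"}, b{"b"};
  Tensor t{"t0"};
  RecordNewTensor(&a, &t);  // first registration, no warning
  RecordNewTensor(&b, &t);  // same tensor again -> warning
  return 0;
}
```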
