Fix memory leaks #2526


Merged
merged 1 commit into from
Dec 18, 2023

Conversation

gcuendet
Contributor

Description

Compiling a TorchScript module with Torch-TensorRT exhibits large memory leaks. A simple way to reproduce the leaks is provided in this script:

import numpy as np
import torch_tensorrt as trt
import torch
import torchvision
import psutil
import gc


if __name__ == "__main__":
    network = torchvision.models.mobilenet_v2(pretrained=True)
    network.eval().cuda()
    torch_s = torch.jit.script(network)

    compile_settings = {
        "inputs": [
            trt.Input([1, 3, 224, 224])
        ],
        "enabled_precisions": {torch.float32},
    }
    output_path = "/tmp/trt.ts"

    for _ in range(3):
        print(f"Used Virtual Memory: {psutil.virtual_memory().used / (1024*1024)}")
        trt_ts_module = trt.compile(torch_s, **compile_settings)
        torch.jit.save(trt_ts_module, output_path)

        del trt_ts_module
        gc.collect()

Running this script prints the used memory for each loop iteration, which increases steadily by 35–45 MB per iteration, even though all objects created in the loop are deleted, so it should stay flat.

Running the small reproduction script under Valgrind, using the following command:

valgrind --leak-check=full python leak_mem.py

reports the following sizeable (~42 MB) possible losses:

==1163== 42,470,332 bytes in 3 blocks are possibly lost in loss record 70,886 of 70,887
==1163==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1163==    by 0x2EAF9ADE: ??? (in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8.5.3)
==1163==    by 0x2F00444B: ??? (in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8.5.3)
==1163==    by 0x12D2ED492: torch_tensorrt::core::conversion::ConversionCtx::SerializeEngine[abi:cxx11]() (in /usr/local/lib/python3.8/dist-packages/torch_tensorrt/lib/libtorchtrt.so)
==1163==    by 0x12D241E30: torch_tensorrt::core::conversion::ConvertBlockToEngine[abi:cxx11](torch::jit::Block const*, torch_tensorrt::core::conversion::ConversionInfo, std::map<torch::jit::Value*, c10::IValue, std::less<torch::jit::Value*>, std::allocator<std::pair<torch::jit::Value* const, c10::IValue> > >&) (in /usr/local/lib/python3.8/dist-packages/torch_tensorrt/lib/libtorchtrt.so)
==1163==    by 0x12D1F75A2: torch_tensorrt::core::CompileGraph(torch::jit::Module const&, torch_tensorrt::core::CompileSpec) (in /usr/local/lib/python3.8/dist-packages/torch_tensorrt/lib/libtorchtrt.so)
==1163==    by 0x12D05AC0B: torch_tensorrt::pyapi::CompileGraph(torch::jit::Module const&, torch_tensorrt::pyapi::CompileSpec&) (torch_tensorrt_py.cpp:155)
==1163==    by 0x12D08806E: void pybind11::cpp_function::initialize<torch::jit::Module (*&)(torch::jit::Module const&, torch_tensorrt::pyapi::CompileSpec&), torch::jit::Module, torch::jit::Module const&, torch_tensorrt::pyapi::CompileSpec&, pybind11::name, pybind11::scope, pybind11::sibling, char [128]>(torch::jit::Module (*&)(torch::jit::Module const&, torch_tensorrt::pyapi::CompileSpec&), torch::jit::Module (*)(torch::jit::Module const&, torch_tensorrt::pyapi::CompileSpec&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [128])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) (cast.h:1439)
==1163==    by 0x12D08AFA8: pybind11::cpp_function::dispatcher(_object*, _object*, _object*) (pybind11.h:929)
==1163==    by 0x5F5B38: PyCFunction_Call (in /usr/bin/python3.8)
==1163==    by 0x5F6705: _PyObject_MakeTpCall (in /usr/bin/python3.8)
==1163==    by 0x571142: _PyEval_EvalFrameDefault (in /usr/bin/python3.8)
==1163==
==1163== 42,470,332 bytes in 3 blocks are possibly lost in loss record 70,887 of 70,887
==1163==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1163==    by 0x2EAF9ADE: ??? (in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8.5.3)
==1163==    by 0x2F7B03DF: ??? (in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8.5.3)
==1163==    by 0x12D2139BC: std::_Function_handler<void (std::vector<c10::IValue, std::allocator<c10::IValue> >&), torch::jit::Function* torch::class_<torch_tensorrt::core::runtime::TRTEngine>::defineMethod<torch_tensorrt::core::runtime::(anonymous namespace)::{lambda(c10::intrusive_ptr<torch_tensorrt::core::runtime::TRTEngine, c10::detail::intrusive_target_default_null_type<torch_tensorrt::core::runtime::TRTEngine> > const&)#1}>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, torch_tensorrt::core::runtime::(anonymous namespace)::{lambda(c10::intrusive_ptr<torch_tensorrt::core::runtime::TRTEngine, c10::detail::intrusive_target_default_null_type<torch_tensorrt::core::runtime::TRTEngine> > const&)#1}, std::allocator<char>, std::initializer_list<torch::arg>)::{lambda(std::vector<c10::IValue, std::allocator<c10::IValue> >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator<c10::IValue> >&) (in /usr/local/lib/python3.8/dist-packages/torch_tensorrt/lib/libtorchtrt.so)
==1163==    by 0xC64E5EF6: torch::jit::Pickler::pushIValueImpl(c10::IValue const&) (in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
==1163==    by 0xC64E641A: torch::jit::Pickler::pushIValue(c10::IValue const&) (in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
==1163==    by 0xC64E612E: torch::jit::Pickler::pushIValueImpl(c10::IValue const&) (in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
==1163==    by 0xC64E65B2: torch::jit::Pickler::pushIValue(c10::IValue const&) (in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
==1163==    by 0xC694D3BD: torch::jit::ScriptModuleSerializer::writeArchive(c10::IValue const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) (in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
==1163==    by 0xC695043A: torch::jit::ScriptModuleSerializer::serialize(torch::jit::Module const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, bool, bool) (in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
==1163==    by 0xC695154C: torch::jit::ExportModule(torch::jit::Module const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, bool, bool, bool) (in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
==1163==    by 0xC13976B4: void pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::{lambda(torch::jit::Module&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)#25}, void, torch::jit::Module&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v>(torch::jit::initJitScriptBindings(_object*)::{lambda(torch::jit::Module&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)#25}&&, void (*)(torch::jit::Module&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) (in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
==1163==

This points to two places,

  1. a function called torch_tensorrt::core::conversion::ConversionCtx::SerializeEngine (defined in core/conversion/conversionctx/ConversionCtx.cpp)
  2. a class called torch::class_<torch_tensorrt::core::runtime::TRTEngine> (defined in core/runtime/register_jit_hooks.cpp)

In these two places, pointers to TensorRT objects are obtained:

  1. An IHostMemory* raw pointer is returned by nvinfer1::IBuilder::buildSerializedNetwork(...)
  2. An IHostMemory* raw pointer is returned by nvinfer1::ICudaEngine::serialize()

This PR adds the missing wrapping of these raw pointers in smart pointers, so that the destructors of the underlying TensorRT objects are called and the associated host memory is properly released.

Type of change

Please delete options that are not relevant and/or add your own.

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

Add missing wrapping of raw pointers in smart pointers, so that the
destructors of the underlying TensorRT objects are called properly

Signed-off-by: Gabriel Cuendet <[email protected]>
@github-actions github-actions bot added component: conversion Issues re: Conversion stage component: core Issues re: The core compiler component: runtime labels Dec 11, 2023
@github-actions github-actions bot requested a review from narendasan December 11, 2023 14:06
@gcuendet
Contributor Author

Looking quickly at the rest of the code, the TRTEngine::get_engine_layer_info function implemented in `core/core/TRTEngine.cpp` might be another place where such wrapping of a TensorRT raw pointer in a smart pointer is missing:

std::string TRTEngine::get_engine_layer_info() {
  auto inspector = cuda_engine->createEngineInspector(); // <-- The object pointed to by inspector never gets deleted
  return inspector->getEngineInformation(nvinfer1::LayerInformationFormat::kJSON);
}

@narendasan narendasan requested a review from gs-olive December 12, 2023 18:07
Collaborator

@narendasan narendasan left a comment


LGTM, thanks for the patch. Just want @gs-olive to take a look as well.

@gs-olive
Collaborator

This looks good! I do still see the memory footprint increasing from run to run, but the "definitely lost" total in the Valgrind leak summary decreases substantially, which is great to see. Note that I am using a debug build, which could affect the memory metrics here.

@narendasan narendasan merged commit e0b3fe1 into pytorch:main Dec 18, 2023