Deploying a Torch-TensorRT model (to Triton)
============================================

Optimization and deployment go hand in hand in any discussion of Machine
Learning infrastructure. Once network-level optimizations are done
to get the maximum performance, the next step is to deploy the model.

However, serving this optimized model comes with its own set of considerations
and challenges, such as building an infrastructure to support concurrent model
executions, supporting clients over HTTP or gRPC, and more.

The `Triton Inference Server <https://github.com/triton-inference-server/server>`__
solves these challenges and more. Let's discuss, step by step, the process of
optimizing a model with Torch-TensorRT, deploying it on Triton Inference
Server, and building a client to query the model.

Step 1: Optimize your model with Torch-TensorRT
-----------------------------------------------

Most Torch-TensorRT users will be familiar with this step. For the purpose of
this demonstration, we will be using a ResNet50 model from Torch Hub.

Let's first pull the `NGC PyTorch Docker container <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`__. You may need to create
an account and get the API key from `here <https://ngc.nvidia.com/setup/>`__.
Sign up and log in with your key (follow the instructions
`here <https://ngc.nvidia.com/setup/api-key>`__ after signing up).

::

    # <xx.xx> is the yy.mm of the publishing tag for NVIDIA's PyTorch
    # container; e.g. 22.04

    docker run -it --gpus all -v ${PWD}:/scratch_space nvcr.io/nvidia/pytorch:<xx.xx>-py3
    cd /scratch_space

Once inside the container, we can proceed to download a ResNet model from
Torch Hub and optimize it with Torch-TensorRT.

::

    import torch
    import torch_tensorrt
    torch.hub._validate_not_a_forked_repo=lambda a,b,c: True

    # load model
    model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")

    # Compile with Torch-TensorRT
    trt_model = torch_tensorrt.compile(model,
        inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
        enabled_precisions={torch.half}  # Run with FP16
    )

    # Save the optimized model as a TorchScript module
    torch.jit.save(trt_model, "model.pt")

After saving the model, exit the container. Since ``/scratch_space`` is mounted
from the host, the saved ``model.pt`` is now available outside the container.
The next step in the process is to set up a Triton Inference Server.

Step 2: Set Up Triton Inference Server
--------------------------------------

If you are new to the Triton Inference Server and want to learn more, we
highly recommend checking out our `GitHub
repository <https://github.com/triton-inference-server>`__.

To use Triton, we need to make a model repository. A model repository, as the
name suggests, is a repository of the models the Inference Server hosts. While
Triton can serve models from multiple repositories, in this example, we will
discuss the simplest possible form of the model repository.

The structure of this repository should look something like this:

::

    model_repository
    |
    +-- resnet50
        |
        +-- config.pbtxt
        +-- 1
            |
            +-- model.pt
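
One minimal way to create this layout on the host is sketched below. It assumes
the ``model.pt`` saved in Step 1 sits in your current directory and that you
have already written the ``config.pbtxt`` described next; the exact paths are
illustrative.

::

    # create the model directory with a numbered version subdirectory
    mkdir -p model_repository/resnet50/1

    # the serialized model goes inside the version directory
    cp model.pt model_repository/resnet50/1/model.pt

    # the configuration file sits next to the version directory
    cp config.pbtxt model_repository/resnet50/config.pbtxt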

There are two files that Triton requires to serve the model: the model itself
and a model configuration file, which is typically provided in ``config.pbtxt``.
For the model we prepared in Step 1, the following configuration can be used:

::

    name: "resnet50"
    platform: "pytorch_libtorch"
    max_batch_size: 0
    input [
      {
        name: "input__0"
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        reshape { shape: [ 1, 3, 224, 224 ] }
      }
    ]
    output [
      {
        name: "output__0"
        data_type: TYPE_FP32
        dims: [ 1, 1000, 1, 1 ]
        reshape { shape: [ 1, 1000 ] }
      }
    ]

The ``config.pbtxt`` file is used to describe the exact model configuration
with details like the names and shapes of the input and output layer(s),
datatypes, scheduling and batching details, and more. If you are new to Triton,
we highly encourage you to check out this `section of our
documentation <https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md>`__
for more details.

With the model repository set up, we can proceed to launch the Triton server
with the docker command below. Refer to `this page <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver>`__ for the pull tag for the container.

::

    # Make sure that the TensorRT version in the Triton container
    # and the TensorRT version in the environment used to optimize the model
    # are the same.

    docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /full/path/to/the_model_repository/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models

This should spin up a Triton Inference Server. The next step is to build a
simple HTTP client to query the server.
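
Before writing a client, you can sanity-check from the host that the server
came up correctly using Triton's HTTP/REST (KServe v2) endpoints. The sketch
below assumes the default HTTP port ``8000`` mapped in the command above.

::

    # Returns HTTP 200 once the server and its loaded models are ready
    curl -v localhost:8000/v2/health/ready

    # Metadata for the resnet50 model we just deployed, including its
    # input and output names, datatypes, and shapes
    curl localhost:8000/v2/models/resnet50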

Step 3: Building a Triton Client to Query the Server
-----------------------------------------------------

Before proceeding, make sure to have a sample image on hand. If you don't
have one, download an example image to test inference. In this section, we
will be going over a very basic client. For a variety of more fleshed-out
examples, refer to the `Triton Client Repository <https://github.com/triton-inference-server/client/tree/main/src/python/examples>`__.

::

    wget -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg"

We then need to install dependencies for building a Python client. These will
change from client to client. For a full list of all languages supported by Triton,
please refer to `Triton's client repository <https://github.com/triton-inference-server/client>`__.

::

    pip install torchvision
    pip install attrdict
    pip install nvidia-pyindex
    pip install tritonclient[all]

Let's jump into the client. First, we write a small preprocessing function to
resize and normalize the query image.

::

    import numpy as np
    from torchvision import transforms
    from PIL import Image
    import tritonclient.http as httpclient
    from tritonclient.utils import triton_to_np_dtype

    # preprocessing function
    def rn50_preprocess(img_path="img1.jpg"):
        img = Image.open(img_path)
        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])
        return preprocess(img).numpy()

    transformed_img = rn50_preprocess()

Building a client requires three basic steps. First, we set up a connection
with the Triton Inference Server.

::

    # Setting up client
    client = httpclient.InferenceServerClient(url="localhost:8000")

Second, we specify the names of the input and output layer(s) of our model.

::

    inputs = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32")
    inputs.set_data_from_numpy(transformed_img, binary_data=True)

    outputs = httpclient.InferRequestedOutput("output__0", binary_data=True, class_count=1000)

Lastly, we send an inference request to the Triton Inference Server.

::

    # Querying the server
    results = client.infer(model_name="resnet50", inputs=[inputs], outputs=[outputs])
    inference_output = results.as_numpy('output__0')
    print(inference_output[:5])

The output should look like the following:

::

    [b'12.468750:90' b'11.523438:92' b'9.664062:14' b'8.429688:136'
     b'8.234375:11']

The output format here is ``<confidence_score>:<classification_index>``.
To learn how to map these to the label names and more, refer to our
`documentation <https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_classification.md>`__.
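
As a quick illustration, the sketch below parses these strings and looks the
indices up in an ImageNet label list. The label file name
(``imagenet_classes.txt``, one class name per line) is an assumption made for
this example and is not part of the steps above; any equivalent ImageNet label
list will do.

::

    # Minimal sketch: map "<score>:<index>" strings to label names.
    # Assumes `inference_output` from the client code above and a local
    # "imagenet_classes.txt" file (hypothetical) with one label per line.
    with open("imagenet_classes.txt") as f:
        labels = [line.strip() for line in f]

    for entry in inference_output[:5]:
        score, class_idx = entry.decode("utf-8").split(":")
        print(f"{labels[int(class_idx)]}: {float(score):.4f}")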