NVIDIA Triton Inference Server — Serve DL models like a pro

Arun Jith A
6 min read · Sep 19, 2022

Deploying deep learning models on scalable, optimised infrastructure, be it GPU or CPU, and streamlining the whole process can be quite challenging, especially if you are new to it. It can be daunting to get a performant service for your models, and even more so if you try to build out the components individually. This is where Triton fits in. If you haven’t heard about Triton before, you probably should, and this article will help you get up to speed with it.

I could talk a lot more about Triton, but there’s plenty of information out there already to convince you if you aren’t convinced yet. Here’s an introduction to Triton from NVIDIA. I want to focus on helping you or your team set up a simple Triton Inference Server instance from scratch.

Prerequisites

  1. A virtual machine or even a local PC/laptop, with or without a GPU (we will focus on an Ubuntu 22.04 AWS VM (p3.2xlarge), which comes with a V100 GPU)
  2. Basics of Docker
  3. Python: in this walkthrough we will mainly use Python to run inference on the deployed model
  4. Any pre-trained deep learning model (optional; we can also work with the example models configured in the official Triton repo)

[Optional] GPU Setup

If you plan to use a GPU, we need to set up the NVIDIA GPU drivers, CUDA and the NVIDIA Container Toolkit before Docker can access the GPU.

Follow the steps below to get the drivers and the toolkit, along with Docker if you don’t have it installed.

sudo apt update
# The "ubuntu-drivers" tool comes from the ubuntu-drivers-common package
sudo apt install ubuntu-drivers-common
# List the available drivers for your GPU; one of them will be tagged "recommended"
ubuntu-drivers devices
# Automatically pick the recommended version to install
sudo ubuntu-drivers autoinstall
# If you want a specific version from the list, install it directly
sudo apt install nvidia-driver-470
# Reboot once the install finishes
sudo reboot
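
After the reboot, it’s worth a quick sanity check that the driver is actually loaded before moving on:

# Should print the driver version and a table listing your GPU(s)
nvidia-smi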

Install CUDA

It goes without saying that you need an NVIDIA GPU with the drivers installed. If you have been following along, you should be all set to install CUDA.

sudo apt update
sudo apt install build-essential
sudo apt install nvidia-cuda-toolkit
# To confirm CUDA is working run this
nvcc --version
If nvcc prints its version info, then yaay!

Install NVIDIA Container Tools with Docker

Now it’s time for us to set up the last part.

# Install docker
curl https://get.docker.com | sh \
&& sudo systemctl --now enable docker
# I know I know but the whole thing, yes?
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
# To test run this
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
If that printed the familiar nvidia-smi table from inside the container, welcome to the club. 🙌🏻

Now then, we have the foundation. Let’s build on top of it.

Set up a Model Repository and Start Triton

Let’s set up a simple local repository, essentially a folder containing our models. Ideally we could set this up in S3 or another cloud storage as well; more on that in a later post. For now, we will create an empty folder.

mkdir models

Starting Triton

Triton server releases are available as pre-built Docker images. Since we have Docker installed, we can pull the image with the following command

docker pull nvcr.io/nvidia/tritonserver:22.01-py3

We will use version 22.01 here, but you can of course replace it with whichever version you need. Now let’s start the image with a few parameters

docker run --gpus=1 --rm \
  -p8000:8000 -p8001:8001 -p8002:8002 \
  -v ~/models:/models \
  nvcr.io/nvidia/tritonserver:22.01-py3 \
  tritonserver --model-repository=/models \
    --model-control-mode=poll \
    --repository-poll-secs 30

Parameters Explained

  1. --model-repository : The model repository folder inside the container. Notice that the “models” folder we created is mapped to “/models” inside the container, so all we have to do now is drop our models into that folder and voilà!
  2. --model-control-mode : Triton lets us control a few behaviours, one of them being when models are loaded. Only during startup? Or should it watch a directory for changes? Setting this to “poll” makes Triton literally poll the model repository for changes every x seconds, where x is configured by the next argument.
  3. --repository-poll-secs : A nonzero value in seconds that must be specified along with poll mode. We are setting it to 30s for now.

If everything succeeded, you should see Triton’s startup log with the (for now empty) model table and the HTTP (8000), gRPC (8001) and metrics (8002) services reported as started. CONGRATULATIONS!

Now that we have our empty model repository set up, let’s get down to the final few steps

# Run this to check the health of our server
curl -v http://<your-vm-ip-here>:8000/v2/health/ready
A ready server responds with an HTTP 200. And no, I wasn’t gonna show you my IP.
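
Once a model has been added (we’ll do that in the next section), you can check that specific model’s readiness in the same way; the model name below is a placeholder:

# Check whether a particular model is loaded and ready
curl -v http://<your-vm-ip-here>:8000/v2/models/<model-name>/ready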

Adding a Model

Models are expected in a particular format. Triton supports multiple runtimes, including TensorFlow, TorchScript and ONNX. Read more on this here. Generally speaking, you want your model files in the following layout.

|<model-repository-path>/
| |<model-name>/
| | |config.pbtxt
| | |1/
| | | |model.xyz
# "1" here is the version number and "xyz" is the model file extension, which depends on the framework

Now, the config file has to be written in a specific way. If you are familiar with Protocol Buffers it will look obvious to you, but if not, it is pretty much a fancy text file. To write one for your model, refer to this. There are also examples of these files here. Once this is set up, your model folder can be moved into our “models” folder and the server will pick it up.
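
To give you an idea, here’s a minimal sketch of what a config.pbtxt could look like for a hypothetical ONNX model; the model name, tensor names, datatypes and shapes are placeholders and have to match your actual model (note how they line up with the Python client code further down):

name: "my_onnx_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "INPUT_LAYER_1_NAME"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "INPUT_LAYER_2_NAME"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "model_output"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]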

Inference

Okay! Now we have our model running and we want to run inference against it and see how this basic setup performs. Of course, there are plenty more optimisations that can be layered on top of this to get better performance; that’s what Triton is all about. But for now, let’s make it work.

# Install the Triton client library for Python
pip install 'tritonclient[all]'

import tritonclient.http as httpclient

# Connect to the Triton server over HTTP
triton_client = httpclient.InferenceServerClient(url='<SERVER-IP-HERE>:8000')

def test_infer(model_name, input0_data, input1_data):
    inputs = []
    outputs = []
    # Declare the inputs; names, shapes and datatypes must match the model's config.pbtxt
    inputs.append(httpclient.InferInput('INPUT_LAYER_1_NAME', list(input0_data.shape), "INT64"))
    inputs.append(httpclient.InferInput('INPUT_LAYER_2_NAME', list(input1_data.shape), "INT64"))
    # Attach the numpy data to the inputs
    inputs[0].set_data_from_numpy(input0_data)
    inputs[1].set_data_from_numpy(input1_data)
    # Ask for the output tensor we care about
    outputs.append(httpclient.InferRequestedOutput('model_output', binary_data=False))
    # Run inference
    results = triton_client.infer(model_name, inputs, outputs=outputs)
    return results
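
For completeness, here’s a hypothetical call to the function above; the model name, input values, shapes and dtypes are placeholders that must match whatever your config.pbtxt declares:

import numpy as np

# Placeholder inputs; shapes and dtypes must match the model's config
input0 = np.array([[1, 2, 3]], dtype=np.int64)
input1 = np.array([[4, 5, 6]], dtype=np.int64)

results = test_infer("my_onnx_model", input0, input1)
# Convert the requested output back to a numpy array
print(results.as_numpy("model_output"))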

The function above is just one of the many examples provided here, adapted slightly. Feel free to look through them and find what you are looking for.

That’s it for this one. There’s plenty more we could do, but hopefully this gave you the basic idea of Triton and just how powerful it really is. Hold that thought; I’ll be right back with an advanced Triton guide!
