TensorRT and DeepStream for MaskRCNN: A Brief Overview

Prashant Dandriyal
11 min read · Dec 20, 2021

How lucky am I to have something that makes saying goodbye so hard ~Winnie the Pooh

The article may receive weekly updates as my work progresses.

This article documents my experiments with TensorRT and DeepStream, which I used along with some other amazing tools by NVIDIA. It is written in the context of Mask-RCNN and covers both the red herrings and the tested methods. Put concisely: I had to run Mask-RCNN on my edge device (NVIDIA's Jetson Xavier AGX), but faster. Let me introduce some references and terms before I start taking them for granted for the rest of the article.

Introduction to Mask RCNN & TensorRT

Mask-RCNN

Mask RCNN is a general framework for object instance segmentation, first presented by a team at Facebook AI Research (FAIR) back in 2017. It provides object detection using bounding boxes and instance segmentation using masks. The original paper also provides the code used to build the SOTA model, which even today is one of the best single-model instance segmentation frameworks. The authors used Facebook's Detectron but, as with most outstanding papers, the code isn't hackable. By hackable I mean that the code can't easily be adapted to our purpose. For example, Mask RCNN was originally benchmarked on the COCO dataset, which has 80 classes including humans, chairs, bottles, cats, dogs, etc., but if I need the model to also detect a custom class like an angel slug 🐌, it's not a straightforward and well-documented process.

Angel slug (giphy)

To overcome this, some amazing people at Matterport modified the original source and released their version of Mask RCNN using Keras and TensorFlow. Since then, this repo has become the de facto choice for Mask RCNN users. It details every single step required to customise the model and also documents the workings behind it excellently.

TensorRT

NVIDIA's TensorRT is a tool/SDK for high performance deep learning inference. It tries to increase inference throughput by optimising the model: it simplifies the model layers into simple instructions (floating point operations, FLOPs) that can be distributed among the many CUDA cores/workers; NVIDIA Jetson devices have hundreds of CUDA cores (the Xavier AGX has 512), so they pair naturally with TensorRT. It also opens the gateway to NVIDIA's entire optimisation stack, including DeepStream, Triton Inference Server, Riva, Merlin, Polygraphy, etc.; all paths to deploying an optimised model on NVIDIA devices involve or begin with TensorRT.

The TensorRT repository sees active development, and as of this writing we have TensorRT 8, although the MaskRCNN samples haven't received any updates. It boasts support for all popular frameworks like PyTorch, TensorFlow, ONNX, etc., which means users can easily import models trained with these frameworks into TensorRT.

Need for optimisation

The MaskRCNN model is based on ResNet101, and if you know what the 101 in its name stands for (the number of layers), you know how compute intensive it can get. The original COCO-based model took around 44 hours to train on an eight-GPU machine. The same model has an inference time of ~400 ms on the GPU. The official excerpts support this:

Although Mask R-CNN is fast, we note that our design is not optimized for speed, and better speed/accuracy tradeoffs could be achieved [21], e.g., by varying image sizes and proposal numbers, which is beyond the scope of this paper.

On the training part,

Mask R-CNN is also fast to train. Training with ResNet-50-FPN on COCO trainval35k takes 32 hours in our synchronized 8-GPU implementation (0.72s per 16-image mini-batch), and 44 hours with ResNet-101-FPN. In fact, fast prototyping can be completed in less than one day when training on the train set.

I don’t know how 44 hours is fast 😶‍🌫️but let’s focus on the 400ms inference time. A comparison is as follows:

Inference Time benchmark results

We discuss the method that yields ~250 ms in Method 1a: Using Uff parser. Note that while conducting the above-mentioned benchmark tests, the models were not deployed. The paper (as well as the Matterport repo) also mentions a faster alternative to ResNet101: ResNet50. But believe me, the performance degraded drastically in my application.

Many computer vision applications use a monocular camera, which typically delivers anywhere between 5 and 30 FPS. At ~400 ms per frame, MaskRCNN manages only ~2.5 FPS, below even the slowest of those cameras, which implies that every real-time application that uses MaskRCNN requires optimisation.

Optimisation Methods using TensorRT

1) Custom model trained using matterport

Most of us follow this path. The matterport mrcnn fork lets us train a model using either ResNet50 or ResNet101 as the backbone; a limited choice. Both backbones use COCO pretrained weights, which are downloaded automatically during training. The dependency is TensorFlow <= 1.15 (GPU build), which is pretty outdated. The exported model consists of .h5 weights rather than the more common frozen graph, SavedModel or checkpoint formats.
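
For reference, training with the matterport fork boils down to a few lines. Here is a minimal sketch, assuming you have subclassed mrcnn.utils.Dataset for your own data (SlugConfig is illustrative, and dataset_train/dataset_val are placeholders for instances of that hypothetical Dataset subclass):

from mrcnn.config import Config
from mrcnn import model as modellib

class SlugConfig(Config):
    NAME = "angel_slug"
    BACKBONE = "resnet101"   # or "resnet50"; the only two choices
    NUM_CLASSES = 1 + 1      # background + angel slug
    IMAGES_PER_GPU = 1
    STEPS_PER_EPOCH = 100

config = SlugConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")

# start from the COCO pretrained weights, skipping the heads that depend on the class count
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# dataset_train / dataset_val: instances of your mrcnn.utils.Dataset subclass
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=30, layers="heads")

# the artefact we carry forward into TensorRT is just the h5 weights
model.keras_model.save_weights("mask_rcnn_custom.h5")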

Once we have the h5 weights, we have two options:

1.a) Using Uff parser (deprecated but works!)

The method is as follows, split into two parts: inference and deployment. Most people may never need deployment:

Inference:  h5 -> Uff -> trt inference engine (.engine)
Deployment: Uff -> trt inference engine (.engine) -> DeepStream

Inference involves converting the weights to the Universal Framework Format (Uff). Uff helps break the model graph into layers that are further distributed among GPU cores to speed up computation. For projects that do not require deployment, the inference procedure is the same on an x86 host and on Jetson devices (edge). The method requires outdated tools like TensorFlow 1.15 and Uff, and also needs some patches to the TensorRT files. I have detailed the steps in another article, Mask-RCNN Optimisation and Inference using TensorRT (TBD); it explains the process of obtaining the Uff model from the h5 weights and focuses on running inference on the x86 host and on Jetson.
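
To give a feel for the Uff-to-engine step, here is a minimal sketch using TensorRT's (now deprecated) Python Uff parser, assuming TensorRT 7.x as bundled with JetPack 4.5.1 (newer releases move these settings to the builder config). The input/output names match the trtexec command used later in this article; the file names are placeholders:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
# Mask-RCNN needs the TensorRT plugins (proposal/ROI layers live there)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.UffParser() as parser:
    builder.max_workspace_size = 1 << 30   # 1 GiB of scratch space
    builder.fp16_mode = True               # FP16 is where Jetson shines
    parser.register_input("input_image", (3, 1024, 1024))
    parser.register_output("mrcnn_detection")
    parser.register_output("mrcnn_mask/Sigmoid")
    parser.parse("mrcnn_nchw.uff", network)
    engine = builder.build_cuda_engine(network)

with open("mrcnn_nchw.engine", "wb") as f:
    f.write(engine.serialize())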

1.b) Using ONNX parser (recommended but doesn’t work for Mask-RCNN currently)

Inference:  h5 -> pb -> onnx -> trt inference engine (.engine)
Deployment: onnx -> trt inference engine (.engine) -> DeepStream

The docs demonstrate this method for ResNet50-based models, while ResNet101-based models are not supported in the versions of Keras used in the demo. I retrained my model using ResNet50 but the problem persisted. So I found a bit of a hack to obtain the frozen graph (.pb file) and documented the method in my article: Mask-RCNN h5 weights to frozen graph. This hack leaves the model's input and output dims unclear, leading to further problems when exporting the inference engine:

ONNX parser is unable to parse Input, Output layer attributes for this frozen graph

As an alternative, suggested by one of the tf2onnx developers/maintainers, we could obtain a SavedModel from the h5 weights and let ONNX derive the frozen graph automatically. But as MaskRCNN is somewhat old (discussed in the last section of this article) and uses TensorFlow 1.x, we have no way to obtain the SavedModel format.

The conversion to ONNX itself is simple (shown below going directly from the h5 weights via keras2onnx, using NVIDIA's sample semantic segmentation model):

!pip3 uninstall -y keras-nightly
!pip3 uninstall -y tensorflow
!pip3 install keras==2.1.6
!pip3 install tensorflow==1.15.0
!pip3 install h5py==2.10.0
!pip3 install onnx==1.10.1
!pip3 uninstall -y tf2onnx
!pip install git+https://github.com/onnx/tensorflow-onnx
!pip install -U keras2onnx pillow pycuda scikit-image
# get sample model; replace with your model
!wget 'https://developer.download.nvidia.com/devblogs/semantic_segmentation.hdf5'
import keras
from keras2onnx import convert_keras

def keras_to_onnx(model, output_filename):
    # convert the in-memory Keras model and serialise the ONNX graph to disk
    onnx_model = convert_keras(model, output_filename)
    with open(output_filename, "wb") as f:
        f.write(onnx_model.SerializeToString())
semantic_model = keras.models.load_model('/content/semantic_segmentation.hdf5')
keras_to_onnx(semantic_model, 'semantic_segmentation.onnx')
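
For completeness, this is roughly what the onnx-to-engine step would look like with TensorRT's Python ONNX parser (a sketch assuming TensorRT 7.x/8.x; file names are placeholders). For the Mask-RCNN frozen graph discussed above, this is exactly where the parser reports the input/output problem:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

def build_engine_from_onnx(onnx_path, engine_path):
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(EXPLICIT_BATCH) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser, \
         builder.create_builder_config() as config:
        config.max_workspace_size = 1 << 30
        with open(onnx_path, "rb") as f:
            if not parser.parse(f.read()):
                # for Mask-RCNN, parsing fails here
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                return None
        engine = builder.build_engine(network, config)
        with open(engine_path, "wb") as f:
            f.write(engine.serialize())
        return engine

build_engine_from_onnx("semantic_segmentation.onnx", "semantic_segmentation.engine")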

2) Custom model trained using NVIDIA Transfer learning toolkit (TLT)

The method is far more sophisticated and is recommended by NVIDIA as well. It leverages NVIDIA's Transfer Learning Toolkit (TLT) and the TAO (Train, Adapt and Optimize) toolkit. The tool(s) are meant to be the Swiss Army knife of the ML model development lifecycle: they support train, evaluate, inference and export sub-tasks. All it takes is comfort with the CLI and editing a lot of config text files 🥲, but it's worth it. A sample prototype for MaskRCNN:

$ tlt mask_rcnn <sub_task> <args_per_subtask>

Read more at the official NVIDIA Metropolis page. The best part about it is that the method is tested to give zero friction when moving from development to deployment. The inference engine can be easily exported using TLT with FP16, FP32 & INT8 precision options, but the exported engine is exclusive to the device used to export it.

There are two caveats here: firstly, the train sub_task of tlt requires the data to be in TFRecords format, and one must figure out the correct way to convert their (generally large) datasets; secondly, once the engine is exported, the deployment process is the same as method 1: hail lord DeepStream 🤌🏽.
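
On the TFRecords caveat, here is a generic sketch of what the serialisation looks like with tf.train.Example. The exact feature schema TLT expects for MaskRCNN is defined by its own tooling and docs, so treat the field names below purely as placeholders:

import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def write_examples(records, output_path):
    # records: an iterable of dicts with encoded image bytes and metadata
    with tf.io.TFRecordWriter(output_path) as writer:
        for rec in records:
            example = tf.train.Example(features=tf.train.Features(feature={
                "image/encoded": _bytes_feature(rec["image_bytes"]),
                "image/height": _int64_feature(rec["height"]),
                "image/width": _int64_feature(rec["width"]),
            }))
            writer.write(example.SerializeToString())

# hypothetical usage:
# write_examples(my_records, "train-00000-of-00256.tfrecord")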

Deployment Methods using DeepStream

Here we discuss the methods that claim to deploy the Mask-RCNN model optimised using either of the approaches discussed above, and my experiences with them. Note that deployment requires an NVIDIA Jetson device; I used an NVIDIA Jetson Xavier AGX with the JetPack 4.5.1 image, which comes pre-installed with DeepStream 5.1. This section includes the red herrings I mentioned, so you don't waste your time like I did. Still, I think detailing the steps that lead nowhere is worthwhile, as someone may actually derive a working solution from them; I was able to hack one of the samples and create a custom input/output pipeline without the need for DeepStream, but that's another story…

1. deepstream_python_apps samples (TL;DR doesn’t work)

NVIDIA's deepstream_python_apps repo provides a Python sample for segmentation deployment on DeepStream.

On running the sample

$ python3 deepstream_segmentation.py custom_config.txt /home/virus/Desktop/optimisation/short.jpg /home/virus/Desktop/results1

I got improper masks that were a simple grey line 😑. It turned out that the sample was designed for UNET and was not meant for Mask-RCNN.

2. deepstream-mrcnn-app sample

This sample builds on top of the deepstream-test1 sample to demonstrate how to use the “nvmsgconv” and “nvmsgbroker” plugins in the pipeline. So, we first check that all the components of the DeepStream pipeline work fine; to do this, the deepstream-test1 sample must be compiled and run first.

Run deepstream-test1

Generate engine file for custom model

Run trtexec to export engine

cd /usr/src/tensorrt/bin
sudo ./trtexec --uff=/home/virus/Desktop/optimisation/res101-holygrail-ep26.uff --uffInput=input_image,3,1024,1024 --output=mrcnn_detection,mrcnn_mask/Sigmoid --fp16 --saveEngine=res101-holygrail-ep26-fp16.engine

cp res101-holygrail-ep26-fp16.engine ~/Desktop/optimisation/
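
Before wiring this engine into DeepStream, a quick sanity check with the TensorRT Python API and pycuda is useful. This is a rough sketch, assuming an implicit-batch Uff engine with the input as binding 0; the engine file name matches the trtexec command above:

import numpy as np
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

with open("res101-holygrail-ep26-fp16.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# allocate host/device buffers for every binding (input and outputs)
bindings, host_bufs, dev_bufs = [], [], []
for binding in engine:
    size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host = cuda.pagelocked_empty(size, dtype)
    dev = cuda.mem_alloc(host.nbytes)
    bindings.append(int(dev)); host_bufs.append(host); dev_bufs.append(dev)

# push a dummy image through, just to confirm the engine executes
host_bufs[0][:] = np.random.random(host_bufs[0].shape).astype(host_bufs[0].dtype)
stream = cuda.Stream()
cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
context.execute_async(batch_size=1, bindings=bindings, stream_handle=stream.handle)
for host, dev in zip(host_bufs[1:], dev_bufs[1:]):
    cuda.memcpy_dtoh_async(host, dev, stream)
stream.synchronize()
print("output buffer sizes:", [h.shape for h in host_bufs[1:]])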

Install rabbitMQ/AMQ

The sample pipeline is meant to use a messaging protocol, so the user must pick one of the three available options (RabbitMQ/AMQP, Kafka or Azure IoT) to receive the inference results.
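
If you go the AMQP route, a tiny consumer script helps confirm that messages actually arrive. This is a hedged sketch using pika; the host, credentials, exchange and routing key below are placeholders and must match your RabbitMQ setup and cfg_amqp.txt:

import pika

# placeholders: match these to your broker and cfg_amqp.txt
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="localhost",
                              credentials=pika.PlainCredentials("guest", "guest")))
channel = connection.channel()

# bind a throwaway queue to the exchange/topic that nvmsgbroker publishes to
result = channel.queue_declare(queue="", exclusive=True)
queue_name = result.method.queue
channel.queue_bind(exchange="amq.topic", queue=queue_name, routing_key="#")

def on_message(ch, method, properties, body):
    # each message body is the JSON payload produced by nvmsgconv
    print(body.decode())

channel.basic_consume(queue=queue_name, on_message_callback=on_message, auto_ack=True)
channel.start_consuming()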

Now run the sample server in a new terminal

## Open new terminal and enable & start logger
sudo chmod u+x /opt/nvidia/deepstream/deepstream-5.1/sources/tools/nvds_logger/setup_nvds_logger.sh
sudo /opt/nvidia/deepstream/deepstream-5.1/sources/tools/nvds_logger/setup_nvds_logger.sh
## Setup AMQ config for mrcnn (Edit `cfg_amqp.txt`) if you need to
cd /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-mrcnn-test
sudo nano cfg_amqp.txt
## It's best to leave the file unchanged for now.

Make changes in the config file dsmrcnn_pgie_config.txt. The config file contains the paths to the model files, which are missing in JetPack 4.5.1, so we need to get them first; I'll demonstrate this for the sample COCO model.

cd /opt/nvidia/deepstream/deepstream/samples/models/tlt_pretrained_models
wget https://nvidia.box.com/shared/static/8k0zpe9gq837wsr0acoy4oh3fdf476gq.zip -O models.zip
sudo unzip models.zip
sudo cp -r models/mrcnn/ .

We don’t need to edit dsmrcnn_pgie_config.txt now as we specifically placed all files in the paths mentioned in the config file.

Obtain engine file using trtexec as shown above and move it to the path specified in the config file.

Build the sample:

cd /opt/nvidia/deepstream/deepstream-5.1/sources/apps/sample_apps/deepstream-mrcnn-test
sudo CUDA_VER=10.2 make clean
sudo CUDA_VER=10.2 make

CUDA version can be found as

cat /usr/local/cuda/version.txt

Infer: The sample uses gstreamer plugins to decode only h264 video input. So, any test input must first be converted into h264 format. For demonstration purposes, we can use the sample input files available in /opt/nvidia/deepstream/deepstream/samples/streams.
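
If your own test clip is not already an H.264 elementary stream, something along these lines produces one (a sketch that assumes ffmpeg is installed; paths are placeholders):

import subprocess

def to_h264(src, dst="test_input.h264"):
    # re-encode the clip into a raw H.264 elementary stream, dropping audio
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-an",
         "-c:v", "libx264",
         "-f", "h264", dst],
        check=True)
    return dst

to_h264("my_test_clip.mp4")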

deepstream-mrcnn-app -i /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 -p /opt/nvidia/deepstream/deepstream/lib/libnvds_amqp_proto.so  --no-display -c cfg_amqp.txt

The inference fails as the sample is hard-coded for an ANPR application 😑. I will try modifying the source and report here.

3. Using PeopleSegNet sample from TAO-apps repo

Counterintuitively, the TAO apps repo has a sample for PeopleSegNet, which is based on the same architecture as MaskRCNN 😶. The TAO toolkit page explains this, although this fact was difficult to find.

Clone the repo

$ export DS_SRC_PATH=/opt/nvidia/deepstream/deepstream-5.1/
# Jetson sometimes fails to pick up exported environment variables, so ^ may not be enough.
# A workaround is to pass the variables explicitly as flags during compilation (shown below).
$ git clone https://github.com/NVIDIA-AI-IOT/deepstream_4.x_apps.git

The application doesn't work out-of-the-box and requires changes to the Makefile & the config file pgie_mrcnn_uff_config.txt:

diff --git a/Makefile b/Makefile
index 80e6502..f11bfac 100644
--- a/Makefile
+++ b/Makefile
@@ -13,7 +13,7 @@ APP:= deepstream-custom
 TARGET_DEVICE = $(shell gcc -dumpmachine | cut -f1 -d -)
-NVDS_VERSION:=4.0
+NVDS_VERSION:=5.1
 LIB_INSTALL_DIR?=/opt/nvidia/deepstream/deepstream-$(NVDS_VERSION)/lib/

diff --git a/nvdsinfer_customparser_mrcnn_uff/nvdsinfer_custombboxparser_mrcnn_uff.cpp b/nvdsinfer_customparser_mrcnn_uff/nvdsinfer_custombboxparser_mrcnn_uff.cpp
index d8ac0d4..90ceab6 100644
--- a/nvdsinfer_customparser_mrcnn_uff/nvdsinfer_custombboxparser_mrcnn_uff.cpp
+++ b/nvdsinfer_customparser_mrcnn_uff/nvdsinfer_custombboxparser_mrcnn_uff.cpp
@@ -28,7 +28,7 @@ static const int DETECTION_MAX_INSTANCES = 100;
 static const int NUM_CLASSES = 1 + 80; // COCO has 80 classes
 static const int MASK_POOL_SIZE = 14;
-static const nvinfer1::DimsCHW INPUT_SHAPE{3, 1024, 1024};
+static const nvinfer1::Dims3 INPUT_SHAPE{3, 1024, 1024};
 //static const Dims2 MODEL_DETECTION_SHAPE{DETECTION_MAX_INSTANCES, 6};
 //static const Dims4 MODEL_MASK_SHAPE{DETECTION_MAX_INSTANCES, NUM_CLASSES, 28, 28};

diff --git a/pgie_mrcnn_uff_config.txt b/pgie_mrcnn_uff_config.txt
index b169d1d..5422121 100644
--- a/pgie_mrcnn_uff_config.txt
+++ b/pgie_mrcnn_uff_config.txt
@@ -50,7 +50,7 @@ offsets=103.939;116.779;123.68
 model-color-format=1
 labelfile-path=./nvdsinfer_customparser_mrcnn_uff/mrcnn_labels.txt
 uff-file=./mrcnn_nchw.uff
-model-engine-file=./mrcnn_nchw.uff_b1_fp32.engine
+model-engine-file=./mrcnn_nchw.uff_b1_gpu0_fp32.engine
 uff-input-dims=3;1024;1024;0
 uff-input-blob-name=input_image
 batch-size=1

Notice that the exported engine has been renamed to mrcnn_nchw.uff_b1_gpu0_fp32.engine. The example is for the 80-class COCO model (actually, the entire article is…) but for a custom model, the nvdsinfer_customparser_mrcnn_uff/mrcnn_labels.txt & pgie_mrcnn_uff_config.txt files must be adapted. The sample (like the deepstream-mrcnn-app sample) requires a monitor; a headless (SSH-based) system requires passing additional arguments to the infer command, but I will not discuss that here. Build and infer as:

$ cd deepstream_4.x_apps/nvdsinfer_customparser_mrcnn_uff/
$ CUDA_VER=10.2 DS_SRC_PATH=/opt/nvidia/deepstream/deepstream-5.1/ make
$ cd ../
$ make
$ ./deepstream-custom pgie_mrcnn_uff_config.txt /opt/nvidia/deepstream/deepstream-5.1/samples/streams/sample_720p.h264

Notice that the 5.1 signifies our DeepStream version. If everything went well, the application should open frames with overlaid masks as results. It doesn't end here, though: to test the inference engine on a custom dataset, you must first convert the test data into an h264 video, and to infer using an image input, or any other input (a camera, maybe…), you need to customise the gstreamer pipeline. The /opt/nvidia/deepstream/deepstream-5.1/sources/apps/sample_apps/deepstream-image-decode-test sample can be used as a reference for images as gstreamer input.

Remarks

Of all the methods mentioned above (and some that were not), this is what I concluded after months of experimenting with Mask-RCNN optimisation using NVIDIA's tools:

  • Mask-RCNN itself is getting old (the paper was released on 20 March 2017), although it remains one of the OGs of instance segmentation 🦹‍♂️. Its popularity can be attributed to how easy it is to train custom models; newer models/methods have a complex training process/architecture, mostly explained in the paper but not backed by usable code. This makes reproducing the results difficult for someone who has just discovered the paper. In many cases, even when a repository with the code used to produce the paper's results is provided, the authors are too busy to explain their code or how they prepared their training data.
  • NVIDIA is focused on newer models and architectures; instead of making the process of optimising the Mask-RCNN model concise and well documented, they are moving development to TAO, Triton Inference Server, etc. They still haven't solved the problems with the ONNX parser for Mask-RCNN. This may be good or bad depending on the user's willingness to train their model from scratch using the TAO toolkit. What if the sample COCO-based model already had their desired classes and they only needed to deploy it ASAP? So,
  • Either train your model from scratch using TLT. This is one of the methods we discussed above and seems to be well documented by NVIDIA. It may become the de facto method for MaskRCNN optimisation.
  • Or create your own pipeline by hacking the Uff-parser method. This method has been discussed in the article (TBD) and is a tested option for anybody who wants to use their existing h5 weights instead of training again (using TLT/TAO). Even though it is a deprecated method (as NVIDIA says) and I managed to make it work, the input pipeline remains a pain; DeepStream and gstreamer need a lot more effort. Bypassing them means compromising on throughput while retaining the model's accuracy.

If this helped you, you can also show gratitude with some NFTs/crypto 🤠

SOL wallet: C1Rz6QLkg6WSWvYJdbcdyM1ah5fYkWU5TDecMM6CirX2

BSC wallet: 0x47acC4B652166AE55fb462e4cD7DD26eFa97Da04

ETH wallet: 0x47acC4B652166AE55fb462e4cD7DD26eFa97Da04
