How to Improve Performance of Object Detection Model Deployed with VirtualMachine

Hi:

We are deploying an SSD model with the virtual machine. The backbone of the model is offloaded to TensorRT while the other operators are left on the CPU, which seems to be the fastest of the target allocation strategies we tried.

The mean inference time is 115 ms, while TensorFlow takes less than 80 ms even without TensorRT. When we profile the compiled model with the VirtualMachineProfiler, the invoked ops that take most of the inference time are:

#OpName                       	#InvokeCount	#Duration(us): Sum/Mean/Min/Max
tensorrt_2                    	1         	53173.5/53173.5/53173.5/53173.5
fused_vision_non_max_suppression	90        	22677.2/251.968/230.512/301.215
fused_vision_get_valid_counts 	90        	21406.4/237.849/207.425/256.47
fused_expand_dims_concatenate_expand_dims	90        	6885.56/76.5063/68.671/87.876

The nms and get_valid_counts ops are each called 90 times in sequence, once per object class.

So is there a way to invoke ops in parallel in the VirtualMachine to improve performance?


Please try the latest code; both NMS and get_valid_counts should be much faster as of this week. Specifically, these two commits should make GPU NMS much faster.

Ideally it should be possible to batch 90 class NMS in one NMS. PyTorch does that and our GPU NMS code can also take boxes belonging to multiple classes in one go.

Thanks for your reply.

After merging the latest code, the inference time with the backbone offloaded to TensorRT and the other operators on GPU decreased from 194.45 ms to 85.75 ms, which is a remarkable improvement.

The inference time with the backbone offloaded to TensorRT and the other operators on CPU remains 115 ms.

The profile result with GPU operators and TensorRT:

#OpName                       	#InvokeCount	#Duration(us): Sum/Mean/Min/Max
fused_vision_get_valid_counts 	90        	40474.9/449.721/445.561/452.059
tensorrt_2                    	1         	32149.6/32149.6/32149.6/32149.6
fused_vision_non_max_suppression	90        	23548.1/261.646/256.956/354.405

The profile result with CPU operators and TensorRT:

#OpName                       	#InvokeCount	#Duration(us): Sum/Mean/Min/Max
tensorrt_2                    	1         	53196.9/53196.9/53196.9/53196.9
fused_vision_non_max_suppression	90        	25387.6/282.084/263.126/330.301
fused_vision_get_valid_counts 	90        	21441.1/238.234/201.312/265.247

According to the profile results, a single get_valid_counts call is still clearly faster on CPU than on GPU (about 238 us versus 450 us on average), while a single nms call takes roughly the same time on both (about 282 us versus 262 us). However, the inference time of the whole network is much shorter on GPU than on CPU, and I assume this is because instructions in the VirtualMachine can only be executed one at a time.
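To put the two profiles side by side, here is a small sketch that sums the per-class ops (nms and get_valid_counts) and compares them with the TensorRT subgraph. The durations are the Sum values copied from the tables above; the dictionary layout and the helper name `per_class_share` are only illustrative. Only the ops listed in the tables are counted, so the share is relative to those ops, not to the full inference time.

```python
# Sum durations in microseconds, copied from the profiler tables above.
profiles = {
    "GPU + TensorRT": {
        "tensorrt_2": 32149.6,
        "fused_vision_get_valid_counts": 40474.9,
        "fused_vision_non_max_suppression": 23548.1,
    },
    "CPU + TensorRT": {
        "tensorrt_2": 53196.9,
        "fused_vision_get_valid_counts": 21441.1,
        "fused_vision_non_max_suppression": 25387.6,
    },
}

# Ops that are invoked once per object class (90 times each).
PER_CLASS_OPS = ("fused_vision_get_valid_counts", "fused_vision_non_max_suppression")


def per_class_share(ops_us):
    """Return (per-class time in ms, share of the listed ops' total time)."""
    per_class_us = sum(ops_us[name] for name in PER_CLASS_OPS)
    listed_total_us = sum(ops_us.values())
    return per_class_us / 1000.0, per_class_us / listed_total_us


for name, ops_us in profiles.items():
    ms, share = per_class_share(ops_us)
    print(f"{name}: {ms:.1f} ms in per-class ops ({share:.0%} of the listed ops)")
```

In both configurations the 180 small per-class invocations account for a large part of the listed time, so reducing their number looks more promising than running them concurrently.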

So, to improve performance when deploying on CPU and TensorRT, is there a way to execute some instructions in parallel in the VirtualMachine?

No, there is no concurrency of any kind in our runtime. The best approach for your case is to batch the 90 NMS calls into one batched NMS. I guess you are currently running many small NMS calls, which is not a good fit for GPU. That is probably why CPU is faster.

If you can change your model, I think you can use tf.image.combined_non_max_suppression (https://www.tensorflow.org/api_docs/python/tf/image/combined_non_max_suppression) to do batched NMS. There is also a way that does not require changing the TF model, but it involves pattern matching and rewriting.
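To illustrate what batching the per-class NMS calls means, here is a minimal pure-Python sketch of the coordinate-offset trick: boxes of each class are shifted into a disjoint region so that a single NMS pass never suppresses across classes. This is not the TVM or TensorFlow implementation; the function names `iou`, `nms` and `batched_nms` and the `(x1, y1, x2, y2)` box layout are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Plain greedy NMS; returns kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def batched_nms(boxes, scores, class_ids, iou_threshold):
    """Run NMS for all classes in one pass using per-class coordinate offsets."""
    if not boxes:
        return []
    lo = min(min(b) for b in boxes)
    hi = max(max(b) for b in boxes)
    span = (hi - lo) + 1.0  # wide enough that shifted classes cannot overlap
    shifted = [tuple(c + cls * span for c in box)
               for box, cls in zip(boxes, class_ids)]
    return nms(shifted, scores, iou_threshold)


# Boxes 0 and 1 overlap heavily and share a class, so box 1 is suppressed.
# Box 2 has the same coordinates as box 0 but a different class, so it is kept.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (0, 0, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7, 0.6]
class_ids = [0, 0, 1, 1]
print(batched_nms(boxes, scores, class_ids, iou_threshold=0.5))
```

With this formulation the model issues one NMS over all candidate boxes instead of 90 small ones, which is the shape of workload that suits a GPU kernel better.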