FP16 Inference with YOLOX Doesn't Provide Speedup

I have been benchmarking YOLOX on an NVIDIA GPU with TVM and TensorRT, but for some reason fp16 does not provide any speedup for TVM (it provides a huge speedup for TensorRT). Is this expected? Is there a way I can improve fp16 performance?

I have tuned both fp32 and fp16 models without the autoscheduler until I stopped getting a speed increase (I experimented with the autoscheduler, and the results were worse).

My modified version of the YOLOX GitHub repository is here, where you can find the modified evaluation script and the TVM tuning script.

The GPU I’m using is NVIDIA GeForce RTX 3060 Laptop GPU.

Here are some of the results I have gotten. I list the base model, YOLOX run with TensorRT, and then YOLOX run with TVM for YOLOX-M and YOLOX-TINY.

| Model / metric | Base fp32 | Base fp16 | TRT fp32 | TRT fp16 | TVM fp32 | TVM fp16 |
|---|---|---|---|---|---|---|
| YOLOX-M mAP | 46.9 | 46.9 | 47.1 | 47.1 | 47.1 | 47.1 |
| YOLOX-M forward time | 17.72 ms | 12.13 ms | 14.58 ms | 6.38 ms | 14.20 ms | 14.15 ms |
| YOLOX-TINY mAP | 32.8 | 32.8 | 33.0 | 33.0 | 33.0 | 33.0 |
| YOLOX-TINY forward time | 6.18 ms | 6.70 ms | 3.86 ms | 2.24 ms | 4.19 ms | 5.57 ms |

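As you can see, fp16 evaluation does not offer a noticeable speed increase in the case of TVM: YOLOX-M is essentially unchanged (14.20 ms vs. 14.15 ms), and YOLOX-TINY is actually slower (4.19 ms vs. 5.57 ms).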
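To put numbers on that, here is a quick sketch that computes the fp16 speedup (fp32 time divided by fp16 time) for each backend from the forward times reported above. It assumes the second pair of result rows belongs to YOLOX-TINY; the dictionary layout and the `fp16_speedup` helper are only illustrative names.

```python
# Forward times in milliseconds as (fp32, fp16), copied from the results above.
# Assumption: the second pair of result rows is YOLOX-TINY.
forward_ms = {
    ("YOLOX-M", "base"): (17.72, 12.13),
    ("YOLOX-M", "TRT"): (14.58, 6.38),
    ("YOLOX-M", "TVM"): (14.20, 14.15),
    ("YOLOX-TINY", "base"): (6.18, 6.70),
    ("YOLOX-TINY", "TRT"): (3.86, 2.24),
    ("YOLOX-TINY", "TVM"): (4.19, 5.57),
}


def fp16_speedup(fp32_ms, fp16_ms):
    """Return how many times faster fp16 is than fp32 (values below 1 mean slower)."""
    return fp32_ms / fp16_ms


speedups = {key: fp16_speedup(*times) for key, times in forward_ms.items()}

for (model, backend), speedup in speedups.items():
    print(f"{model:<10} {backend:<4} fp16 speedup: {speedup:.2f}x")
```

A ratio near 1.0 means fp16 made no difference, and a ratio below 1.0 means fp16 was slower than fp32.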

Let me know if I can provide any more useful information!


This thread may be useful. TL;DR: unless you use Tensor Cores, FP16 will not give you the performance improvements seen in TRT.

Thanks for the response!

I see. However, if I am not mistaken, both the GPU I am using (NVIDIA GeForce RTX 3060 Laptop GPU) and the GPU used in the thread you referenced (NVIDIA AGX Xavier) have Tensor Cores. Is the issue that TVM does not make use of them automatically? If so, what do I need to do to make use of them?

To use Tensor Cores in tuning, you need to use MetaSchedule (MS), not the auto-scheduler. Unfortunately, tvmc doesn’t seem to support MetaSchedule yet. See an example of how to use MS here: https://github.com/apache/tvm/blob/main/python/tvm/meta_schedule/testing/tune_onnx.py

MS tutorial is also overdue. cc @zxybazh @junrushao

If you are OK with using external compilers alongside TVM, you can also use the BYOC path to partition your Relay graph into TensorRT and CPU subgraphs. That way, TensorRT will optimize its own subgraph, and you can choose between FP16 and FP32 runtimes. Another possibility may be using the CUTLASS library.

Hope that helps.

Ah, I see now. Thank you guys! I will look into all this!