FP16 Inference with YOLOX Doesn't Provide Speedup

I have been benchmarking YOLOX on an NVIDIA GPU with TVM and TensorRT, but for some reason fp16 does not provide any speedup for TVM (it provides a huge speedup for TensorRT). Is this expected? Is there a way I can improve fp16 performance?

I have tuned both the fp32 and fp16 models without the auto-scheduler (i.e., with AutoTVM) until I stopped seeing a speed increase. I experimented with the auto-scheduler, but the results were worse.

My modified version of the YOLOX GitHub repository is here, where you can find the modified evaluation script and the TVM tuning script.
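In case the details matter, the fp16 path in my script boils down to the standard Relay mixed-precision conversion followed by AutoTVM tuning. Here is a minimal sketch (not the exact script; it assumes `mod` and `params` were already imported into Relay, e.g. via `relay.frontend.from_onnx`, and the trial count is just an example):

```python
import tvm
from tvm import autotvm, relay

# Assumption: `mod` and `params` come from a Relay frontend import,
# e.g. mod, params = relay.frontend.from_onnx(onnx_model, shape_dict).
target = tvm.target.Target("cuda")

# Convert the fp32 module to mixed precision (fp16 where considered safe).
mod = relay.transform.InferType()(mod)
mod = relay.transform.ToMixedPrecision(mixed_precision_type="float16")(mod)

# Extract AutoTVM tasks and tune each one, logging the best schedules.
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3),
)
for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=min(2000, len(task.config_space)),
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("yolox_tune.log")],
    )

# Build with the best schedules found during tuning.
with autotvm.apply_history_best("yolox_tune.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```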

The GPU I’m using is an NVIDIA GeForce RTX 3060 Laptop GPU.

Here are some of the results I have gotten. For each of YOLOX-M and YOLOX-TINY, I list the base model, the model run with TensorRT, and the model run with TVM.

| Model | YOLOX-M fp32 | YOLOX-M fp16 | YOLOX-M TRT fp32 | YOLOX-M TRT fp16 | YOLOX-M TVM fp32 | YOLOX-M TVM fp16 |
| --- | --- | --- | --- | --- | --- | --- |
| mAP | 46.9 | 46.9 | 47.1 | 47.1 | 47.1 | 47.1 |
| forward time | 17.72 ms | 12.13 ms | 14.58 ms | 6.38 ms | 14.20 ms | 14.15 ms |

| Model | YOLOX-TINY fp32 | YOLOX-TINY fp16 | YOLOX-TINY TRT fp32 | YOLOX-TINY TRT fp16 | YOLOX-TINY TVM fp32 | YOLOX-TINY TVM fp16 |
| --- | --- | --- | --- | --- | --- | --- |
| mAP | 32.8 | 32.8 | 33.0 | 33.0 | 33.0 | 33.0 |
| forward time | 6.18 ms | 6.70 ms | 3.86 ms | 2.24 ms | 4.19 ms | 5.57 ms |

As you can see, mixed-precision evaluation offers no noticeable speedup with TVM; for YOLOX-TINY, fp16 is actually slower than fp32.

Let me know if I can provide any more useful information!

Hello,

This thread may be useful. TL;DR: unless you use Tensor Cores, FP16 will not give you the performance improvements seen in TRT.

Thanks for the response!

I see. However, if I am not mistaken, both the GPU I am using (NVIDIA GeForce RTX 3060 Laptop GPU) and the GPU used in the thread you referenced (NVIDIA AGX Xavier) have Tensor Cores. Is the issue that TVM does not make use of them automatically? If so, what do I need to do to make use of them?

To use Tensor Cores in tuning, you need to use MetaSchedule, not the auto-scheduler. Unfortunately, tvmc doesn’t seem to support MetaSchedule yet. See an example of how to use MS here: https://github.com/apache/tvm/blob/main/python/tvm/meta_schedule/testing/tune_onnx.py
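Roughly, the flow in that script is the following (a minimal sketch assuming a recent build of TVM from main, where the `ms.relay_integration` API is available; `mod` and `params` are your Relay module and weights, and the trial budget is just an example):

```python
import tvm
from tvm import meta_schedule as ms

# Tensor Core schedules require the GPU architecture in the target
# (sm_86 for an RTX 3060); adjust -arch for your device.
target = tvm.target.Target("cuda -arch=sm_86")

# Tune all tasks extracted from the Relay module and record the results
# in a database under work_dir.
database = ms.relay_integration.tune_relay(
    mod=mod,
    params=params,
    target=target,
    work_dir="./ms_work_dir",
    max_trials_global=20000,  # example budget; lower it for quick experiments
)

# Compile the module using the best schedules found in the database.
lib = ms.relay_integration.compile_relay(
    database=database,
    mod=mod,
    target=target,
    params=params,
)
```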

An MS tutorial is also overdue. cc @zxybazh @junrushao

If you are OK with using external compilers as well as TVM, you can also use the BYOC path to partition your Relay graph into TensorRT and CPU subgraphs. That way, TensorRT will optimize its subgraphs. You can choose between FP16 and FP32 runtimes. Another possibility is the CUTLASS library, also via BYOC.
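For the TensorRT path, the partitioning itself is only a few lines. A sketch, assuming TVM was built with the TensorRT codegen enabled (note that the return value of `partition_for_tensorrt` has changed across TVM versions, so check your version's signature):

```python
import tvm
from tvm import relay
from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt

# Assumption: `mod` and `params` hold the fp32 Relay module. TVM must be
# built with the TensorRT codegen; the machine you run on also needs the
# TensorRT runtime.
mod = partition_for_tensorrt(mod, params)  # some versions return (mod, config)

# Build as usual; the TensorRT subgraphs are handled by the BYOC runtime.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda", params=params)
```

If I remember correctly, FP16 for the TensorRT subgraphs can then be toggled at runtime with the `TVM_TENSORRT_USE_FP16=1` environment variable.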

Hope that helps.

Ah, I see now. Thank you guys! I will look into all this!