I have been benchmarking YOLOX on an NVIDIA GPU with TVM and TensorRT, but for some reason fp16 does not provide any speedup for TVM (it provides a huge speedup for TensorRT). Is this expected? Is there a way I can improve fp16 performance?
I have tuned both the fp32 and fp16 models with AutoTVM (without the auto-scheduler) until the speed stopped improving. I also experimented with the auto-scheduler, but the results were worse.
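For context, the fp16 Relay module is produced with TVM's mixed-precision pass before tuning. Below is a minimal sketch of that conversion, not my exact script; `mod` and `params` are placeholders for whatever the frontend importer returns, and the build lines are commented out because they need a CUDA-enabled TVM install.

```python
# Sketch: converting a Relay module to fp16 before tuning/compiling.
# Assumes `mod`/`params` come from a Relay frontend importer
# (e.g. relay.frontend.from_onnx); names are placeholders.
import tvm
from tvm import relay

def to_fp16(mod):
    # ToMixedPrecision rewrites eligible ops to float16 while keeping
    # numerically sensitive ops (e.g. softmax, sum) in float32.
    mod = relay.transform.InferType()(mod)
    mod = relay.transform.ToMixedPrecision("float16")(mod)
    return mod

# target = "cuda"  # RTX 3060 Laptop GPU (Ampere)
# with tvm.transform.PassContext(opt_level=3):
#     lib = relay.build(to_fp16(mod), target=target, params=params)
```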
My modified version of the YOLOX GitHub repository is here, where you can find the modified evaluation script and the TVM tuning script.
The GPU I’m using is an NVIDIA GeForce RTX 3060 Laptop GPU.
Here are some of the results I have gotten. For both YOLOX-M and YOLOX-TINY, I list the base model, YOLOX run with TensorRT, and YOLOX run with TVM.
|Model|YOLOX-M fp32|YOLOX-M fp16|YOLOX-M TRT fp32|YOLOX-M TRT fp16|YOLOX-M TVM fp32|YOLOX-M TVM fp16|
|---|---|---|---|---|---|---|
|forward time|17.72 ms|12.13 ms|14.58 ms|6.38 ms|14.20 ms|14.15 ms|
|Model|YOLOX-TINY fp32|YOLOX-TINY fp16|YOLOX-TINY TRT fp32|YOLOX-TINY TRT fp16|YOLOX-TINY TVM fp32|YOLOX-TINY TVM fp16|
|---|---|---|---|---|---|---|
|forward time|6.18 ms|6.70 ms|3.86 ms|2.24 ms|4.19 ms|5.57 ms|
As you can see, fp16 offers no noticeable speedup with TVM; for YOLOX-TINY it is actually slower than fp32.
Let me know if I can provide any more useful information!