Expected performance for ToMixedPrecision Float16

I have been looking into float16 inference in TVM, converting an existing float32 model within TVM itself rather than importing an already-converted model from an external framework.

@AndrewZhaoLuo has some nice code that I’ve been using to test FP16.

However, when running on an NVIDIA AGX Xavier I find that I'm getting a slowdown on both the CPU and the GPU. I also tested on an Intel CPU and saw slowdowns there too.

The main project I'm working from is a modified TVM v0.8, but I also tested on an unmodified v0.10 and saw the same slowdowns.

I’m just looking at CNN models right now.

The initial PR says that speedups were not initially achieved, but it was later edited to show that some speedups did happen.

Are my slowdowns (~1.8x on CPU, ~1.05x on GPU) expected right now? I see the same result with fast-math enabled and disabled, and I'm using all of the passes from the linked sandbox test, including `ToMixedPrecision()(mod)`, plus my regular `opt_level=3` compilation. Is that the right approach?
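
For reference, here's a minimal sketch of the conversion path I'm describing (the toy one-layer model here is just a stand-in for an existing float32 Relay module, and the real sandbox test runs extra cleanup passes around this):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.relay.transform import InferType, ToMixedPrecision

# A tiny stand-in for "an existing float32 model": one conv2d layer.
data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
out = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
params = {"weight": np.random.rand(16, 3, 3, 3).astype("float32")}

# Type inference must run before ToMixedPrecision so dtypes are known.
mod = InferType()(mod)
# Rewrite eligible ops to float16 per each op's registered policy.
mod = ToMixedPrecision("float16")(mod)

# Build with the usual opt_level=3 pipeline; "llvm" for CPU, "cuda" for GPU.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
```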

FP16 on x86 is super slow (it's software-emulated), and on CUDA it's no faster than fp32 unless you use tensor cores.
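
To illustrate (the arch string here is just an example for an Xavier-class GPU): picking a tensor-core-capable compute capability in the target does not by itself make TVM emit tensor core instructions; the schedules also have to be tensorized, e.g. via tuned tensor-core templates.

```python
import tvm

# Xavier's integrated Volta GPU is compute capability sm_72, which does
# have tensor cores. But building with this target alone doesn't make TVM
# use them; the conv/dense schedules must be tensorized (e.g. tensor-core
# templates found during tuning). With default schedules, fp16 math runs
# on the ordinary CUDA cores at roughly fp32 speed.
target = tvm.target.Target("cuda -arch=sm_72")
print(target)
```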

Good to know, thanks! I just wanted to verify that I wasn’t doing anything wrong on my side!