Currently, float16 support for CUDA is incomplete, both functionally and performance-wise. There are a few posts that suggest ways to deal with the functional aspect, but these are not merged in yet. This post deals with the second portion - performance.
This one concerns the half2 vs. half data types. half2 is basically float16x2. It seems that we can get a speedup from FP16 on CUDA only when we use the half2 datatype, which signals the hardware to perform two float16 operations simultaneously.
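For reference, here is a minimal hand-written CUDA sketch (not TVM-generated code) of the difference, assuming a simple element-wise add. The scalar kernel issues one fp16 add per element, while the half2 kernel packs two fp16 values per 32-bit load and uses a single packed intrinsic:

```cuda
#include <cuda_fp16.h>

// Scalar version: one float16 add per element.
__global__ void add_half(const __half* a, const __half* b, __half* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    c[i] = __hadd(a[i], b[i]);
  }
}

// Vectorized version: each thread loads a half2 (two packed float16 values)
// and issues a single __hadd2, so the hardware performs both adds at once.
__global__ void add_half2(const __half2* a, const __half2* b, __half2* c, int n2) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n2) {  // n2 = n / 2 packed pairs
    c[i] = __hadd2(a[i], b[i]);
  }
}
```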
Has anybody prototyped this before? Or does anyone have an idea of how to make this happen?
Just to echo @janimesh's note about performance, I ran some PyTorch code and the equivalent TVM-generated code and compared their float32 vs. float16 performance.
Switching to float16 gave the following speedups:
PyTorch: 1.88x
TVM-generated code: 1.17x (significantly lower than PyTorch)
Thanks @ibeltagy for sharing the observation. We need to dive deep into the TVM CUDA schedules to understand this. Currently, I am not aware of what needs to go in to get a speedup.
Unfortunately, I am busy with other portions of the TVM project. Also, I am not familiar enough with CUDA schedules and codegen to quickly come up with a list of tasks.
It would be very helpful if we could get someone who has worked on CUDA schedules to come up with a rough plan; then we can parallelize the efforts.
The first thing (codegen) is already supported. For the second part, the current topi operators support fp16 but do not guarantee good performance. One thing we need to do is support the vectorized half2 type: we need to update the CUDA codegen to generate half2 in the emitted code. A rough sketch of what that could look like is below.
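As an illustration only (the kernel name and parameters here are made up, not what TVM currently emits), this is roughly the kind of kernel body the CUDA codegen would need to produce when an fp16 inner axis is vectorized by 2: the half buffers get reinterpreted as half2 and the arithmetic uses the packed intrinsics.

```cuda
#include <cuda_fp16.h>

// Hypothetical codegen output for out = alpha * x + y with the fp16 axis
// vectorized by a factor of 2.
__global__ void scale_add_half2(__half* __restrict__ x,
                                __half* __restrict__ y,
                                __half* __restrict__ out,
                                __half alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i * 2 + 1 < n) {
    __half2 alpha2 = __half2half2(alpha);           // broadcast the scalar into both lanes
    __half2 xv = reinterpret_cast<__half2*>(x)[i];  // 32-bit vectorized load of two halves
    __half2 yv = reinterpret_cast<__half2*>(y)[i];
    // One __hfma2 performs two fp16 fused multiply-adds in a single instruction.
    reinterpret_cast<__half2*>(out)[i] = __hfma2(alpha2, xv, yv);
  }
}
```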