Currently, float16 support for CUDA is incomplete, both functionally and performance-wise. There are a few posts that suggest ways to deal with the functional aspect, but these are not merged in yet. This post is about the second part: performance.
I was reading this paper: https://www.comp.nus.edu.sg/~wongwf/papers/hpec17.pdf

It compares the `half2` and `half` data types. `half2` is basically a `float16x2`: two float16 values packed into a single 32-bit register. It seems that FP16 on CUDA only gives a speedup when we use the `half2` data type, which signals the hardware to perform two float16 operations simultaneously.
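To make the idea concrete, here is a minimal sketch (my own, not from the paper) of a vectorized `half2` kernel next to its scalar `half` equivalent. The kernel names and the even-length assumption are mine; the intrinsics are from `cuda_fp16.h` and need compute capability 5.3+:

```cuda
#include <cuda_fp16.h>

// Element-wise add over n float16 values, processed two at a time.
// Assumes the element count is even and pointers are aligned for half2.
__global__ void add_half2(const __half2* a, const __half2* b,
                          __half2* out, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        // __hadd2 adds both fp16 lanes of the half2 operands
        // in a single instruction.
        out[i] = __hadd2(a[i], b[i]);
    }
}

// Scalar-half version for comparison: each thread handles one
// fp16 value, so twice as many threads and instructions are needed.
__global__ void add_half(const __half* a, const __half* b,
                         __half* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __hadd(a[i], b[i]);
    }
}
```

Compiled with something like `nvcc -arch=sm_60`, the `half2` version issues half as many arithmetic instructions for the same number of elements, which, as I understand it, is where the roughly 2x FP16 throughput reported in the paper comes from.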
Has anybody prototyped this before? Or does anyone have an idea of how to make this happen?