[CUDA] Enable half2 in CUDA injective schedule

I’ve been working on AMP models on CUDA recently and found that half2 is not always enabled in the injective schedule, depending on the configured block and thread sizes. I made a script for benchmarking, and this is the output on a T4 (max thread number = 1024):

   FP32Mul_FP16Add, block=256, thread=1024, prod=262144: 0.1273ms, use-half2? False
   FP32Mul_FP16Add, block=128, thread=1024, prod=131072: 0.1363ms, use-half2? False
   FP32Mul_FP16Add, block= 64, thread=1024, prod= 65536: 0.1012ms, use-half2? True
   FP32Mul_FP16Add, block=256, thread= 512, prod=131072: 0.1167ms, use-half2? False
   FP32Mul_FP16Add, block=128, thread= 512, prod= 65536: 0.1023ms, use-half2? True
   FP32Mul_FP16Add, block= 64, thread= 512, prod= 32768: 0.0993ms, use-half2? True --> best
==============
   FP16Mul_FP16Add, block=256, thread=1024, prod=262144: 0.1260ms, use-half2? False
   FP16Mul_FP16Add, block=128, thread=1024, prod=131072: 0.1364ms, use-half2? False
   FP16Mul_FP16Add, block= 64, thread=1024, prod= 65536: 0.1013ms, use-half2? True
   FP16Mul_FP16Add, block=256, thread= 512, prod=131072: 0.1181ms, use-half2? False
   FP16Mul_FP16Add, block=128, thread= 512, prod= 65536: 0.1024ms, use-half2? True
   FP16Mul_FP16Add, block= 64, thread= 512, prod= 32768: 0.0994ms, use-half2? True --> best
==============
           FP16Mul, block=256, thread=1024, prod=262144: 0.1266ms, use-half2? False
           FP16Mul, block=128, thread=1024, prod=131072: 0.1357ms, use-half2? False
           FP16Mul, block= 64, thread=1024, prod= 65536: 0.0993ms, use-half2? True --> best
           FP16Mul, block=256, thread= 512, prod=131072: 0.1246ms, use-half2? False
           FP16Mul, block=128, thread= 512, prod= 65536: 0.1007ms, use-half2? True
           FP16Mul, block= 64, thread= 512, prod= 32768: 0.0994ms, use-half2? True
==============
              Cast, block=256, thread=1024, prod=262144: 0.0876ms, use-half2? False
              Cast, block=128, thread=1024, prod=131072: 0.0938ms, use-half2? False
              Cast, block= 64, thread=1024, prod= 65536: 0.0670ms, use-half2? True
              Cast, block=256, thread= 512, prod=131072: 0.0802ms, use-half2? False
              Cast, block=128, thread= 512, prod= 65536: 0.0629ms, use-half2? True --> best
              Cast, block= 64, thread= 512, prod= 32768: 0.0638ms, use-half2? True
==============

I got two insights from the results:

  1. The best configurations are always the ones that enable half2.
  2. In my workload with input tensor (768, 3072), it seems that half2 can be used when block * thread <= 65536.

Although we could make the injective schedule tunable and find the best configuration with AutoTVM/Ansor, I’m looking for improvements to the injective schedule itself so that it makes use of half2 whenever possible. However, since I’m not really a CUDA/GPU expert, I’m not sure whether these insights are generally applicable.

The benchmark script is available here: benchmark_cuda_injective_schedule.py · GitHub
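
For reference, below is a minimal sketch of the kind of check the script performs, not a copy of it; the elementwise op and the schedule entry point (topi.multiply, topi.cuda.schedule_injective) are my assumptions based on the standard TOPI APIs.

    # Hedged sketch, not the linked script: build an fp16 elementwise op,
    # schedule it with the CUDA injective schedule, and check whether the
    # generated kernel source actually contains half2.
    import tvm
    from tvm import te, topi

    A = te.placeholder((768, 3072), dtype="float16", name="A")
    B = te.placeholder((768, 3072), dtype="float16", name="B")
    C = topi.multiply(A, B)

    with tvm.target.Target("cuda"):
        s = topi.cuda.schedule_injective(C)
    mod = tvm.build(s, [A, B, C], target="cuda")

    cuda_src = mod.imported_modules[0].get_source()
    print("use-half2?", "half2" in cuda_src)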

@wpan11nv @vinx13 @Laurawly @masahi @AndrewZhaoLuo do you have any suggestions on this? Thanks.

Looks like when the block/thread sizes are too large, the split introduces if-conditions and vectorization then fails.
We can change the vector width here to 2 since we’d like to use half2: benchmark_cuda_injective_schedule.py · GitHub.
In your case, 768 * 3072 / 65536 / 4 == 9, so for any larger block * thread product the inner extent will not be divisible.
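
To spell out the arithmetic with a quick sketch (using the shape and configurations from the benchmark above): the inner loop that gets vectorized has extent shape_prod / (block * thread * vec_width), and vectorization only survives when that extent is an integer.

    # Quick divisibility check for the benchmark configurations: a non-integer
    # inner extent means the split pads the loop and guards the body with an
    # if-condition, which defeats vectorization (and hence half2).
    shape_prod = 768 * 3072  # 2,359,296 elements
    vec_width = 4            # current vector width in the injective schedule

    for block, thread in [(256, 1024), (128, 1024), (64, 1024),
                          (256, 512), (128, 512), (64, 512)]:
        inner = shape_prod / (block * thread * vec_width)
        ok = inner.is_integer()
        print(f"block={block:3d}, thread={thread:4d}: inner extent = {inner:5.2f}"
              f" -> {'vectorized' if ok else 'if-guarded'}")

This reproduces exactly the use-half2 True/False pattern in the benchmark output above.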

Thanks for the explanation, which makes sense to me. So formally, we require shape_prod / block_thread_prod / vec_width to be an integer to avoid the if-condition. In addition to changing the vector width to 2, do you think we should also make sure this is divisible when selecting the block and thread sizes for float16?

Yes, we should try to use divisible block/thread sizes, or at least block/thread sizes such that the condition always holds for the inner loop, so that it can be lifted outside the vectorized loop.
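
A toy sketch of what that selection could look like; the helper name and the candidate lists are hypothetical, not the actual schedule code:

    # Hypothetical selection logic: take the first (largest) launch configuration
    # whose vectorized inner extent divides evenly, so no if-guard is needed.
    def pick_launch_config(shape_prod, max_threads=1024, vec_width=2,
                           candidate_blocks=(256, 128, 64)):
        for threads in (max_threads, max_threads // 2):
            for blocks in candidate_blocks:
                if shape_prod % (blocks * threads * vec_width) == 0:
                    return blocks, threads
        return None  # fall back to the unvectorized schedule

    print(pick_launch_config(768 * 3072))  # -> (128, 1024) with vec_width=2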