I’ve been working on AMP models on CUDA recently and found that half2 is not always enabled in the injective schedule, depending on the block size and thread count. I wrote a benchmark script, and this is its output on a T4 (max threads per block = 1024):
FP32Mul_FP16Add, block=256, thread=1024, prod=262144: 0.1273ms, use-half2? False
FP32Mul_FP16Add, block=128, thread=1024, prod=131072: 0.1363ms, use-half2? False
FP32Mul_FP16Add, block= 64, thread=1024, prod= 65536: 0.1012ms, use-half2? True
FP32Mul_FP16Add, block=256, thread= 512, prod=131072: 0.1167ms, use-half2? False
FP32Mul_FP16Add, block=128, thread= 512, prod= 65536: 0.1023ms, use-half2? True
FP32Mul_FP16Add, block= 64, thread= 512, prod= 32768: 0.0993ms, use-half2? True --> best
==============
FP16Mul_FP16Add, block=256, thread=1024, prod=262144: 0.1260ms, use-half2? False
FP16Mul_FP16Add, block=128, thread=1024, prod=131072: 0.1364ms, use-half2? False
FP16Mul_FP16Add, block= 64, thread=1024, prod= 65536: 0.1013ms, use-half2? True
FP16Mul_FP16Add, block=256, thread= 512, prod=131072: 0.1181ms, use-half2? False
FP16Mul_FP16Add, block=128, thread= 512, prod= 65536: 0.1024ms, use-half2? True
FP16Mul_FP16Add, block= 64, thread= 512, prod= 32768: 0.0994ms, use-half2? True --> best
==============
FP16Mul, block=256, thread=1024, prod=262144: 0.1266ms, use-half2? False
FP16Mul, block=128, thread=1024, prod=131072: 0.1357ms, use-half2? False
FP16Mul, block= 64, thread=1024, prod= 65536: 0.0993ms, use-half2? True --> best
FP16Mul, block=256, thread= 512, prod=131072: 0.1246ms, use-half2? False
FP16Mul, block=128, thread= 512, prod= 65536: 0.1007ms, use-half2? True
FP16Mul, block= 64, thread= 512, prod= 32768: 0.0994ms, use-half2? True
==============
Cast, block=256, thread=1024, prod=262144: 0.0876ms, use-half2? False
Cast, block=128, thread=1024, prod=131072: 0.0938ms, use-half2? False
Cast, block= 64, thread=1024, prod= 65536: 0.0670ms, use-half2? True
Cast, block=256, thread= 512, prod=131072: 0.0802ms, use-half2? False
Cast, block=128, thread= 512, prod= 65536: 0.0629ms, use-half2? True --> best
Cast, block= 64, thread= 512, prod= 32768: 0.0638ms, use-half2? True
==============
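For context, here is a minimal sketch (my own reconstruction, not the linked script) of the kind of check the benchmark performs: build an FP16 elementwise multiply with the CUDA injective schedule and look for half2 in the generated kernel source. The shape and dtype match the workload discussed below; everything else is an assumption.

import tvm
from tvm import te, topi

# Assumed reconstruction, not the author's benchmark script: build an FP16
# elementwise multiply with the CUDA injective schedule and check whether
# the generated kernel source actually uses half2.
A = te.placeholder((768, 3072), dtype="float16", name="A")
B = te.placeholder((768, 3072), dtype="float16", name="B")
C = topi.multiply(A, B)  # the FP16Mul case from the table above

target = tvm.target.Target("cuda")
with target:  # schedule_injective reads thread limits from the current target
    s = topi.cuda.schedule_injective(C)

mod = tvm.build(s, [A, B, C], target)
cuda_src = mod.imported_modules[0].get_source()
print("use-half2?", "half2" in cuda_src)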
I got two insights from the results:
- The best configurations are always the ones that enable half2.
- In my workload with an input tensor of shape (768, 3072), it seems half2 can be used when block x thread <= 65536.
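As a sanity check on the second point: 768 x 3072 = 2,359,296 fp16 elements, and 2,359,296 / 65,536 = 36, so at the threshold each thread handles exactly 36 elements, i.e. 18 half2 pairs if the fused loop is split evenly across threads.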
Although we could make the injective schedule tunable and find the best configuration with AutoTVM/Ansor, I’m looking for improvements to the default injective schedule so that it uses half2 whenever possible (one possible direction is sketched below). However, since I’m not really a CUDA/GPU expert, I’m not sure whether these insights are generally applicable.
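To make the idea concrete, here is a hypothetical sketch (my own function name and factors, not the actual topi implementation): peel off a factor-2 vector lane for fp16 first, then cap block x thread at the 65536 threshold observed above and let each thread loop serially over the rest.

import tvm
from tvm import te

def schedule_injective_fp16_sketch(s, out, num_thread=512, max_block=128):
    # Hypothetical sketch, not the actual topi schedule. Fuse all axes,
    # peel off a factor-2 inner axis and vectorize it so fp16 loads/stores
    # can become half2.
    fused = s[out].fuse(*s[out].op.axis)
    fused, v = s[out].split(fused, factor=2)
    s[out].vectorize(v)
    # Cap block x thread at max_block * num_thread (128 * 512 = 65536 here);
    # the remaining outer extent becomes a serial loop inside each thread.
    xo, xi = s[out].split(fused, factor=max_block * num_thread)
    bx, tx = s[out].split(xi, factor=num_thread)
    s[out].reorder(bx, tx, xo)
    s[out].bind(bx, te.thread_axis("blockIdx.x"))
    s[out].bind(tx, te.thread_axis("threadIdx.x"))
    return s

The caller would create the schedule with s = te.create_schedule(C.op) and pass the output tensor. With this layout adjacent threads in a warp touch adjacent half2 pairs, so accesses should stay coalesced, but whether it actually beats the current schedule is exactly what I’d like feedback on.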
The benchmark script is available here: benchmark_cuda_injective_schedule.py · GitHub
@wpan11nv @vinx13 @Laurawly @masahi @AndrewZhaoLuo do you have any suggestions on this? Thanks.