You are right. Thank you for figuring out the bug.
That would be my fault: I focused on the classical workloads (e.g. ResNet) but forgot to test large shapes. It’s easy to fix. Can you please create a PR?
Thanks for your efforts on supporting TensorCore in TVM.
I have tuned TensorCore kernels on classical networks such as ResNet-50 and VGG-16 (batch size 32), and the tensor_precision_fu_utilization metric reported by nvprof shows Mid/Low utilization of the Tensor Cores:
Metric: tensor_precision_fu_utilization (Tensor-Precision Function Unit Utilization)

Kernel                                   Invocations   Min        Max        Avg
fused_nn_conv2d_add_nn_relu_2_kernel0    2             Mid (4)    Mid (4)    Mid (4)
fused_nn_softmax_kernel3                 2             Idle (0)   Idle (0)   Idle (0)
fused_nn_conv2d_add_nn_relu_3_kernel0    4             Mid (4)    Mid (4)    Mid (4)
fused_nn_conv2d_add_nn_relu_4_kernel0    2             Mid (4)    Mid (4)    Mid (4)
fused_nn_batch_flatten_kernel0           2             Idle (0)   Idle (0)   Idle (0)
fused_nn_conv2d_add_nn_relu_5_kernel0    2             Mid (4)    Mid (4)    Mid (4)
fused_nn_conv2d_add_nn_relu_6_kernel0    2             Mid (4)    Mid (4)    Mid (4)
fused_nn_dense_add_kernel0               2             Low (2)    Low (2)    Low (2)
fused_nn_conv2d_add_nn_relu_7_kernel0    2             Low (3)    Low (3)    Low (3)
fused_nn_conv2d_add_nn_relu_8_kernel0    2             Idle (0)   Idle (0)   Idle (0)
fused_nn_conv2d_add_nn_relu_kernel0      (values missing from the paste)
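(These numbers come from nvprof --metrics tensor_precision_fu_utilization; the three value columns are the Min/Max/Avg across each kernel's invocations.)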
But when I use cuDNN as the backend, the utilization is always High.
It seems there is still a lot of room for further optimization. Do you have any idea how to get higher utilization of the Tensor Cores?
Yes, I agree that TVM on Tensor Core GPUs still has a lot of room to optimize. Currently we are optimizing the data path between global memory and registers, which we believe is a major bottleneck, and we are experimenting with different layouts for both feature maps and weights. We have found that weights in the ‘HWOI’ layout, as suggested by @Hzfengsy, do improve performance for int8 inference on Tensor Cores.
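If anyone wants to try the same layout experiment, a minimal sketch with Relay's ConvertLayout pass looks roughly like this (the toy int8 conv2d is just for illustration, and whether a given kernel layout is supported may depend on your TVM version):

```python
import tvm
from tvm import relay

# Toy int8 conv2d in the default NCHW/OIHW layout; a stand-in for a real model.
data = relay.var("data", shape=(1, 16, 56, 56), dtype="int8")
weight = relay.var("weight", shape=(32, 16, 3, 3), dtype="int8")
conv = relay.nn.conv2d(data, weight, channels=32, kernel_size=(3, 3),
                       padding=(1, 1), out_dtype="int32")
mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

# Rewrite conv2d to NHWC activations with HWOI weights; layout_transform ops
# are inserted at the boundaries automatically.
desired_layouts = {"nn.conv2d": ["NHWC", "HWOI"]}
with tvm.transform.PassContext(opt_level=3):
    mod = relay.transform.ConvertLayout(desired_layouts)(mod)
print(mod)
```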
I am not sure whether my post here is meaningful, but here are the conclusions from my recent tests:
1. fp16 is not always faster than fp32. With some models fp16 is faster, but there are also models where fp32 runs faster than fp16 (both after tuning, with 2000 trials per task).
2. TVM fp16 inference is slower than TensorRT fp16 inference. On my platform, my model achieves about 35 fps with TVM but 58 fps with TensorRT.
I don't know what I missed. I am looking forward to tutorials on using TVM for fp16 inference.
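In case it is useful for comparison, this is roughly the kind of conversion I mean by "fp16 mode" (a sketch assuming a TVM build that ships the ToMixedPrecision pass; the toy dense module is just a placeholder for a real model imported via relay.frontend.from_onnx):

```python
import tvm
from tvm import relay

# Toy fp32 dense, standing in for an imported ONNX model.
x = relay.var("x", shape=(1, 1024), dtype="float32")
w = relay.var("w", shape=(1000, 1024), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x, w], relay.nn.dense(x, w)))

# Rewrite eligible ops to fp16 (accumulation stays fp32 by default).
with tvm.transform.PassContext(opt_level=3):
    mod = relay.transform.ToMixedPrecision("float16")(mod)
print(mod)
```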
If you use TVM to compile operators whose shapes meet the TensorCore requirements, then TensorCore kernels should be selected automatically on T4. So I guess you used some operators whose shapes don't qualify (like batch=1)?
Yes, I used batch size 1, and I compiled my model through ONNX rather than as a single operator. I compiled the same model with TensorRT, and the TensorRT speed is much faster than TVM (fp16 mode). Maybe we should wait for more TVM updates and optimizations.
OK, in the batch=1 case, due to the current TVM TOPI implementation, the operator cannot be optimized with TensorCore. Other libraries (such as TensorRT) can still use TensorCore in this case through im2col.
You can make your conv2d operators meet the shape conditions sketched below to get TensorCore optimization in TVM.
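For reference, the shape conditions are along these lines (a paraphrase of the check in topi/cuda/conv2d_nhwc_tensorcore.py; the helper name is mine, and the exact condition may differ between TVM versions):

```python
def can_use_tensorcore(batch, in_channels, out_channels):
    # The (batch, in_channels, out_channels) dims must tile into one of the
    # wmma fragment shapes: 16x16x16, 8x16x32, or 32x16x8.
    return (
        (batch % 16 == 0 and in_channels % 16 == 0 and out_channels % 16 == 0)
        or (batch % 8 == 0 and in_channels % 16 == 0 and out_channels % 32 == 0)
        or (batch % 32 == 0 and in_channels % 16 == 0 and out_channels % 8 == 0)
    )

print(can_use_tensorcore(1, 64, 256))   # False: batch=1 cannot tile a fragment
print(can_use_tensorcore(16, 64, 256))  # True
```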
The runtime will expect the input to match the compiled lib. Also, the compiler requires a constant batch size value.
Not sure if using the VM may be of help to you? I have not used it myself, and I have the impression that it may not address your problem, but it's worth looking into.
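Roughly, the idea would be something like the sketch below (untested on my side; the toy dense workload is just a placeholder):

```python
import tvm
from tvm import relay
from tvm.relay import vm

# Toy dense with a symbolic batch dimension, standing in for a real model.
batch = relay.Any()
x = relay.var("x", shape=(batch, 1024), dtype="float16")
w = relay.var("w", shape=(1000, 1024), dtype="float16")
mod = tvm.IRModule.from_expr(relay.Function([x, w], relay.nn.dense(x, w)))

# relay.build rejects symbolic shapes; the Relay VM compiler accepts them.
vm_exec = vm.compile(mod, target="cuda")
```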
Hi, I am interested in TVM's performance for the conv2d operator on TensorCore. I experimented on V100 and T4 platforms using the schedule template in ‘topi/cuda/conv2d_nhwc_tensorcore’. Results show that AutoTVM never performs better than cuDNN on six commonly used shapes in float16 mode. In some cases (like conv2d_nhwc_32_56_56_256_3_3_64_1_0), AutoTVM's tuned results achieve only about 50% of cuDNN's performance. I wonder whether there exist some cases (shapes or data layouts) in which AutoTVM performs better than cuDNN? Or can the template be further optimized?
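For context, this is roughly how I create the tuning task for one of these shapes (the placeholder shapes and argument order follow the conv2d_nhwc_tensorcore.cuda registration as I understand it; please double-check against your TVM version):

```python
import tvm
from tvm import autotvm, te

# One representative NHWC fp16 workload (data in NHWC, kernel in HWIO).
data = te.placeholder((32, 56, 56, 64), name="data", dtype="float16")
kernel = te.placeholder((3, 3, 64, 256), name="kernel", dtype="float16")

# args = (data, kernel, strides, padding, dilation, out_dtype)
task = autotvm.task.create(
    "conv2d_nhwc_tensorcore.cuda",
    args=(data, kernel, (1, 1), (1, 1), (1, 1), "float16"),
    target="cuda",
)
print(task.config_space)
```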