Hi, @Hzfengsy @Shawn_Inspur

Thanks for your efforts on supporting TensorCore in TVM.

I have tuned TensorCore on classical networks such as resnet50 and vgg16 (batch size 32). The tensor_precision_fu_utilization metric reported by nvprof shows that I only get Mid/Low utilization of TensorCore:

```
Kernel: fused_nn_conv2d_add_nn_relu_2_kernel0
2 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) Mid (4) Mid (4)
Kernel: fused_nn_softmax_kernel3
2 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
Kernel: fused_nn_conv2d_add_nn_relu_3_kernel0
4 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) Mid (4) Mid (4)
Kernel: fused_nn_conv2d_add_nn_relu_4_kernel0
2 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) Mid (4) Mid (4)
Kernel: fused_nn_batch_flatten_kernel0
2 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
Kernel: fused_nn_conv2d_add_nn_relu_5_kernel0
2 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) Mid (4) Mid (4)
Kernel: fused_nn_conv2d_add_nn_relu_6_kernel0
2 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) Mid (4) Mid (4)
Kernel: fused_nn_dense_add_kernel0
2 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Low (2) Low (2) Low (2)
Kernel: fused_nn_conv2d_add_nn_relu_7_kernel0
2 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Low (3) Low (3) Low (3)
Kernel: fused_nn_conv2d_add_nn_relu_8_kernel0
2 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
Kernel: fused_nn_conv2d_add_nn_relu_kernel0
```

But when I use cuDNN as the backend, the utilization is always High.
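
For anyone who wants to compare the two backends kernel by kernel, here is a rough Python sketch that collects the utilization level per kernel from nvprof text output. It assumes the layout shown in the log above (a `Kernel:` line followed by one metric line ending in the Min/Max/Avg levels) and that the level labels are Idle/Low/Mid/High/Max. The `summarize` helper and the `sample` string are illustrative only and are not part of TVM or nvprof.

```python
import re
from collections import Counter

# Assumed format: the metric line ends with a level such as "Mid (4)".
# The last of the three levels (Min, Max, Avg) is taken as the average.
LEVEL = re.compile(r"(Idle|Low|Mid|High|Max) \((\d+)\)\s*$")

def summarize(text):
    """Map each kernel name to its (label, score) average utilization."""
    result = {}
    kernel = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Kernel:"):
            kernel = line[len("Kernel:"):].strip()
        elif kernel and "tensor_precision_fu_utilization" in line:
            match = LEVEL.search(line)
            if match:
                result[kernel] = (match.group(1), int(match.group(2)))
            kernel = None
    return result

# Three entries copied from the log above, used as sample input.
sample = """\
Kernel: fused_nn_conv2d_add_nn_relu_2_kernel0
2 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) Mid (4) Mid (4)
Kernel: fused_nn_softmax_kernel3
2 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
Kernel: fused_nn_dense_add_kernel0
2 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Low (2) Low (2) Low (2)
"""

levels = summarize(sample)
counts = Counter(label for label, _ in levels.values())

for name, (label, score) in levels.items():
    print(f"{name}: {label} ({score})")
```

Running the same script on the TVM log and on the cuDNN log would show which conv2d kernels differ most in utilization.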

It seems that there is still a lot of room for further optimization. Do you have any ideas on how to get higher TensorCore utilization?