Motivation
Volta-architecture graphics cards are equipped with tensor cores, which greatly increase compute throughput compared with Pascal-architecture graphics cards. The peak performance of the tensor cores can be as high as 112 TFLOPS (125 TFLOPS with NVLink), which is nearly 8-9 times the performance of Pascal-architecture graphics cards. I therefore propose adding the CUDA Tensor Core API to TVM. In addition, to implement a convolution layer with tensor cores, I believe TVM should provide a way for us to include our own header files when generating a CUDA kernel.
Action Needed
1. Add new Python APIs so that one can declare and call the CUDA tensor core data structures and functions: wmma::fragment, wmma::load_matrix_sync, wmma::fill_fragment, wmma::mma_sync, and wmma::store_matrix_sync.
2. Add a Python API that enables one to include specific C/C++ header files. In my case, I need to load matrices from global memory into shared memory in a carefully designed order.
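
To make the two items concrete, here is a minimal sketch in plain Python of what the generated source could end up looking like. It does not use or represent any existing TVM API: `prepend_includes` is a hypothetical helper, and `my_smem_loader.h` is a made-up name standing in for a user header such as the shared-memory loading routine mentioned above. The embedded kernel only illustrates the five WMMA calls listed in item 1, for a single 16x16x16 tile handled by one warp.

```python
# Hypothetical sketch, not TVM code: shows generated CUDA source that uses
# the WMMA calls from item 1 and gets user headers prepended as in item 2.

WMMA_KERNEL = r"""
using namespace nvcuda;

// One warp computes a single 16x16 tile: C = A * B.
extern "C" __global__ void wmma_tile(const half *a, const half *b, float *c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
  wmma::load_matrix_sync(a_frag, a, 16);           // leading dimension 16
  wmma::load_matrix_sync(b_frag, b, 16);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c_frag += a_frag * b_frag
  wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
"""


def prepend_includes(kernel_src, headers):
    """Return kernel_src with one #include line per header placed on top.

    Names already wrapped in angle brackets are emitted as-is;
    anything else is treated as a user header and quoted.
    """
    lines = []
    for header in headers:
        if header.startswith("<") and header.endswith(">"):
            lines.append("#include " + header)
        else:
            lines.append('#include "%s"' % header)
    return "\n".join(lines) + "\n" + kernel_src


# "my_smem_loader.h" is a placeholder name for a user-provided header.
source = prepend_includes(
    WMMA_KERNEL, ["<cuda_fp16.h>", "<mma.h>", "my_smem_loader.h"]
)
```

The resulting `source` string would still need to be compiled for a Volta target (for example with `-arch=sm_70`), and the user header would have to be on the compiler's include path. How TVM would expose either step is exactly what this proposal leaves open.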