Volta architecture graphics cards are equipped with Tensor Cores, which greatly increase compute power compared with Pascal architecture cards. The peak performance of the Tensor Cores can reach 112 TFLOPS (125 TFLOPS with NVLink), nearly 8-9x the performance of Pascal cards. Thus I propose adding the CUDA Tensor Core API to TVM. Moreover, to implement a convolution layer with Tensor Cores, I believe we should provide TVM with a facility that allows us to include our own header files when generating CUDA kernels.
Action Needed
1. Add new Python APIs so that one can declare and call the CUDA Tensor Core related data structures and APIs, namely: wmma::fragment, wmma::load_matrix_sync, wmma::fill_fragment, wmma::mma_sync, and wmma::store_matrix_sync.
2. Add a Python API that enables one to include specific C/C++ header files. In my case, I need to handle matrix loading from global memory to shared memory in an elaborately designed order.
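For reference, the five wmma APIs listed in item 1 compose as follows in a plain CUDA kernel. This is a minimal sketch of a single-warp 16x16x16 half-precision GEMM tile (kernel name and tile shape are illustrative); it is the kind of code TVM would ultimately need to emit:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 output tile: c = a * b.
// a is 16x16 row-major half, b is 16x16 col-major half, c is 16x16 float.
__global__ void wmma_gemm_tile(const half *a, const half *b, float *c) {
    // wmma::fragment is a template class, so it must be declared in C++
    // source; it cannot be constructed through call_packed.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);       // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c_frag += a * b
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

All five calls are warp-collective: every thread in the warp executes them together, which is what makes the codegen interaction with TVM's thread model nontrivial.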
Calling the wmma APIs can be difficult. wmma::fragment is a template class: we declare a fragment first and then call the other APIs to load data into it. Therefore call_packed does not work here. I think we need a way to generate / insert C++ code snippets into the generated CUDA kernel.
This problem is a little bit tricky and cannot simply be handled with tensorization. Each wmma API requires 32 threads (a warp) to execute with the same address. Tensorizing the result will cause trouble either with the number of threads or with the value of the address.
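To make the warp-uniform address requirement concrete, here is a hedged sketch (kernel and buffer names are illustrative) of a warp-collective copy through an accumulator fragment. The base pointer is derived from the warp index only, so all 32 lanes of a warp pass the same address; a naive per-thread tensorize would instead emit an address that varies with threadIdx.x:

```cuda
#include <mma.h>
using namespace nvcuda;

// Each warp copies one 16x16 float tile from `in` to `out`.
__global__ void warp_collective_copy(const float *in, float *out) {
    // Same value for all 32 lanes of a warp.
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;

    wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag;

    // Correct: the pointer depends only on warp_id, so it is identical
    // across the warp, as load_matrix_sync requires.
    wmma::load_matrix_sync(frag, in + warp_id * 256, 16, wmma::mem_row_major);
    wmma::store_matrix_sync(out + warp_id * 256, frag, 16, wmma::mem_row_major);

    // Wrong (what per-thread tensorization would produce): an address such
    // as `in + threadIdx.x * 16` varies within the warp and violates the
    // warp-uniform requirement.
}
```

This is why the wmma intrinsics have to be modeled as warp-level operations in TVM's IR rather than as 32 independent per-thread calls.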