Volta architecture graphics cards are equipped with Tensor Cores, which greatly increase compute power compared with Pascal architecture cards. The peak performance of the Tensor Cores can reach 112 TFLOPS (125 TFLOPS with NVLink), nearly 8-9x the performance of Pascal cards. Thus I propose adding the CUDA Tensor Core API to TVM. Moreover, to implement a convolution layer with Tensor Cores, I believe we should provide TVM with a facility that allows us to include our own header files when generating CUDA kernels.
Action Needed
1. Add new Python APIs so that one can declare and call the CUDA Tensor Core related data structures and APIs, namely: wmma::fragment, wmma::load_matrix_sync, wmma::fill_fragment, wmma::mma_sync, and wmma::store_matrix_sync.
2. Add a Python API that enables one to include specific C/C++ header files. In my case, I need to handle matrix loading from global memory to shared memory in an elaborately designed order.
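For reference, the five wmma APIs listed in item 1 compose as follows in a plain CUDA kernel. This is a minimal sketch of a single-warp 16x16x16 half-precision GEMM tile (kernel name and tile shape are illustrative); it is the kind of code TVM would ultimately need to emit:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 output tile: c = a * b.
// a is 16x16 row-major half, b is 16x16 col-major half, c is 16x16 float.
__global__ void wmma_gemm_tile(const half *a, const half *b, float *c) {
    // wmma::fragment is a template class, so it must be declared in C++
    // source; it cannot be constructed through call_packed.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);       // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c_frag += a * b
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

All five calls are warp-collective: every thread in the warp executes them together, which is what makes the codegen interaction with TVM's thread model nontrivial.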
Calling the wmma APIs can be difficult. wmma::fragment is a template class: we declare a fragment first and then call the other APIs to load data into it. Therefore call_packed does not work here. I think we need a way to generate / insert C++ code snippets into the generated CUDA kernel.
This problem is a little bit tricky and cannot simply be handled with tensorization. Each wmma API requires 32 threads (a warp) to execute with the same address. Tensorizing the result will cause trouble either with the number of threads or with the value of the address.
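To make the warp-uniform address requirement concrete, here is a hedged sketch (kernel and buffer names are illustrative) of a warp-collective copy through an accumulator fragment. The base pointer is derived from the warp index only, so all 32 lanes of a warp pass the same address; a naive per-thread tensorize would instead emit an address that varies with threadIdx.x:

```cuda
#include <mma.h>
using namespace nvcuda;

// Each warp copies one 16x16 float tile from `in` to `out`.
__global__ void warp_collective_copy(const float *in, float *out) {
    // Same value for all 32 lanes of a warp.
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;

    wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag;

    // Correct: the pointer depends only on warp_id, so it is identical
    // across the warp, as load_matrix_sync requires.
    wmma::load_matrix_sync(frag, in + warp_id * 256, 16, wmma::mem_row_major);
    wmma::store_matrix_sync(out + warp_id * 256, frag, 16, wmma::mem_row_major);

    // Wrong (what per-thread tensorization would produce): an address such
    // as `in + threadIdx.x * 16` varies within the warp and violates the
    // warp-uniform requirement.
}
```

This is why the wmma intrinsics have to be modeled as warp-level operations in TVM's IR rather than as 32 independent per-thread calls.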