Tensorize Tensor Core GEMM

In this tutorial (https://tvm.apache.org/docs/tutorials/optimize/opt_conv_tensorcore.html), the Tensor Core GEMM workload is roughly the following (a CUDA-level sketch follows the list):

  1. Load data from global memory to shared memory
  2. Load from shared memory to local memory (registers)
  3. Do the computation and cache the result in registers
  4. Write back from registers to global memory (or shared memory)
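
For concreteness, here is a minimal CUDA sketch of these four steps, assuming one warp computing a single 16x16x16 wmma tile with fp16 inputs and an fp32 accumulator. The kernel name, tiling, and addressing are only illustrative, not the tutorial's actual schedule:

```cuda
#include <mma.h>
using namespace nvcuda;

#define TILE 16

// Sketch: one warp, one 16x16 output tile; block/warp offsets omitted for brevity.
__global__ void wmma_gemm_tile(const half *a_global, const half *b_global,
                               float *c_global, int K, int lda, int ldb, int ldc) {
    // Step 1: stage operand tiles in shared memory.
    __shared__ half a_shared[TILE * TILE];
    __shared__ half b_shared[TILE * TILE];

    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += TILE) {
        // Step 1 (cont.): global -> shared copy, one element per iteration per thread.
        for (int i = threadIdx.x; i < TILE * TILE; i += blockDim.x) {
            a_shared[i] = a_global[(i / TILE) * lda + k + i % TILE];
            b_shared[i] = b_global[(k + i / TILE) * ldb + i % TILE];
        }
        __syncthreads();

        // Step 2: shared -> registers (wmma fragments).
        wmma::load_matrix_sync(a_frag, a_shared, TILE);
        wmma::load_matrix_sync(b_frag, b_shared, TILE);

        // Step 3: compute, accumulating in registers.
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        __syncthreads();
    }

    // Step 4: registers -> global memory (the tensorized store copy).
    wmma::store_matrix_sync(c_global, c_frag, ldc, wmma::mem_row_major);
}
```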

In this discussion, @Hzfengsy says “we need to use a special instruction to do step 4 if we use Tensor Core. We can just tensorize the copy step to do it.”

My question is: I don’t want to use shared memory as intermediate storage for the result, because that limits the amount of shared memory available elsewhere (-> limits compute intensity). If we use tensorized store-copy instructions in step 4, does that mean we can’t fuse the GEMM with other injective ops?

Can anybody explain this or give some suggestions? Thanks!

You are right that we need to use shared memory to store the result, which is what enables fusion with injective ops.
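
For illustration only, a minimal CUDA sketch of why staging the result in shared memory enables fusion: the accumulator is first stored to a shared buffer with the tensorized copy, and the fused injective op is then applied with ordinary per-thread code while writing to global. The helper name and the bias + ReLU epilogue are assumptions, not what TVM actually generates:

```cuda
#include <mma.h>
using namespace nvcuda;

// Sketch: step 4 goes registers -> shared, then the fused epilogue goes shared -> global.
__device__ void store_with_fused_epilogue(
        const wmma::fragment<wmma::accumulator, 16, 16, 16, float> &c_frag,
        const float *bias, float *c_global, int ldc) {
    __shared__ float c_shared[16 * 16];

    // Tensorized store: registers -> shared memory.
    wmma::store_matrix_sync(c_shared, c_frag, 16, wmma::mem_row_major);
    __syncthreads();

    // Fused epilogue: the shared buffer is addressable with plain indices,
    // so any injective op (here bias + ReLU) can be applied before the global write.
    for (int i = threadIdx.x; i < 16 * 16; i += blockDim.x) {
        float v = c_shared[i] + bias[i % 16];
        c_global[(i / 16) * ldc + i % 16] = fmaxf(v, 0.0f);
    }
}
```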

Hello @Hzfengsy, if we use tensorized store-copy instructions in step 4 to write back from registers to global memory without going through shared memory, how can we fuse dense with other injective ops? Do you have any suggestions for us?

Without using shared memory, fused ops have to be performed with global memory access, which is likely to cause performance issues. The CUDA wmma APIs do support limited direct access to the wmma fragment (they only allow applying a uniform operation to the whole fragment), but that is not currently supported in codegen.
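
For reference, a hedged CUDA sketch of that limited direct access: each lane can loop over its own accumulator elements (`frag.num_elements` / `frag.x`), so only the same elementwise operation can be applied everywhere. The helper name and the ReLU epilogue are just for illustration:

```cuda
#include <mma.h>
using namespace nvcuda;

// Sketch: uniform elementwise op on the accumulator fragment, then the
// tensorized register -> global store from step 4.
__device__ void relu_and_store(
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> &c_frag,
        float *c_global, int ldc) {
    for (int i = 0; i < c_frag.num_elements; i++) {
        c_frag.x[i] = fmaxf(c_frag.x[i], 0.0f);   // same op applied to every element
    }
    wmma::store_matrix_sync(c_global, c_frag, ldc, wmma::mem_row_major);
}
```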