Tensorize Tensor Core GEMM

In this tutorial (https://tvm.apache.org/docs/tutorials/optimize/opt_conv_tensorcore.html), the Tensor Core GEMM workload is roughly the following (a CUDA-level sketch follows the list):

  1. Load data from global memory to shared memory
  2. Load from shared memory to local memory (registers)
  3. Do the computation and cache the result in registers
  4. Write back from registers to global memory (or shared memory)
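
For concreteness, here is a minimal CUDA sketch of these four steps, assuming one warp computing a single 16x16x16 wmma tile with fp16 inputs and an fp32 accumulator. The kernel name, tiling, and addressing are only illustrative, not the tutorial's actual schedule:

```cuda
#include <mma.h>
using namespace nvcuda;

#define TILE 16

// Sketch: one warp, one 16x16 output tile; block/warp offsets omitted for brevity.
__global__ void wmma_gemm_tile(const half *a_global, const half *b_global,
                               float *c_global, int K, int lda, int ldb, int ldc) {
    // Step 1: stage operand tiles in shared memory.
    __shared__ half a_shared[TILE * TILE];
    __shared__ half b_shared[TILE * TILE];

    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += TILE) {
        // Step 1 (cont.): global -> shared copy, one element per iteration per thread.
        for (int i = threadIdx.x; i < TILE * TILE; i += blockDim.x) {
            a_shared[i] = a_global[(i / TILE) * lda + k + i % TILE];
            b_shared[i] = b_global[(k + i / TILE) * ldb + i % TILE];
        }
        __syncthreads();

        // Step 2: shared -> registers (wmma fragments).
        wmma::load_matrix_sync(a_frag, a_shared, TILE);
        wmma::load_matrix_sync(b_frag, b_shared, TILE);

        // Step 3: compute, accumulating in registers.
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        __syncthreads();
    }

    // Step 4: registers -> global memory (the tensorized store copy).
    wmma::store_matrix_sync(c_global, c_frag, ldc, wmma::mem_row_major);
}
```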

In this discussion, @Hzfengsy says “we need to use a special instruction to do step 4 if we use Tensor Core. We can just tensorize the copy step to do it.”

My question is: I don’t want to use shared memory as intermediate storage for the result, because that limits the amount of shared memory available elsewhere (-> limits compute intensity). If we use tensorized store-copy instructions in step 4, does that mean we can’t fuse the GEMM with other injective ops?

Can anybody explain this or give some suggestions? Thanks!

You are right that we need to use shared memory to store the result, which is what enables fusion with injective ops.
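
For illustration only, a minimal CUDA sketch of why staging the result in shared memory enables fusion: the accumulator is first stored to a shared buffer with the tensorized copy, and the fused injective op is then applied with ordinary per-thread code while writing to global. The helper name and the bias + ReLU epilogue are assumptions, not what TVM actually generates:

```cuda
#include <mma.h>
using namespace nvcuda;

// Sketch: step 4 goes registers -> shared, then the fused epilogue goes shared -> global.
__device__ void store_with_fused_epilogue(
        const wmma::fragment<wmma::accumulator, 16, 16, 16, float> &c_frag,
        const float *bias, float *c_global, int ldc) {
    __shared__ float c_shared[16 * 16];

    // Tensorized store: registers -> shared memory.
    wmma::store_matrix_sync(c_shared, c_frag, 16, wmma::mem_row_major);
    __syncthreads();

    // Fused epilogue: the shared buffer is addressable with plain indices,
    // so any injective op (here bias + ReLU) can be applied before the global write.
    for (int i = threadIdx.x; i < 16 * 16; i += blockDim.x) {
        float v = c_shared[i] + bias[i % 16];
        c_global[(i / 16) * ldc + i % 16] = fmaxf(v, 0.0f);
    }
}
```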

Hello @Hzfengsy, if we use tensorized store-copy instructions in step 4 to write back from registers to global memory without going through shared memory, how can we fuse dense with other injective ops? Do you have any suggestions for us?

Without using shared memory, fused ops have to be performed with global memory access, which is likely to cause performance issues. The CUDA wmma APIs do support limited direct access to the wmma fragment (they only allow applying a uniform operation to the whole fragment), but that is not currently supported in codegen.
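
For reference, a hedged CUDA sketch of that limited direct access: each lane can loop over its own accumulator elements (`frag.num_elements` / `frag.x`), so only the same elementwise operation can be applied everywhere. The helper name and the ReLU epilogue are just for illustration:

```cuda
#include <mma.h>
using namespace nvcuda;

// Sketch: uniform elementwise op on the accumulator fragment, then the
// tensorized register -> global store from step 4.
__device__ void relu_and_store(
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> &c_frag,
        float *c_global, int ldc) {
    for (int i = 0; i < c_frag.num_elements; i++) {
        c_frag.x[i] = fmaxf(c_frag.x[i], 0.0f);   // same op applied to every element
    }
    wmma::store_matrix_sync(c_global, c_frag, ldc, wmma::mem_row_major);
}
```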