https://tvm.apache.org/docs/tutorials/optimize/opt_conv_tensorcore.html In this tutorial, the Tensor Core GEMM workload is roughly:
- Load data from global memory to shared memory
- Load from shared memory to local memory (registers)
- Do computation and cache the result in register
- Write back from register to global memory (or shared)
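The four steps above can be sketched in plain NumPy, just to illustrate the data flow between memory levels (this is only an analogy; it is not actual Tensor Core / WMMA code, and the names `a_shared`, `a_frag`, etc. are my own):

```python
import numpy as np

def tiled_gemm(A, B, tile=16):
    """Illustrative tiled GEMM mimicking the four staging steps above.
    NumPy arrays stand in for the GPU memory hierarchy levels."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            # accumulator lives in "registers" during the k loop
            acc = np.zeros((tile, tile), dtype=A.dtype)
            for k in range(0, K, tile):
                # step 1 analogue: stage tiles from "global" into "shared"
                a_shared = A[i:i+tile, k:k+tile].copy()
                b_shared = B[k:k+tile, j:j+tile].copy()
                # step 2 analogue: load "shared" tiles into register "fragments"
                a_frag, b_frag = a_shared, b_shared
                # step 3 analogue: compute and accumulate in registers
                acc += a_frag @ b_frag
            # step 4 analogue: write the accumulator back out to "global" C
            C[i:i+tile, j:j+tile] = acc
    return C
```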
In this discussion, @Hzfengsy says "we need to use a special instruction to do step 4 if we use Tensor Core. We can just tensorize the copy step to do it."
My question is: I don't want to use shared memory as intermediate storage for the result, because that limits the amount of shared memory available for the input tiles (and thus limits compute intensity). If we use a tensorized store copy instruction in step 4, does that mean we can't fuse the GEMM with other injective ops?

Can anybody explain this or give some suggestions? Thanks!