I used the following TVM schedule to generate and run an SGEMM kernel on a Vega 10 machine.
One feature that gives high performance is double buffering of the loads of A and B into local memory, so that fetches are pipelined with compute. Here's the sequence of operations as observed in the generated GCN assembly.
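For reference, here is a minimal sketch of the schedule calls that enable this in TVM's TE API. The sizes, names, and split factor are placeholders (the real schedule also needs block/thread binding for the GPU target, which is omitted here), so this is not the exact schedule from this thread:

```python
import tvm
from tvm import te

# Placeholder problem size; the runs below used 4k square matrices.
n = 1024
A = te.placeholder((n, n), name="A", dtype="float32")
B = te.placeholder((n, n), name="B", dtype="float32")
k = te.reduce_axis((0, n), name="k")
C = te.compute((n, n), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
AA = s.cache_read(A, "shared", [C])  # A tile staged in LDS
BB = s.cache_read(B, "shared", [C])  # B tile staged in LDS

ko, ki = s[C].split(C.op.reduce_axis[0], factor=8)
s[AA].compute_at(s[C], ko)           # fetch one tile per ko iteration
s[BB].compute_at(s[C], ko)

# Double buffering: TVM allocates two copies of AA/BB and prefetches the
# next tile while compute consumes the current one.
s[AA].double_buffer()
s[BB].double_buffer()
```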
One thing that might be helpful is to use the callback hack (maybe in OpenCL) to manually hijack the code: start from a TVM-generated version and make the minimum manual changes to arrive at a double-buffered version. see
There is also a similar callback for OpenCL that allows us to hijack the code and make gradual manual changes. Doing so would help us understand the minimum code transformation we need to get the best performance.
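For CUDA this hook is `tvm_callback_cuda_postproc` (used in the TVM tutorials); I'm assuming the OpenCL path exposes an analogous hook, so the exact name for OpenCL would need checking. A sketch of the dump-and-substitute pattern, with hypothetical file names:

```python
import tvm

@tvm.register_func("tvm_callback_cuda_postproc")
def postproc(code):
    # Dump whatever TVM generated so we can hand-edit it.
    with open("generated.cu", "w") as f:
        f.write(code)
    # If a hand-edited version exists, compile that instead --
    # this lets us make gradual manual changes to the kernel.
    try:
        with open("hand_edited.cu") as f:
            return f.read()
    except FileNotFoundError:
        return code
```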
Looking at the assembly snippet, we need to insert the math operations (v_mac_f32) after the loads and before the s_waitcnt vmcnt(0). Ideally we would issue the global_load, then many iterations of MACs, and only then the vmcnt wait and ds_write operations at the end of the loop, as sketched below. Is it possible to modify the TVM scheduler to move this waitcnt down?
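For clarity, this is the shape of the unrolled loop body we are after (illustrative only; the registers and exact load opcodes are made up, not actual Vega 10 output):

```
global_load_dwordx4 v[20:23], v[0:1], off  ; prefetch next A tile
global_load_dwordx4 v[24:27], v[2:3], off  ; prefetch next B tile
v_mac_f32 v40, v4, v8                      ; MACs on the current tile...
v_mac_f32 v41, v4, v9                      ; ...hide the fetch latency
; ... many more v_mac_f32 ...
s_waitcnt vmcnt(0)                         ; only now wait for the loads
ds_write_b128 v50, v[20:23]                ; commit prefetched tiles to LDS
ds_write_b128 v51, v[24:27]
```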
I experimented with injecting modified TVM-generated GCN assembly back into the framework. Moving the waitcnt and ds_write instructions towards the end of the unrolled loop improves performance by ~15% (8126 GFLOPS vs. 7016 GFLOPS) for 4k square matrices.
The AMD LLVM backend does not push these writes down in the double-buffered case. The fetch is overlapped with compute that reads the same tensor from local memory, although from the other buffer instance. I would rather have the schedule push them down.