Double buffer implementation for matrix multiply

I used the following TVM schedule to generate and run an SGEMM kernel on a Vega10 machine.

One feature that gives high performance is double buffering the loads of A and B into local memory, so as to pipeline fetches with compute. Here's the sequence of operations as observed in the generated GCN assembly.

            global_load_dwordx4 v[72:75], v[72:73], off
            global_load_dwordx4 v[76:79], v[70:71], off
            global_load_dwordx4 v[81:84], v[81:82], off
            global_load_dwordx4 v[85:88], v[85:86], off

            ds_write_b64 v90, v[78:79] offset:8
            ds_write2_b64 v90, v[76:77], v[74:75] offset1:17
            s_waitcnt vmcnt(0)
            ds_write2_b64 v89, v[85:86], v[87:88] offset1:1
            ds_write_b64 v89, v[83:84] offset:136
            ds_write2st64_b64 v91, v[81:82], v[72:73] offset1:8

            ds_read2_b64 v[76:79], v80 offset1:16
            ds_read2_b64 v[72:75], v70 offset1:16
            ds_read2_b64 v[81:84], v70 offset0:1 offset1:17
            ds_read_b64 v[85:86], v71 offset:6024
            ds_read_b64 v[87:88], v80 offset:8

            v_mac_f32_e32 v66, v76, v72
            …
            ds_read2_b64 v[76:79], v76 offset1:15

I believe latency-hiding mechanisms are absent at both the global and local memory levels in this implementation.

I think the double buffering has to be tied to the unroll factor for each of the tensors.

For a 4K matrix, this TVM schedule achieves 7.5 TFLOPS, compared to 12.2 TFLOPS with a hand-optimized kernel on Vega10.

One thing that might be helpful is to use the callback hack (maybe in OpenCL) to manually hijack the code: start from a TVM-generated version and make the minimum manual changes needed to arrive at a double-buffered version. see

There is also a similar callback for OpenCL that allows us to hijack the code and make gradual manual changes. Doing so would help us understand the minimum code transformation we need to get the best performance.
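For concreteness, here is a minimal sketch of that callback hack using the CUDA post-processing hook (tvm_callback_cuda_postproc) from the 0.x-era TVM API; I am assuming an analogous hook exists for the OpenCL/ROCm path, and the file names below are just placeholders:

    import os
    import tvm

    # Sketch of the source-hijack callback: dump the generated kernel and, if a
    # hand-edited copy exists, feed it back in place of the generated code.
    # File names are placeholders; the OpenCL/ROCm analog is assumed.
    @tvm.register_func("tvm_callback_cuda_postproc")
    def postproc(code):
        with open("kernel_generated.txt", "w") as f:
            f.write(code)
        if os.path.exists("kernel_manual.txt"):
            with open("kernel_manual.txt") as f:
                code = f.read()
        return code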

Looking at the assembly snippet, we need to insert the math operations (v_mac_f32) after the loads and before the s_waitcnt vmcnt(0). Ideally we would issue the global_loads, then many iterations of MACs, and only then the vmcnt wait and ds_write operations at the end of the loop. Is it possible to modify the TVM schedule to move this waitcnt down?

Here's how the double buffering is structured right now:

load_global AA
store_local AA
load_global BB
store_local BB

branch BB0_2

BB0_1:
load_global AA
store_local AA
load_global BB
store_local BB
load_local AL
load_local BL
FMA
load_local AL
load_local BL
FMA

BB0_2:
barrier
branch if loopcount > 0 BB0_1

The correct implementation would put distance between the global load and its corresponding write to local memory:

load_global AA
load_global BB

branch BB0_2

BB0_1:
load_global AA
load_global BB
load_local AL(i-1)
load_local BL(i-1)
FMA
load_local AL(i-1)
load_local BL(i-1)
FMA

BB0_2:
store_local AA(i)
store_local BB(i)
barrier
branch if loopcount > 0 BB0_1

I tried another experiment to check double buffering into registers from local memory, using:

s[AL].double_buffer()
s[BL].double_buffer()

instead of

s[AA].double_buffer()
s[BB].double_buffer()

This improves performance to 7.9 TFLOPS.
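For clarity, here is a minimal sketch of the schedule structure these experiments refer to, written against the 0.x-era TVM API. The stage names AA/BB/AL/BL/CC follow this thread (CC is assumed to be a local cache_write of C), and the tiling, compute_at and thread-binding steps are omitted, so treat this as an illustration rather than the exact schedule used:

    import tvm

    n = 4096
    A = tvm.placeholder((n, n), name="A")
    B = tvm.placeholder((n, n), name="B")
    k = tvm.reduce_axis((0, n), name="k")
    C = tvm.compute((n, n), lambda i, j: tvm.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = tvm.create_schedule(C.op)
    CC = s.cache_write(C, "local")         # accumulate in registers
    AA = s.cache_read(A, "shared", [CC])   # global -> shared (LDS) staging for A
    BB = s.cache_read(B, "shared", [CC])   # global -> shared (LDS) staging for B
    AL = s.cache_read(AA, "local", [CC])   # shared -> register staging for A
    BL = s.cache_read(BB, "local", [CC])   # shared -> register staging for B

    # Variant 1: double-buffer the shared-memory stages (the 7.5 TFLOPS case).
    s[AA].double_buffer()
    s[BB].double_buffer()
    # Variant 2: double-buffer the register stages instead (the 7.9 TFLOPS case).
    # s[AL].double_buffer()
    # s[BL].double_buffer()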

I experimented with injecting modified TVM-generated GCN assembly back into the framework. Moving the waitcnt and write instructions towards the end of the unrolled loop improves performance by ~15% (8126 GFLOPS vs 7016 GFLOPS) for a 4K square matrix.

You might be interested in the inline asm PR.

TVM's cache_read schedule primitive allows caching global memory loads into the shared memory address space:
AA = s.cache_read(A, "shared", [CC])

If I have to separate the writes to local memory from the reads from global memory, is there a way to do that as a schedule directive?

Good point. I think it should be possible, but I'm not exactly sure how. You can take a look at the CUDA schedules for conv2d for inspiration.

They use cache_read on “global” followed by cache_read on “shared”. I don’t know why they do it this way, but it may be related to your problem.
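Against a fresh schedule (i.e., as an alternative to the shared-only staging in the sketch above), a hedged guess at what that pattern looks like in code; the stage names are illustrative, and like you I am not certain of the intent:

    # Two-level read caching as described: a "global"-scope stage feeding a
    # "shared"-scope stage.
    AA_g = s.cache_read(A, "global", [CC])
    AA_s = s.cache_read(AA_g, "shared", [CC])
    # The two stages get separate attach points, so in principle the global read
    # could be placed at a different loop level than the shared-memory write,
    # e.g. (hypothetical axes ko, ki):
    # s[AA_g].compute_at(s[CC], ko)
    # s[AA_s].compute_at(s[CC], ki)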

@tqchen can you comment? We want to know if it is possible to prevent a shared-memory write that immediately follows a global read. It stalls everything.

@milindn btw, does the LLVM AMDGPU backend do this kind of scheduling optimization?

The AMD LLVM backend does not push these writes out in the double-buffered case. The fetch is overlapped with compute that reads the same tensor from local memory, although from a different buffer instance. I would rather have the schedule push it out.