Hi,
This gist has two LLVM IR files for a 4k x 4k GEMM kernel generated for ROCm (Vega10 GPU):
asm_gemm_4k.ll: this kernel performs at peak; it was generated from HIP + inline asm.
tvm_gemm_4k.ll: this kernel does not perform well, and below are the issues I think are causing it.
Issues:
Non-vectorized loads. The compiler can find some vectorized loads, but not all the time. TVM codegen for ROCm should express float4s rather than floats. (Using floats works well for CUDA, as nvcc can find places to vectorize.) To get peak throughput, explicitly declaring the loads as float4 can help.
Irregular load patterns from LDS. LDS loads are not properly coalesced in the IR; using structs or multiples of float4s will get rid of bank conflicts in LDS.
With these fixed, we can judge how good LLVM is at translating IR to asm.