I am new to TVM and am learning by trying to optimize a simple matrix multiplication of two dense matrices. I am using an MI300X and the amdgpu backend.
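For reference, this is roughly the workload I am benchmarking (a simplified sketch, not my exact script; the float16/float32 dtypes are what I believe I'm using, and I'm assuming `gfx942` is the right `-mcpu` for MI300X):

```python
import tvm
from tvm import te

# 4096 x 4096 dense matmul, fp16 inputs with fp32 accumulation (assumed dtypes)
N = 4096
A = te.placeholder((N, N), dtype="float16", name="A")
B = te.placeholder((N, N), dtype="float16", name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute(
    (N, N),
    lambda i, j: te.sum(A[i, k].astype("float32") * B[k, j].astype("float32"), axis=k),
    name="C",
)

# ROCm target; gfx942 is my guess for the MI300X architecture flag
target = tvm.target.Target("rocm -mcpu=gfx942")
```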
With PyTorch, I get about 100 TFLOPS when multiplying 4096x4096 matrices. Writing a tiled matrix multiplication directly in HIP, I get about 80 TFLOPS. But the most I can get out of TVM is about 10 TFLOPS. I have tried applying the dlight matmul schedules (roughly as sketched below) but haven't had much success.
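This is roughly how I applied the dlight rule (again a sketch; I'm going by the `ApplyDefaultSchedule` and `gpu.Matmul` names in the `tvm.dlight` module, so my real invocation may differ slightly):

```python
import tvm
from tvm import te
from tvm import dlight as dl

# A, B, C and target as defined in the snippet above
mod = tvm.IRModule({"main": te.create_prim_func([A, B, C])})

# Apply the default dlight GPU matmul scheduling rule under the ROCm target
with target:
    mod = dl.ApplyDefaultSchedule(dl.gpu.Matmul())(mod)

lib = tvm.build(mod, target=target)
```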
I have also tried meta_schedule (see the sketch below), but haven't been able to get much speedup over the untuned schedules.
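My meta_schedule attempt looked roughly like this (the work directory and trial budget are placeholders; I'm assuming `ms.tune_tir` followed by `ms.tir_integration.compile_tir` is the intended flow):

```python
import tvm
from tvm import te
from tvm import meta_schedule as ms

# A, B, C and target as defined in the first snippet above
mod = tvm.IRModule({"main": te.create_prim_func([A, B, C])})

database = ms.tune_tir(
    mod=mod,
    target=target,
    work_dir="./ms_work_dir",   # placeholder path
    max_trials_global=1000,     # placeholder trial budget
)

# Pull the best schedule found during tuning out of the database and build it
sch = ms.tir_integration.compile_tir(database, mod, target)
if sch is not None:
    lib = tvm.build(sch.mod, target=target)
```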
Does anyone know where I can look to learn how to get better performance? I am not sure where to go from here.
Really appreciate the project! Thanks