I am new to TVM and am learning by trying to optimize a simple matrix multiplication of two dense matrices. I am using an MI300X and the amdgpu backend.
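For reference, this is roughly the workload I am benchmarking (a simplified sketch, not my exact script; the float16/float32 dtypes are what I believe I'm using, and I'm assuming `gfx942` is the right `-mcpu` for MI300X):

```python
import tvm
from tvm import te

# 4096 x 4096 dense matmul, fp16 inputs with fp32 accumulation (assumed dtypes)
N = 4096
A = te.placeholder((N, N), dtype="float16", name="A")
B = te.placeholder((N, N), dtype="float16", name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute(
    (N, N),
    lambda i, j: te.sum(A[i, k].astype("float32") * B[k, j].astype("float32"), axis=k),
    name="C",
)

# ROCm target; gfx942 is my guess for the MI300X architecture flag
target = tvm.target.Target("rocm -mcpu=gfx942")
```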
With PyTorch, I get about 100 TFLOPS when multiplying 4096x4096 matrices. Writing a tiled matrix multiplication directly in HIP, I get about 80 TFLOPS. But the most I can get out of TVM is about 10 TFLOPS. I have tried applying the dlight matmul schedules (roughly as sketched below) but haven't had much success.
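This is roughly how I applied the dlight rule (again a sketch; I'm going by the `ApplyDefaultSchedule` and `gpu.Matmul` names in the `tvm.dlight` module, so my real invocation may differ slightly):

```python
import tvm
from tvm import te
from tvm import dlight as dl

# A, B, C and target as defined in the snippet above
mod = tvm.IRModule({"main": te.create_prim_func([A, B, C])})

# Apply the default dlight GPU matmul scheduling rule under the ROCm target
with target:
    mod = dl.ApplyDefaultSchedule(dl.gpu.Matmul())(mod)

lib = tvm.build(mod, target=target)
```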
I have also tried meta_schedule (see the sketch below), but haven't been able to get much speedup over the untuned schedules.
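My meta_schedule attempt looked roughly like this (the work directory and trial budget are placeholders; I'm assuming `ms.tune_tir` followed by `ms.tir_integration.compile_tir` is the intended flow):

```python
import tvm
from tvm import te
from tvm import meta_schedule as ms

# A, B, C and target as defined in the first snippet above
mod = tvm.IRModule({"main": te.create_prim_func([A, B, C])})

database = ms.tune_tir(
    mod=mod,
    target=target,
    work_dir="./ms_work_dir",   # placeholder path
    max_trials_global=1000,     # placeholder trial budget
)

# Pull the best schedule found during tuning out of the database and build it
sch = ms.tir_integration.compile_tir(database, mod, target)
if sch is not None:
    lib = tvm.build(sch.mod, target=target)
```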
Does anyone know where I can look to learn how to get better performance? I am not sure where to go from here.
Really appreciate the project! Thanks