I used the matrix multiplication code from "4. Matrix Multiplication" in the Dive into Deep Learning Compiler 0.1 documentation (d2l.ai). To improve performance, I wanted to avoid bank conflicts and leverage double buffering, loop unrolling, etc.
I see that TVM provides scheduling primitives such as:
vectorize(var) // Vectorize the iteration.
unroll(var) // Unroll the iteration.
parallel(var) // Parallelize the iteration.
prefetch(tensor, var, offset) // Prefetch the specified variable.
storage_align(axis, factor, offset) // Set an alignment requirement for a specific axis.
double_buffer() // Compute the current stage via double buffering.
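To make sure I understand what `double_buffer()` is supposed to do, I wrote a plain-NumPy model of the access pattern: the k-loop is tiled, and tiles of A go through a two-slot ping-pong buffer, so that while one slot is consumed by the compute step the other slot holds the prefetched next tile. This is just my mental model, not TVM code; the function name and tile size are mine.

```python
import numpy as np

def batched_matmul_db(A, B, tile=4):
    """Batched matmul with a tiled k-loop and a two-slot ping-pong
    buffer for the A tiles -- a software model of the load/compute
    overlap that double buffering is meant to enable."""
    bs, n, m = A.shape
    C = np.zeros((bs, n, B.shape[2]), dtype=A.dtype)
    for b in range(bs):
        # prologue: load the first tile into slot 0
        buf = [A[b, :, :min(tile, m)].copy(), None]
        for idx, k0 in enumerate(range(0, m, tile)):
            cur, nxt = idx % 2, (idx + 1) % 2
            k1 = min(k0 + tile, m)
            if k1 < m:
                # "prefetch" the next A tile into the other slot;
                # in hardware this load would overlap the compute below
                buf[nxt] = A[b, :, k1:min(k1 + tile, m)].copy()
            # consume the current slot
            C[b] += buf[cur] @ B[b, k0:k1, :]
    return C
```

If my understanding here is wrong, corrections are welcome.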
However, there are no good examples of how to use these, especially for batched matrix multiplication. I tried adding them to the code linked above but did not see much improvement in runtime. Can someone please help me use these functions correctly to improve performance?