Optimizing Matrix Multiplication

I have used the matrix multiplication code from "4. Matrix Multiplication — Dive into Deep Learning Compiler 0.1 documentation" (d2l.ai). To improve performance, I wanted to avoid bank conflicts and leverage double buffering, loop unrolling, etc.

I see that TVM provides scheduling primitives such as:

- `vectorize(var)`: vectorize the iteration.
- `unroll(var)`: unroll the iteration.
- `parallel(var)`: parallelize the iteration.
- `prefetch(tensor, var, offset)`: prefetch the specified tensor.
- `storage_align(axis, factor, offset)`: set an alignment requirement for a specific axis.
- `double_buffer()`: compute the current stage via double buffering.

However, there are no good examples of how to use these, specifically for batched matrix multiplication. I tried adding them to the code linked above but didn't see much improvement in runtime. Can someone please help me use these functions correctly to improve performance?