I’m working on accelerating batched matrix multiplication. I have read Alibaba’s work on accelerating machine translation with TVM, and now I want to run some experiments on batched matmul with TVM as well.
I noticed that TVM can only generate one fixed-size matmul function at a time.
However, a higher-level function like cublasSgemmBatched can handle any size.
Does this mean I have to generate several fixed-size kernels and wrap them into a function like cublasSgemmBatched, or is there a smarter way to directly build a function that works like cublasSgemmBatched does?
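To make “wrapping” concrete, here is a rough sketch of what I have in mind (all names here are hypothetical, and each entry would come from a separate fixed-size TVM build):

```python
# Hypothetical wrapper: compile one fixed-size TVM module per shape,
# then dispatch on the runtime shapes, mimicking cublasSgemmBatched.
_kernels = {}  # maps (batch, M, N, K) -> a compiled TVM function


def batched_matmul(a, w, out):
    # a: (batch, M, K), w: (batch, K, N), out: (batch, M, N)
    key = (a.shape[0], a.shape[1], w.shape[2], a.shape[2])
    _kernels[key](a, w, out)  # KeyError if this shape was never compiled
```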
The shapes (sizes) will dramatically affect the speed of different kernel implementations. Typically in TVM the approach is to define a single schedule template that describes a class of kernel implementations without filling in details such as tiling sizes or thread block dimensions, and to then automatically tune this template for each configuration. This is our “smarter” way of generating variations for each shape that you want to support, but at the end of the day we do recompile for each different shape.
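For concreteness, here is a minimal sketch of such a template using autotvm. The template name, knob names, and thread mapping below are illustrative assumptions, not a reference implementation; the point is that the shape (B, M, N, K) is fixed per instantiation while the tiling factors are left to the tuner:

```python
import tvm
from tvm import te, autotvm


@autotvm.template("example/batch_matmul")  # template name is illustrative
def batch_matmul(B, M, N, K):
    # Declare the batched matmul computation for one fixed shape.
    A = te.placeholder((B, M, K), name="A", dtype="float32")
    W = te.placeholder((B, K, N), name="W", dtype="float32")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute(
        (B, M, N),
        lambda b, i, j: te.sum(A[b, i, k] * W[b, k, j], axis=k),
        name="C",
    )
    s = te.create_schedule(C.op)

    # Tunable knobs: the tuner, not the template author, picks the factors.
    cfg = autotvm.get_config()
    b, i, j = s[C].op.axis
    cfg.define_split("tile_i", i, num_outputs=2)
    cfg.define_split("tile_j", j, num_outputs=2)
    io, ii = cfg["tile_i"].apply(s, C, i)
    jo, ji = cfg["tile_j"].apply(s, C, j)

    # One possible GPU mapping of the tiled loops; the reduction over k
    # stays as a serial loop inside each thread.
    s[C].reorder(b, io, jo, ii, ji)
    s[C].bind(b, te.thread_axis("blockIdx.z"))
    s[C].bind(io, te.thread_axis("blockIdx.y"))
    s[C].bind(jo, te.thread_axis("blockIdx.x"))
    s[C].bind(ii, te.thread_axis("threadIdx.y"))
    s[C].bind(ji, te.thread_axis("threadIdx.x"))
    return s, [A, W, C]


# Instantiate and tune the same template once per shape you need, e.g.:
# task = autotvm.task.create("example/batch_matmul",
#                            args=(16, 64, 64, 64), target="cuda")
```

The tuner then measures candidate configurations on the real device and records the best one per shape, so each fixed-size kernel you get out is specialized without you hand-picking the tiling.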
Longer explanation below…
You will likely find that even though routines in vendor specific libraries support “every” size in a “single” kernel, there are usually many shape-specific branches in the code which will either vary implementation details or use a different strategy altogether. Even worse, the rules for which strategy to pick may not be intuitive and may leave corner-cases poorly supported. The “coverage” provided by libraries can often be an illusion created by many hours of laborious manual tuning.
We also find that TVM’s performance advantages often come from the ability to highly specialize to a particular operator shape. Often, the newest workloads (e.g., cutting-edge CNN topologies) are the ones where TVM enjoys the highest performance relative to hand-tuned libraries, because those libraries, despite being “generic,” have limited support for some shapes.
Thank you so much for your kind reply.
So based on your suggestion, can I frame the problem this way: I should work toward specializing the computation for the particular shapes in our model, rather than trying to beat cublasSgemmBatched across the board, right?