These days I am working on some tensorization stuff, and I found several things that make the current tensorize interface insufficient.
First, the tensorization declaration interface requires a TVM op. Originally I supposed it served the purpose of software emulation: when the underlying hardware has no corresponding intrinsic support, we can use this op to replace the code segment and at least guarantee correctness.
However, after using this interface, I realized the true purpose of this parameter is to indicate the shape of the input/output data; what we actually do in the op does not matter. This is a little counter-intuitive for developers, I suppose. Could we just have an OpaqueOp that only accepts input shapes and output shapes and does nothing?
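A rough sketch of what I have in mind. Note that `OpaqueOp` and its methods are hypothetical names, not existing TVM API; this is only to illustrate the proposal:

```python
# Hypothetical sketch of the proposed "OpaqueOp": it records only the
# input/output shapes and carries no compute body. Nothing here is real
# TVM API; it just shows what tensorize actually needs from the op.

class OpaqueOp:
    """A placeholder op that declares input/output shapes and does nothing."""

    def __init__(self, input_shapes, output_shapes):
        self.input_shapes = [tuple(s) for s in input_shapes]
        self.output_shapes = [tuple(s) for s in output_shapes]

    def matches(self, in_shapes, out_shapes):
        # Tensorize only needs this shape check; the compute body is irrelevant.
        return (self.input_shapes == [tuple(s) for s in in_shapes]
                and self.output_shapes == [tuple(s) for s in out_shapes])

# Declaring an 8x8x8 GEMM intrinsic would then need no dummy compute:
gemm_intrin = OpaqueOp(input_shapes=[(8, 8), (8, 8)], output_shapes=[(8, 8)])
```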
Second, another thing I noticed is that tensorization is essentially a "primitive sugar" or "code transformation sugar" that offloads the IR under a certain loop level. This interface is not aware of whether the loop body is perfectly tiled, so the primitive cannot be applied when the loop tiling is imperfect.
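To make the tiling point concrete, here is a plain-Python illustration (not TVM code): when the tile size does not evenly divide the loop extent, tiling leaves a tail loop, so one inner body no longer has the fixed shape the intrinsic was declared for.

```python
# Plain-Python illustration of perfect vs. imperfect tiling (not TVM code).
def tiled_extents(extent, tile):
    """Return the list of inner-loop extents produced by tiling."""
    full = extent // tile
    tail = extent % tile
    return [tile] * full + ([tail] if tail else [])

print(tiled_extents(16, 4))  # perfect:   [4, 4, 4, 4] -- every body matches the intrin shape
print(tiled_extents(10, 4))  # imperfect: [4, 4, 2]    -- the tail body has a different shape
```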
I am curious if we can work around these two issues?
What do you mean exactly by "imperfect loop tiling"?
On the first issue, tensorization lets us essentially inline high-performance code that implements a matrix-matrix or matrix-vector multiplication inner-loop body. This is very useful when targeting special hardware intrinsics, like performing AVX512-based GEMV, invoking an accelerator's tensor core ISA, or performing neat tricks like bit-serial operations with vectorized popcount on ARM CPUs.
That's another problem: AVX512 instructions are mostly 1-D, so they often do not care about the shape (I hope my assertion is correct).
The offloaded intrin still requires the shape of a small tensor, which makes the intrin definition ad-hoc. Sometimes, as in NCHWxc, it is a cross-dimension op; sometimes it is a simple 1-D operation. It is hard to find one piece that fits all once shape is introduced.
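For example (hypothetical shape descriptors, just to illustrate the point): the same shape-agnostic multiply-accumulate instruction ends up declared with different intrin shapes depending on the surrounding layout.

```python
# Hypothetical illustration: one 1-D multiply-accumulate kernel, but two
# differently-shaped intrin declarations depending on the surrounding layout.
# The descriptor dicts below are made up for illustration, not TVM API.

def fma_1d(acc, a, b):
    # The actual instruction is shape-agnostic: elementwise multiply-accumulate.
    return [acc[i] + a[i] * b[i] for i in range(len(a))]

# Declared as a plain 1-D intrin: inputs/outputs are length-16 vectors.
dot_intrin_shape = {"inputs": [(16,), (16,)], "output": (16,)}

# Inside an NCHWxc conv micro-kernel, the same instruction gets wrapped in
# an intrin whose declared shape spans multiple dimensions.
nchwc_intrin_shape = {"inputs": [(1, 16), (16, 16)], "output": (1, 16)}
```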