Best way to deal with kernel layout?

Thanks for the suggestion, @comaniac. Adding a matmul operator with implementations for every combination of input layouts seems like overkill to me. Instead, a target-specific Relay pass that handles this target-specific case would be a better solution: it is lightweight and orthogonal to the main TVM passes.
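To make the idea concrete, here is a rough sketch of what I have in mind. It is plain Python over a toy expression tree, not actual Relay code: `Var`, `Transpose`, `MatMul`, `legalize_matmul`, and the target name `toy_accel` are all made up for illustration, and I am assuming the target's kernel only supports the layout where the second input is transposed. The pass rewrites every other layout combination into that one by inserting explicit transposes, so only a single matmul implementation is needed.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Transpose:
    arg: "Expr"


@dataclass(frozen=True)
class MatMul:
    lhs: "Expr"
    rhs: "Expr"
    transpose_a: bool = False
    transpose_b: bool = False


Expr = Union[Var, Transpose, MatMul]


def legalize_matmul(expr: Expr, target: str) -> Expr:
    """Rewrite matmuls for a hypothetical target that only has an NT kernel.

    Assumption: "toy_accel" supports only transpose_a=False, transpose_b=True.
    Any other target is left untouched, so the pass stays target-specific.
    """
    if target != "toy_accel":
        return expr
    if isinstance(expr, Var):
        return expr
    if isinstance(expr, Transpose):
        return Transpose(legalize_matmul(expr.arg, target))

    # Rewrite the operands first so nested matmuls are handled too.
    lhs = legalize_matmul(expr.lhs, target)
    rhs = legalize_matmul(expr.rhs, target)

    # Fold the layout flags into explicit transposes:
    #   op(A) @ op(B) == A' @ (B')^T  with  A' = op(A), B' = op(B)^T
    if expr.transpose_a:
        lhs = Transpose(lhs)
    if not expr.transpose_b:
        rhs = Transpose(rhs)
    return MatMul(lhs, rhs, transpose_a=False, transpose_b=True)


# A^T @ B becomes an NT matmul over explicitly transposed inputs.
original = MatMul(Var("A"), Var("B"), transpose_a=True, transpose_b=False)
rewritten = legalize_matmul(original, "toy_accel")
```

In a real implementation the inserted transposes on constant weights could presumably be folded away at compile time, which is why I think the pass-based approach stays cheap.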