[TIR] Problem inlining addition into matmul block

I see, indeed. For this case, the code should already be efficient enough. The temp buffer will get narrowed to a size-1 buffer and then lowered into a register during codegen.
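
To illustrate what that narrowing amounts to, here is a rough sketch in plain Python (not TIR, and not actual TVM output). The function names and the `bias` argument are made up for the example. The first version keeps a full-size `temp`; the second is what the loop nest effectively looks like once the addition sits under the same `i, j` loops and `temp` only needs one element, which a backend compiler can then keep in a register.

```python
def matmul_add_full_temp(A, B, bias):
    """Matmul into a full-size temp buffer, then a separate addition pass."""
    n, k, m = len(A), len(B), len(B[0])
    temp = [[0.0] * m for _ in range(n)]  # full n x m intermediate
    for i in range(n):
        for j in range(m):
            for r in range(k):
                temp[i][j] += A[i][r] * B[r][j]
    # Second pass: elementwise addition over the whole temp buffer.
    return [[temp[i][j] + bias[i][j] for j in range(m)] for i in range(n)]


def matmul_add_narrowed_temp(A, B, bias):
    """Same computation, with temp narrowed to a single element per (i, j)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            temp = [0.0]  # size-1 buffer; effectively a scalar accumulator
            for r in range(k):
                temp[0] += A[i][r] * B[r][j]
            # The addition consumes temp right after the reduction finishes.
            C[i][j] = temp[0] + bias[i][j]
    return C
```

Both versions compute the same result; the second just never materializes the full intermediate, so there is little left to gain from inlining the addition into the matmul block itself.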