I am working on integrating a custom tensor accelerator with TVM via BYOC.
So far I have figured out how to tensorize the matrix-core operation, the input/weight loads, and the output write in the case where the shapes are evenly divisible by the intrinsic shapes - the perfect case (a simplified sketch of that kind of schedule is below). Now I need to figure out the general case, i.e. handling arbitrary shapes.
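For reference, here is a simplified sketch of the kind of schedule I have working for the perfect case - the 16x16x16 tile, the `accel_*` intrinsic names and the `global.accel_*` scopes are placeholders for my accelerator's real ones, and the shapes are chosen to be exact multiples of the tile:

import tvm
from tvm import te, tir

TILE = 16  # placeholder: my real intrinsic tile size goes here

# A matmul whose dimensions are all multiples of TILE, so no padding is needed.
M, N, K = 128, 512, 784
A = te.placeholder((M, K), "float32", name="A")
B = te.placeholder((K, N), "float32", name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

sch = tir.Schedule(te.create_prim_func([A, B, C]))
blk = sch.get_block("C")
i, j, r = sch.get_loops(blk)
io, ii = sch.split(i, [None, TILE])
jo, ji = sch.split(j, [None, TILE])
ko, ki = sch.split(r, [None, TILE])
sch.reorder(io, jo, ko, ii, ji, ki)

# Stage operands and result in (placeholder) accelerator scopes so that the
# loads and the store become separate blocks that can be tensorized on their own.
a_l = sch.cache_read(blk, 0, "global.accel_a")
b_l = sch.cache_read(blk, 1, "global.accel_b")
c_l = sch.cache_write(blk, 0, "global.accel_c")
sch.compute_at(a_l, ko)
sch.compute_at(b_l, ko)
sch.reverse_compute_at(c_l, jo)

# With evenly divisible shapes all of these tensorize cleanly (the intrinsics
# being registered beforehand with tir.TensorIntrin.register):
# sch.tensorize(ii, "accel_mma_16x16x16")
# sch.tensorize(sch.get_loops(a_l)[-2], "accel_load_a")
# sch.tensorize(sch.get_loops(b_l)[-2], "accel_load_b")
# sch.tensorize(sch.get_loops(c_l)[-2], "accel_store_c")
print(sch.mod.script())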
What I have thought of so far:
- Introduce padding and slicing at the compute level - I have a feeling this would lead to extra operations, which isn't ideal.
- Use `tir.Schedule.pad_einsum`, which lets me tensorize the matrix-core operation but introduces an `if_then_else` in the load blocks, which prevents their tensorization from working (rough sketch of what I am doing after this list). I went through this GitHub issue from 2017 where @tqchen mentions `tir.transform.InjectCopyIntrin`, but I am not sure how to use it in my context. I also saw `tir.transform.LowerAutoCopy`, which could be of interest too, although it seems to target GPU compilation.
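Concretely, here is roughly what I am doing with `pad_einsum` on the `matmul` PrimFunc of the module pasted further down (simplified; the 16x16x16 tile and the intrinsic name are placeholders):

import tvm
from tvm import tir

TILE = 16  # placeholder intrinsic tile size

sch = tir.Schedule(Module)  # `Module` is the IRModule pasted below
sch.work_on("matmul")
blk = sch.get_block("T_matmul_NN")

# Stage the operands and the result so the loads/store become their own blocks.
a_l = sch.cache_read(blk, 0, "global")
b_l = sch.cache_read(blk, 1, "global")
c_l = sch.cache_write(blk, 0, "global")

# Pad (i, j, k) = (1, 512, 784) up to multiples of the tile; only i actually
# changes here (1 -> 16), since 512 and 784 are already multiples of 16.
sch.pad_einsum(blk, [TILE, TILE, TILE])

i, j, k = sch.get_loops(blk)
io, ii = sch.split(i, [None, TILE])
jo, ji = sch.split(j, [None, TILE])
ko, ki = sch.split(k, [None, TILE])
sch.reorder(io, jo, ko, ii, ji, ki)

# Tensorizing the core computation now works, e.g.
#   sch.tensorize(ii, "accel_mma_16x16x16")  # placeholder intrinsic name
# but the staged load blocks now contain T.if_then_else guards for the padded
# rows, and tensorizing those load blocks fails.
print(sch.mod.script())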
Here is the example module I am working with. It would be really nice to get a fully working solution from it, i.e. being able to tensorize the core matrix op and the loads even though the intrinsic shape does not divide the buffer shapes evenly. (Below the module I also pasted, for reference, roughly what the problematic load block looks like.)
from tvm.script import ir as I
from tvm.script import tir as T
from tvm.script import relax as R

@I.ir_module
class Module:
    @T.prim_func
    def matmul(A: T.Buffer((T.int64(1), T.int64(784)), "float32"), B: T.Buffer((T.int64(784), T.int64(512)), "float32"), T_matmul_NN: T.Buffer((T.int64(1), T.int64(512)), "float32")):
        T.func_attr({"layout_free_buffers": [1], "tir.noalias": T.bool(True)})
        # with T.block("root"):
        for i, j, k in T.grid(T.int64(1), T.int64(512), T.int64(784)):
            with T.block("T_matmul_NN"):
                v_i, v_j, v_k = T.axis.remap("SSR", [i, j, k])
                T.reads(A[v_i, v_k], B[v_k, v_j])
                T.writes(T_matmul_NN[v_i, v_j])
                with T.init():
                    T_matmul_NN[v_i, v_j] = T.float32(0)
                T_matmul_NN[v_i, v_j] = T_matmul_NN[v_i, v_j] + A[v_i, v_k] * B[v_k, v_j]

    @T.prim_func
    def matmul1(A: T.Buffer((T.int64(1), T.int64(512)), "float32"), B: T.Buffer((T.int64(512), T.int64(512)), "float32"), T_matmul_NN: T.Buffer((T.int64(1), T.int64(512)), "float32")):
        T.func_attr({"layout_free_buffers": [1], "tir.noalias": T.bool(True)})
        # with T.block("root"):
        for i, j, k in T.grid(T.int64(1), T.int64(512), T.int64(512)):
            with T.block("T_matmul_NN"):
                v_i, v_j, v_k = T.axis.remap("SSR", [i, j, k])
                T.reads(A[v_i, v_k], B[v_k, v_j])
                T.writes(T_matmul_NN[v_i, v_j])
                with T.init():
                    T_matmul_NN[v_i, v_j] = T.float32(0)
                T_matmul_NN[v_i, v_j] = T_matmul_NN[v_i, v_j] + A[v_i, v_k] * B[v_k, v_j]

    @R.function
    def main(inp_0: R.Tensor((1, 1, 28, 28), dtype="float32")) -> R.Tensor((1, 512), dtype="float32"):
        cls = Module
        with R.dataflow():
            lv: R.Tensor((1, 784), dtype="float32") = R.reshape(inp_0, R.shape([1, 784]))
            lv1: R.Tensor((784, 512), dtype="float32") = R.permute_dims(metadata["relax.expr.Constant"][0], axes=None)
            lv2 = R.call_tir(cls.matmul, (lv, lv1), out_sinfo=R.Tensor((1, 512), dtype="float32"))
            lv3: R.Tensor((1, 512), dtype="float32") = R.add(lv2, metadata["relax.expr.Constant"][1])
            lv4: R.Tensor((512, 512), dtype="float32") = R.permute_dims(metadata["relax.expr.Constant"][2], axes=None)
            lv5 = R.call_tir(cls.matmul1, (lv3, lv4), out_sinfo=R.Tensor((1, 512), dtype="float32"))
            lv6: R.Tensor((1, 512), dtype="float32") = R.add(lv5, metadata["relax.expr.Constant"][3])
            gv: R.Tensor((1, 512), dtype="float32") = lv6
            R.output(gv)
        return gv
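For reference, the kind of load block I end up with after `pad_einsum` looks roughly like this (reconstructed by hand to illustrate the problem, not the exact printout) - it is the `T.if_then_else` in the body that stops the load from being tensorized:

from tvm.script import tir as T

# Illustrative only: roughly the padded load of A after pad_einsum, with the
# i dimension padded from 1 to 16. The guarded read is what tensorize rejects.
@T.prim_func
def a_pad_load(A: T.Buffer((T.int64(1), T.int64(784)), "float32"),
               A_pad: T.Buffer((T.int64(16), T.int64(784)), "float32")):
    for i0, i1 in T.grid(T.int64(16), T.int64(784)):
        with T.block("A_pad"):
            v0, v1 = T.axis.remap("SS", [i0, i1])
            A_pad[v0, v1] = T.if_then_else(v0 < T.int64(1), A[v0, v1], T.float32(0))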
@vinx13 @masahi (this is a follow-up to another post on tensorization, which got answered, so I created this new one)