As we know, `tensorize` is a schedule primitive designed to let users replace a unit of computation with a corresponding hardware intrinsic, which is especially useful on DSL/accelerator backends. I want to try tensorizing part of the loop body of a conv2d with an intrinsic I designed.
Relay IR of the graph:

```
fn (%data: Tensor[(1, 128, 38, 38), float32], %weight: Tensor[(64, 128, 3, 3), float32]) -> Tensor[(1, 64, 36, 36), float32] {
  nn.conv2d(%data, %weight, channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 36, 36), float32] */
}
```
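For reference, the 36×36 output extent in the Relay type follows from the usual convolution size formula (a quick check of my own, not part of the original code):

```python
def conv2d_out_size(in_size, kernel, stride=1, pad=0):
    """Output spatial extent of a convolution along one dimension."""
    return (in_size + 2 * pad - kernel) // stride + 1

# 38x38 input, 3x3 kernel, stride 1, no padding -> 36x36 output
assert conv2d_out_size(38, 3) == 36
```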
The lowered stmt for the conv2d:

```
produce compute {
  for (c_out, 0, 64) {
    for (h, 0, 36) {
      for (w, 0, 36) {
        compute[(((c_out*1296) + (h*36)) + w)] = 0f
        for (rc, 0, 128) {
          for (ry, 0, 3) {
            for (rx, 0, 3) {
              compute[(((c_out*1296) + (h*36)) + w)] = (compute[(((c_out*1296) + (h*36)) + w)] + (placeholder[(((((rc*1444) + (h*38)) + (ry*38)) + w) + rx)]*placeholder[((((c_out*1152) + (rc*9)) + (ry*3)) + rx)]))
            }
          }
        }
      }
    }
  }
}
```
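To read the flattened indices in this stmt: the input access corresponds to `data[0, rc, h + ry, w + rx]` under the input's row-major strides (a small plain-Python sanity check of my own):

```python
# placeholder[rc*1444 + h*38 + ry*38 + w + rx] is the flattened form of
# data[0, rc, h + ry, w + rx] for the (1, 128, 38, 38) input, whose
# row-major element strides are [184832, 1444, 38, 1].
def data_flat_index(rc, ih, iw):
    return rc * 1444 + ih * 38 + iw

for rc, h, ry, w, rx in [(0, 0, 0, 0, 0), (127, 35, 2, 35, 2), (5, 10, 1, 20, 2)]:
    assert rc * 1444 + h * 38 + ry * 38 + w + rx == data_flat_index(rc, h + ry, w + rx)
```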
After fusing [oh, ow], splitting into [Ntohow, tohow] and [Ntoc, toc], and reordering to (n, Ntoc, Ntohow, tohow, toc, ic, kh, kw), the lowered stmt is as below:
```
produce compute {
  for (c_out.outer, 0, 4) {
    for (h.w.fused.outer, 0, 36) {
      for (c_out.inner, 0, 16) {
        for (h.w.fused.inner, 0, 36) {
          compute[((((c_out.outer*20736) + (c_out.inner*1296)) + (h.w.fused.outer*36)) + h.w.fused.inner)] = 0f
          for (rc, 0, 128) {
            for (ry, 0, 3) {
              for (rx, 0, 3) {
                compute[((((c_out.outer*20736) + (c_out.inner*1296)) + (h.w.fused.outer*36)) + h.w.fused.inner)] = (compute[((((c_out.outer*20736) + (c_out.inner*1296)) + (h.w.fused.outer*36)) + h.w.fused.inner)] + (placeholder[(((((rc*1444) + (h.w.fused.outer*38)) + (ry*38)) + rx) + h.w.fused.inner)]*placeholder[(((((c_out.outer*18432) + (c_out.inner*1152)) + (rc*9)) + (ry*3)) + rx)]))
              }
            }
          }
        }
      }
    }
  }
}
```
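As a sanity check that the fuse/split/reorder only re-indexes the loops and does not change output addressing, the two flattened output indices can be compared in plain Python (my own sketch, independent of TVM):

```python
import itertools

def orig_index(c_out, h, w):
    # compute[c_out*1296 + h*36 + w] from the first lowered stmt
    return c_out * 1296 + h * 36 + w

def split_index(co, ci, fo, fi):
    # compute[co*20736 + ci*1296 + fo*36 + fi] after fuse/split/reorder
    return co * 20736 + ci * 1296 + fo * 36 + fi

for c_out, h, w in itertools.product(range(64), range(36), range(36)):
    co, ci = divmod(c_out, 16)       # split c_out by factor 16
    fo, fi = divmod(h * 36 + w, 36)  # fuse (h, w), then split by 36
    assert split_index(co, ci, fo, fi) == orig_index(c_out, h, w)
```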
I want to design an intrinsic, declared with `decl_tensor_intrin`, to replace the six innermost loops with a single intrinsic call. The lowered intrinsic is as below:
```
produce compute {
  for (w, 0, 36) {
    for (c, 0, 16) {
      compute[((c*36) + w)] = 0f
      for (rc, 0, 128) {
        for (ry, 0, 3) {
          for (rx, 0, 3) {
            compute[((c*36) + w)] = (compute[((c*36) + w)] + (data[((((rc*114) + (ry*38)) + w) + rx)]*kernel[((((c*1152) + (rc*9)) + (ry*3)) + rx)]))
          }
        }
      }
    }
  }
}
```
When I try to tensorize at the `tohow` axis with this intrinsic:

```python
schedule_name[out].tensorize(tohow, intrinsic_block_conv2d)
```

lowering reports the following error:

```
File "...vm/src/pass/storage_flatten.cc", line 433
TVMError: Check failed: slice->strides.size() == 0U (4 vs. 0) : Trying to bind compact buffer to strided one strides=[184832, 1444, 38, 1]
```
As far as I know, tensorize is complex: it calls MakeTensorize, VerifyTensorizeBody, and InferBound to verify the pattern before the intrinsic can be applied successfully. So my questions are: what should I do when I meet this error? Are there guidelines for designing the compute pattern so that it fulfills tensorize's requirements? Do I need a thorough understanding of the tensorize mechanism first? Thanks.
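One observation that may narrow this down: the strides in the error message are exactly the row-major strides of the full (1, 128, 38, 38) input, while the intrinsic's access `data[rc*114 + ry*38 + w + rx]` implies a compact (128, 3, 38) buffer with strides [114, 38, 1]. So tensorize is apparently being asked to bind a compact buffer declaration to a strided slice of the full tensor. A plain-Python check of this reading (my own sketch, no TVM required):

```python
def row_major_strides(shape):
    """Element strides of a dense row-major tensor."""
    strides = [1]
    for dim in reversed(shape[1:]):
        strides.insert(0, strides[0] * dim)
    return strides

# strides of the full (1, 128, 38, 38) input -- matches the error message
assert row_major_strides((1, 128, 38, 38)) == [184832, 1444, 38, 1]

# strides implied by the intrinsic's compact (128, 3, 38) data buffer
assert row_major_strides((128, 3, 38)) == [114, 38, 1]
```

If that diagnosis is right, the usual workaround (an assumption on my part, following TVM's tensorize tutorial rather than anything verified against this code) is to declare the intrinsic's input buffer with symbolic strides, e.g. `tvm.decl_buffer(shape, dtype, strides=[tvm.var("s0"), tvm.var("s1"), 1], offset_factor=1)`, and pass it through the `binds` argument of `decl_tensor_intrin`, so the intrinsic body is generated against whatever strides the matched region actually has.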
Another question: can I design `intrin_func` entirely according to my needs? For example, can it simply emit a fixed sequence of command calls, as below?

```python
def intrin_func(ins, outs):
    ib = tvm.ir_builder.create()
    # emit a fixed command sequence instead of a compute pattern
    ib.emit(tvm.stmt.stmt_seq(
        tvm.call_extern("int32", "Command", "read_dma"),
        tvm.call_extern("int32", "Command", "mac"),
        tvm.call_extern("int32", "Command", "write_dma")))
    return ib.get()

return tvm.decl_tensor_intrin(ofm.op, intrin_func, binds={})
```