As we know, `tensorize` is a schedule primitive designed to let users replace a unit of computation with a corresponding hardware intrinsic, which is especially useful on DSL/accelerator backends. I want to try tensorizing part of the loop body of a conv2d with an intrinsic I designed.
Relay IR of the graph:

```
fn (%data: Tensor[(1, 128, 38, 38), float32], %weight: Tensor[(64, 128, 3, 3), float32]) -> Tensor[(1, 64, 36, 36), float32] {
  nn.conv2d(%data, %weight, channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 36, 36), float32] */
}
```
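For reference, the 36×36 output extent in the Relay type follows from the usual convolution size formula (a quick check of my own, not part of the original code):

```python
def conv2d_out_size(in_size, kernel, stride=1, pad=0):
    """Output spatial extent of a convolution along one dimension."""
    return (in_size + 2 * pad - kernel) // stride + 1

# 38x38 input, 3x3 kernel, stride 1, no padding -> 36x36 output
assert conv2d_out_size(38, 3) == 36
```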
The lowered stmt for the conv2d:

```
produce compute {
  for (c_out, 0, 64) {
    for (h, 0, 36) {
      for (w, 0, 36) {
        compute[(((c_out*1296) + (h*36)) + w)] = 0f
        for (rc, 0, 128) {
          for (ry, 0, 3) {
            for (rx, 0, 3) {
              compute[(((c_out*1296) + (h*36)) + w)] = (compute[(((c_out*1296) + (h*36)) + w)] + (placeholder[(((((rc*1444) + (h*38)) + (ry*38)) + w) + rx)]*placeholder[((((c_out*1152) + (rc*9)) + (ry*3)) + rx)]))
            }
          }
        }
      }
    }
  }
}
```
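To read the flattened indices in this stmt: the input access corresponds to `data[0, rc, h + ry, w + rx]` under the input's row-major strides (a small plain-Python sanity check of my own):

```python
# placeholder[rc*1444 + h*38 + ry*38 + w + rx] is the flattened form of
# data[0, rc, h + ry, w + rx] for the (1, 128, 38, 38) input, whose
# row-major element strides are [184832, 1444, 38, 1].
def data_flat_index(rc, ih, iw):
    return rc * 1444 + ih * 38 + iw

for rc, h, ry, w, rx in [(0, 0, 0, 0, 0), (127, 35, 2, 35, 2), (5, 10, 1, 20, 2)]:
    assert rc * 1444 + h * 38 + ry * 38 + w + rx == data_flat_index(rc, h + ry, w + rx)
```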
After fusing [oh, ow], splitting into [Ntohow, tohow] and [Ntoc, toc], and reordering to (n, Ntoc, Ntohow, tohow, toc, ic, kh, kw), the lowered stmt is as below:
```
produce compute {
  for (c_out.outer, 0, 4) {
    for (h.w.fused.outer, 0, 36) {
      for (c_out.inner, 0, 16) {
        for (h.w.fused.inner, 0, 36) {
          compute[((((c_out.outer*20736) + (c_out.inner*1296)) + (h.w.fused.outer*36)) + h.w.fused.inner)] = 0f
          for (rc, 0, 128) {
            for (ry, 0, 3) {
              for (rx, 0, 3) {
                compute[((((c_out.outer*20736) + (c_out.inner*1296)) + (h.w.fused.outer*36)) + h.w.fused.inner)] = (compute[((((c_out.outer*20736) + (c_out.inner*1296)) + (h.w.fused.outer*36)) + h.w.fused.inner)] + (placeholder[(((((rc*1444) + (h.w.fused.outer*38)) + (ry*38)) + rx) + h.w.fused.inner)]*placeholder[(((((c_out.outer*18432) + (c_out.inner*1152)) + (rc*9)) + (ry*3)) + rx)]))
              }
            }
          }
        }
      }
    }
  }
}
```
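As a sanity check that the fuse/split/reorder only re-indexes the loops and does not change output addressing, the two flattened output indices can be compared in plain Python (my own sketch, independent of TVM):

```python
import itertools

def orig_index(c_out, h, w):
    # compute[c_out*1296 + h*36 + w] from the first lowered stmt
    return c_out * 1296 + h * 36 + w

def split_index(co, ci, fo, fi):
    # compute[co*20736 + ci*1296 + fo*36 + fi] after fuse/split/reorder
    return co * 20736 + ci * 1296 + fo * 36 + fi

for c_out, h, w in itertools.product(range(64), range(36), range(36)):
    co, ci = divmod(c_out, 16)       # split c_out by factor 16
    fo, fi = divmod(h * 36 + w, 36)  # fuse (h, w), then split by 36
    assert split_index(co, ci, fo, fi) == orig_index(c_out, h, w)
```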
I want to design an intrinsic, declared with `decl_tensor_intrin`, to replace the six innermost loops with a single intrinsic call. The lowered intrinsic is as below:
```
produce compute {
  for (w, 0, 36) {
    for (c, 0, 16) {
      compute[((c*36) + w)] = 0f
      for (rc, 0, 128) {
        for (ry, 0, 3) {
          for (rx, 0, 3) {
            compute[((c*36) + w)] = (compute[((c*36) + w)] + (data[((((rc*114) + (ry*38)) + w) + rx)]*kernel[((((c*1152) + (rc*9)) + (ry*3)) + rx)]))
          }
        }
      }
    }
  }
}
```
When I try to tensorize at the `tohow` axis with this intrinsic:

```python
schedule_name[out].tensorize(tohow, intrinsic_block_conv2d)
```

lowering reports the following error:

```
File "...vm/src/pass/storage_flatten.cc", line 433
TVMError: Check failed: slice->strides.size() == 0U (4 vs. 0) : Trying to bind compact buffer to strided one strides=[184832, 1444, 38, 1]
```
As far as I know, tensorize is complex: it calls MakeTensorize, VerifyTensorizeBody, and InferBound to verify the pattern before the intrinsic can be applied successfully. So my questions are: what should I do when I meet this error? Are there guidelines for designing the compute pattern so that it fulfills tensorize's requirements? Do I need a thorough understanding of the tensorize mechanism first? Thanks.
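One observation that may narrow this down: the strides in the error message are exactly the row-major strides of the full (1, 128, 38, 38) input, while the intrinsic's access `data[rc*114 + ry*38 + w + rx]` implies a compact (128, 3, 38) buffer with strides [114, 38, 1]. So tensorize is apparently being asked to bind a compact buffer declaration to a strided slice of the full tensor. A plain-Python check of this reading (my own sketch, no TVM required):

```python
def row_major_strides(shape):
    """Element strides of a dense row-major tensor."""
    strides = [1]
    for dim in reversed(shape[1:]):
        strides.insert(0, strides[0] * dim)
    return strides

# strides of the full (1, 128, 38, 38) input -- matches the error message
assert row_major_strides((1, 128, 38, 38)) == [184832, 1444, 38, 1]

# strides implied by the intrinsic's compact (128, 3, 38) data buffer
assert row_major_strides((128, 3, 38)) == [114, 38, 1]
```

If that diagnosis is right, the usual workaround (an assumption on my part, following TVM's tensorize tutorial rather than anything verified against this code) is to declare the intrinsic's input buffer with symbolic strides, e.g. `tvm.decl_buffer(shape, dtype, strides=[tvm.var("s0"), tvm.var("s1"), 1], offset_factor=1)`, and pass it through the `binds` argument of `decl_tensor_intrin`, so the intrinsic body is generated against whatever strides the matched region actually has.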
Another question: can I design `intrin_func` entirely according to my needs? For example, can it simply emit a fixed sequence of command calls, as below?

```python
def intrin_func(ins, outs):
    ib = tvm.ir_builder.create()
    # emit a fixed command sequence instead of a compute pattern
    ib.emit(tvm.stmt.stmt_seq(
        tvm.call_extern("int32", "Command", "read_dma"),
        tvm.call_extern("int32", "Command", "mac"),
        tvm.call_extern("int32", "Command", "write_dma")))
    return ib.get()

return tvm.decl_tensor_intrin(ofm.op, intrin_func, binds={})
```