# Understanding Tensorization Details

I am trying to wrap my head around the tensorization function in order to integrate a custom accelerator, but I am running into a lot of problems and hope that someone can provide some clarity on what is going on under the hood.

I am starting out with a simple matmul:

```python
import tvm
from tvm import te

I = 32
K = 128
J = 64

factor = 16

a_shape = (I, K)
b_shape = (K, J)
c_shape = (I, J)

# calculate A @ B = C
# ENV is my accelerator's environment object (holds dtype strings etc.)
a = te.placeholder(a_shape, dtype="int8", name="a_in")
b = te.placeholder(b_shape, dtype="int8", name="b_in")
c = te.placeholder(c_shape, dtype="int32", name="c_out")

k_o = te.reduce_axis((0, K), name="k_o")

res = te.compute(
    c_shape,
    lambda r_o, c_o: te.sum(
        a[r_o, k_o].astype(ENV.inp_dtype)
        * b[k_o, c_o].astype(ENV.inp_dtype),
        axis=[k_o],
    ),
    name="res",
    tag="dense",
)
```

That is then split into multiple levels:

```python
sch = te.create_schedule(res.op)
outer_i, inner_i = sch[res].split(res.op.axis[0], factor=factor)
outer_j, inner_j = sch[res].split(res.op.axis[1], factor=factor)
outer_k, inner_k = sch[res].split(res.op.reduce_axis[0], factor=factor)
```
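
At this point it helps to look at the loop nest that `tensorize` will later have to match. A quick way to do that with a TE schedule (a minimal sketch, using the tensors defined above):

```python
# Print the split (and later reordered) loop nest as TIR
print(tvm.lower(sch, [a, b, res], simple_mode=True))
```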

In the end, I want to use `tensorize` to map the three innermost loops onto hardware. My initial understanding was that `tensorize` simply tries to match a generic “three nested loops with an upper bound of 16” structure, but that turned out not to be the case. If we reorder the loops like this

```python
sch[res].reorder(outer_i, outer_j, outer_k,
                 inner_k, inner_j, inner_i)
```

And use this intrinsic:

```python
def intrin_gemm(n: int, c: int, m: int):
    """GEMM of NxC and CxM matrices."""
    a_shape = (n, c)
    b_shape = (c, m)
    d_shape = (n, m)

    rc = te.reduce_axis((0, c), name="ric")
    a = te.placeholder(a_shape, dtype=ENV.inp_dtype, name="ifmap_tile")
    b = te.placeholder(b_shape, dtype=ENV.wgt_dtype, name="kernel_tile")
    d = te.placeholder(d_shape, dtype=ENV.inp_dtype, name="out_tile")

    # Buffers with symbolic strides, so the matcher can bind tile slices
    Aa = tvm.tir.decl_buffer(a.shape, a.dtype, name="Ifmap_buf",
                             strides=[te.var("aa_s1"), te.var("aa_s2")])
    Bb = tvm.tir.decl_buffer(b.shape, b.dtype, name="Kernel_buf",
                             strides=[te.var("bb_s1"), te.var("bb_s2")])
    Dd = tvm.tir.decl_buffer(d.shape, d.dtype, name="Out_buf",
                             strides=[te.var("dd_s1"), te.var("dd_s2")])

    res = te.compute(
        d_shape,
        lambda no, mo: te.sum(
            a[no, rc].astype(ENV.inp_dtype)
            * b[rc, mo].astype(ENV.inp_dtype),
            axis=[rc],
        ),
        name="res",
    )

    def intrin_func(ins, outs):
        ifm, ker = ins
        res = outs[0]

        def _body():
            ib = tvm.tir.ir_builder.create()
            ib.emit(tvm.tir.call_extern("", "gemm_kernel"))
            return ib.get()

        def _reduce_reset():
            ib = tvm.tir.ir_builder.create()
            ib.emit(tvm.tir.call_extern("", "gemm_reduce_reset"))
            return ib.get()

        def _reduce_update():
            ib = tvm.tir.ir_builder.create()
            ib.emit(tvm.tir.call_extern("", "gemm_reduce_update"))
            return ib.get()

        return _body(), _reduce_reset(), _reduce_update()

    return te.decl_tensor_intrin(res.op, intrin_func, binds={a: Aa, b: Bb, d: Dd})
```

I get an error: `Cannot bind a compact buffer res to a strided buffer res_slice with strides [64, 1]`. So `tensorize` does require some information about the buffer layout and access pattern. But according to the tensorize tutorial, TVM should be able to figure out the strides on its own if I bind them to a `te.var`, so why is that not the case here?
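
For reference, the pattern I was trying to follow from the tensorize tutorial also binds the intrinsic's output, i.e. the compute result itself rather than a separate placeholder, to a declared buffer with symbolic strides. A minimal sketch of how the last statements of `intrin_gemm` would look under that pattern (buffer and stride names here are illustrative):

```python
# The tutorial binds the compute result `res` as the output buffer,
# rather than the unused placeholder `d` as in the code above.
Cc = tvm.tir.decl_buffer(res.shape, res.dtype, name="Res_buf",
                         strides=[te.var("cc_s1"), te.var("cc_s2")])
return te.decl_tensor_intrin(res.op, intrin_func, binds={a: Aa, b: Bb, res: Cc})
```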

The second thing I do not get is how the loop order affects the outcome of tensorization. If I reorder differently:

`sch[res].reorder(outer_i, outer_j, outer_k, inner_i, inner_j, inner_k)`

I get another error: `TVMError: Bind have an unmet assertion: T.bool(False), on argument tensir_intrin.reduction.extent`. I don’t understand why a different order does not simply result in the same error; I thought `tensorize` just replaces the for loops, regardless of their order. If that is not the case, how exactly does `tensorize` operate?

After some more experimentation with the GEMV example, I also want to ask what exactly `offset_factor` and `elem_offset` denote. If I don’t manually set `offset_factor=1` in the tutorial, I get this error:

`Check failed: (is_zero(value->elem_offset)) is false: Trying to bind a Buffer with offset into one without offset required elem_offset=0, provided elem_offset=i * 64`

I did not find any documentation on how these two values interact, or how their default values are inferred.

Hi brother, did you solve the problem?

I have mostly moved my work over to the TIR scheduling infrastructure by now. What I do know is that tensorization requires a specific loop order: the loop nest that gets replaced must match the axis order of the intrinsic.
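
For the schedule in the question, a minimal sketch of a matching order (assuming the `intrin_gemm` from above: the intrinsic's spatial axes first, in compute order, then its reduce axis):

```python
# Inner loops follow the intrinsic's axis order (no, mo, rc),
# then tensorize at the outermost of the inner loops.
sch[res].reorder(outer_i, outer_j, outer_k, inner_i, inner_j, inner_k)
sch[res].tensorize(inner_i, intrin_gemm(factor, factor, factor))
```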

There is also some documentation on `elem_offset` and `offset_factor` available there: https://tvm.apache.org/docs/reference/api/python/tir.html#tvm.tir.decl_buffer
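
In short, as far as I understand it: `elem_offset` is the offset (in elements) of the buffer's start from the underlying data pointer, and `offset_factor` is the multiple that `elem_offset` is guaranteed to be. With the default `offset_factor=0`, `elem_offset` stays fixed at the constant 0, which is exactly why the bind above fails with `elem_offset=i * 64`. A minimal sketch:

```python
# offset_factor=1 makes decl_buffer create a variable elem_offset,
# so the matcher can bind a tile that starts at e.g. i * 64; the bound
# offset is guaranteed to be a multiple of offset_factor.
Aa = tvm.tir.decl_buffer(
    a.shape, a.dtype, name="Ifmap_buf",
    strides=[te.var("aa_s1"), te.var("aa_s2")],
    offset_factor=1,
)
```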