Can I achieve a 'tensorized' reduction in TVM?

Thanks for your help.

We are now trying to add an accelerator backend to TVM, but we have encountered a problem related to tensorize and reduction. First we defined a "vector add" tensor intrinsic like this:

shape = (16, )
dtype = 'int16'
in1 = tvm.placeholder(shape, dtype, 'in1')
in2 = tvm.placeholder(shape, dtype, 'in2')
out = tvm.compute(shape, lambda i: in1[i] + in2[i], 'out')
# buffers for the three tensors, declared with tvm.decl_buffer
# (details such as scope and strides omitted here)
in1_buf = tvm.decl_buffer(in1.shape, in1.dtype, 'in1_buf')
in2_buf = tvm.decl_buffer(in2.shape, in2.dtype, 'in2_buf')
out_buf = tvm.decl_buffer(out.shape, out.dtype, 'out_buf')

def lower_func(ins, outs):
    din1, din2 = ins[0], ins[1]
    dout = outs[0]
    irb = tvm.ir_builder.create()
    irb.scope_attr(env.nnpu_axis, "coproc_scope", 0)
    irb.emit(tvm.call_extern("int32", 'NNPU_VAddV',
           dout.access_ptr('w', 'uint32'),
           din1.access_ptr('r', 'uint32'),
           din2.access_ptr('r', 'uint32'),
           shape[0],
           ))
    return irb.get()
return tvm.decl_tensor_intrin(out.op, lower_func,
                              name='VAddV',
                              binds={in1: in1_buf,
                                     in2: in2_buf,
                                     out: out_buf})

And then we tried to use this "VAddV" intrin to sum up all rows of a matrix, like this:

shape = (16, 16)
dtype = 'int16'
A = tvm.placeholder(shape, dtype, 'A')
A_buf = ...  # tensor A copied to the accelerator

k = tvm.reduce_axis((0, 16), 'k')
B_buf = tvm.compute((16, ), lambda i: tvm.sum(A_buf[k, i], k), 'B_buf')
B_host = ...  # tensor B_buf copied back to the host

s = tvm.create_schedule(B_host.op)
s[B_buf].reorder(s[B_buf].op.reduce_axis[0], s[B_buf].op.axis[0])
s[B_buf].tensorize(s[B_buf].op.axis[0], env.intrins['VAddV'])

# other unrelated code omitted, such as pragma and set_scope

print(nnpu.lower(s, [A, B_host], simple_mode=True))

Then the lower phase fails with this error:

Traceback (most recent call last):
  File "test_batch_reduce.py", line 52, in <module>
    test()
  File "test_batch_reduce.py", line 28, in test
    print(nnpu.lower(s, [a, b_host], simple_mode=True))
  File "/home/jian/repositories/tvm/nnpu/python/nnpu/build_module.py", line 66, in lower
    return tvm.lower(*args, **kwargs)
  File "/home/jian/environments/tvm-env/local/lib/python2.7/site-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/build_module.py", line 341, in lower
    stmt = schedule.ScheduleOps(sch, bounds)
  File "/home/jian/environments/tvm-env/local/lib/python2.7/site-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/_ffi/_ctypes/function.py", line 185, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/home/jian/environments/tvm-env/local/lib/python2.7/site-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/_ffi/base.py", line 68, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: [10:59:17] /home/jian/repositories/tvm/src/op/tensorize.cc:195: Check failed: inputs.size() == intrin->inputs.size() (1 vs. 2)

The IR code before tensorize is:

produce b_buf {
  for (i.init, 0, 16) {
    a_buf[(i.init + 256)] = (int16)0
  }
  for (k, 0, 16) {
    for (i, 0, 16) {
      a_buf[(i + 256)] = (a_buf[((k*16) + i)] + a_buf[(i + 256)])
    }
  }
}

Even when we return a 3-tuple expressing (body, init, update) from the lowering function, it fails too.

I wonder whether it is possible to do a tensorized reduction in some way with TVM? It seems like a pretty reasonable use case. Thanks sincerely!

Yes, you can. Take a look at the VTA examples, e.g. https://docs.tvm.ai/vta/tutorials/matrix_multiply.html#sphx-glr-vta-tutorials-matrix-multiply-py

It teaches you how to tensorize a reduction (breaking a matmul into a sum of many small matmuls).

Thanks @tqchen for your reply! But the two are somewhat different. What I expect tensorize to generate is code that uses the VAddV intrin to iteratively add every row of the input matrix to the result, similar to a vectorized reduce-sum, like this:

hand-written pseudocode:
produce b_buf {
  NNPU_Init((uint32)256, 0, 16)
  for (k, 0, 16) {
    NNPU_VAddV((uint32)256, (uint32)k*16, (uint32)256)
  }
}

starting from the original IR:

produce b_buf {
  for (i.init, 0, 16) {
    a_buf[(i.init + 256)] = (int16)0
  }
  for (k, 0, 16) {
    for (i, 0, 16) {
      a_buf[(i + 256)] = (a_buf[((k*16) + i)] + a_buf[(i + 256)])
    }
  }
}

Another example is max pooling: if the accelerator has a “vector greater-than merge” instruction, VGTMV, can we use tensorize to do max pooling iteratively with this instruction? It seems that VTA doesn’t support max pooling.

Thanks sincerely!

I understand what you are saying, but the VTA example does relate to what you are asking for.

Specifically, in VTA the basic matmul unit could be a 4x4 matrix times a vector of size 4. Imagine you want to multiply a (4*n)x4 matrix by a vector of size 4; you will need to accumulate over n, generating

output[:] = 0
for (i, 0, n) {
   gemv(output[:], input[4*i : 4*i+4, :], weight[:])
}

Note that here, although the original intrinsic is declared as a 4x4 matrix multiplication, we allow breaking a larger reduction down into this intrinsic.

See also https://docs.tvm.ai/tutorials/language/tensorize.html#reduce-update-for-tensorize
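
The key piece in that tutorial section is a lowering function that returns a 3-tuple (body, reduce_reset, reduce_update) instead of a single statement. Roughly, it looks like this (a condensed sketch of the tutorial's gemv intrinsic, with buffer declarations and the full extern-call argument lists elided; see the linked tutorial for the complete version):

def intrin_func(ins, outs):
    aa, bb = ins
    cc = outs[0]

    def _body():
        ib = tvm.ir_builder.create()
        # full computation: one gemv call that accumulates into cc
        ib.emit(tvm.call_extern("int32", "gemv_update",
                                cc.access_ptr("w"),
                                aa.access_ptr("r"),
                                bb.access_ptr("r")))
        return ib.get()

    def _reduce_reset():
        ib = tvm.ir_builder.create()
        # zero the accumulation buffer before the reduction loop
        ib.emit(tvm.call_extern("int32", "gemv_reset", cc.access_ptr("w")))
        return ib.get()

    def _reduce_update():
        # the update step reuses the accumulating gemv call
        return _body()

    # returning (body, reduce_reset, reduce_update) lets tensorize handle
    # the case where the reduction is split across an outer loop
    return _body(), _reduce_reset(), _reduce_update()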

There is, however, a limitation in the current pattern-matcher: when pattern-matching a computation that contains a reduction dimension, we expect the intrinsic to declare a reduction dimension as well. If you introduce a dummy reduction dimension k (whose range equals (0, 1)) and declare your compute as

input = tvm.placeholder((1, 16), dtype)
k = tvm.reduce_axis((0, 1), "k")
out = tvm.compute((16,), lambda i: tvm.sum(input[k, i], axis=k))

Then declaring the reduce-update expression as in the tutorial might solve your problem.
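
For your VAddV case, the whole intrinsic might then look roughly like this (a sketch only, not tested; the NNPU_VAddV and NNPU_Init calls and their arguments are taken from your snippets above, and the buffers are declared with tvm.decl_buffer as in your original intrin):

# VAddV redeclared with a dummy reduction axis so the pattern-matcher
# can match it against the row-sum reduction (rough sketch)
dtype = 'int16'
in1 = tvm.placeholder((1, 16), dtype, 'in1')   # one row of the input matrix
k = tvm.reduce_axis((0, 1), 'k')               # dummy reduction axis
out = tvm.compute((16,), lambda i: tvm.sum(in1[k, i], axis=k), 'out')
# in1_buf / out_buf declared with tvm.decl_buffer as before

def lower_func(ins, outs):
    din = ins[0]
    dout = outs[0]

    def _update():
        irb = tvm.ir_builder.create()
        irb.scope_attr(env.nnpu_axis, "coproc_scope", 0)
        # out += in, i.e. add one input row into the accumulator
        irb.emit(tvm.call_extern("int32", 'NNPU_VAddV',
                 dout.access_ptr('w', 'uint32'),
                 din.access_ptr('r', 'uint32'),
                 dout.access_ptr('r', 'uint32'),
                 16))
        return irb.get()

    def _reset():
        irb = tvm.ir_builder.create()
        irb.scope_attr(env.nnpu_axis, "coproc_scope", 0)
        # zero the accumulator; signature follows the NNPU_Init pseudocode above
        irb.emit(tvm.call_extern("int32", 'NNPU_Init',
                 dout.access_ptr('w', 'uint32'), 0, 16))
        return irb.get()

    # (body, reduce_reset, reduce_update)
    return _update(), _reset(), _update()

return tvm.decl_tensor_intrin(out.op, lower_func, name='VAddVReduce',
                              binds={in1: in1_buf, out: out_buf})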

also cc @merrymercy @thierry @cowanmeg @yzhliu @xqdan who might have insights into this question

@Jina.Hu Indeed, we don’t have an operator implemented for pooling in VTA, but the hardware supports it; this is something we can implement if you think it would serve as a good example.

We wouldn’t rely on tensorization to implement pooling; rather, we’d implement it as a standard TOPI operator by defining the data access patterns and mapping the arithmetic operations to a VGTMV-like instruction.
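
For reference, the compute declaration of such a pooling operator would be along these lines (a rough sketch with arbitrary shapes; lowering the tvm.max reduction onto a VGTMV-style instruction is the hardware-specific part):

# Rough sketch: 2x2 max pooling over an NCHW tensor, written as a plain
# TVM compute with a max reduction over the pooling window
data = tvm.placeholder((1, 16, 32, 32), 'int16', 'data')
rh = tvm.reduce_axis((0, 2), 'rh')
rw = tvm.reduce_axis((0, 2), 'rw')
pooled = tvm.compute(
    (1, 16, 16, 16),
    lambda n, c, h, w: tvm.max(data[n, c, h * 2 + rh, w * 2 + rw], axis=[rh, rw]),
    name='pooled')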

Thanks for your reply!
The problem is solved by adding a dummy reduction axis to the intrin compute definition. :grinning:

Thanks for your reply!
The problem was solved by adding a dummy reduction axis to the intrin compute definition, as @tqchen mentioned; this may be a limitation of the pattern-matcher.
It would be very helpful to introduce this trick in the tutorial. :grinning:

Contributions to the tutorial are more than welcome!