Thanks for your help.
We are now trying to add an accelerator backend to TVM, but we encountered a problem related to tensorize and reduction. First, we defined a "vector add" tensor intrinsic like this:
```python
shape = (16,)
dtype = 'int16'
in1 = tvm.placeholder(shape, dtype, 'in1')
in2 = tvm.placeholder(shape, dtype, 'in2')
out = tvm.compute(shape, lambda i: in1[i] + in2[i], 'out')
in1_buf, in2_buf, out_buf = ...  # decl_buffers for the three tensors

def lower_func(ins, outs):
    din1, din2 = ins[0], ins[1]
    dout = outs[0]
    irb = tvm.ir_builder.create()
    irb.scope_attr(env.nnpu_axis, "coproc_scope", 0)
    irb.emit(tvm.call_extern("int32", 'NNPU_VAddV',
                             dout.access_ptr('w', 'uint32'),
                             din1.access_ptr('r', 'uint32'),
                             din2.access_ptr('r', 'uint32'),
                             shape[0]))
    return irb.get()

return tvm.decl_tensor_intrin(out.op, lower_func, name='VAddV',
                              binds={in1: in1_buf, in2: in2_buf, out: out_buf})
```
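(The `decl_buffers` line above stands for three `tvm.decl_buffer` calls. For illustration only, they look roughly like the sketch below; the scope string and alignment/offset values are placeholders, not the actual parameters of our accelerator's scratchpad.)

```python
# Illustrative only: scope name and alignment/offset values are placeholders.
in1_buf = tvm.decl_buffer(shape, dtype, 'in1_buf',
                          scope='local.nnpu_scratchpad',
                          data_alignment=2, offset_factor=1)
in2_buf = tvm.decl_buffer(shape, dtype, 'in2_buf',
                          scope='local.nnpu_scratchpad',
                          data_alignment=2, offset_factor=1)
out_buf = tvm.decl_buffer(shape, dtype, 'out_buf',
                          scope='local.nnpu_scratchpad',
                          data_alignment=2, offset_factor=1)
```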
And then we tried to use this "VAddV" intrinsic to sum up all the rows of a matrix, like this:
```python
shape = (16, 16)
dtype = 'int16'
A = tvm.placeholder(shape, dtype, 'A')
A_buf = ...  # tensor A copied to the accelerator

k = tvm.reduce_axis((0, 16), 'k')
B_buf = tvm.compute((16,), lambda i: tvm.sum(A_buf[k, i], axis=k), 'B_buf')
B_host = ...  # tensor B_buf copied back to the host

s = tvm.create_schedule(B_host.op)
s[B_buf].reorder(s[B_buf].op.reduce_axis[0], s[B_buf].op.axis[0])
s[B_buf].tensorize(s[B_buf].op.axis[0], env.intrins['VAddV'])

# other unrelated code is omitted, such as pragma and set_scope
print(nnpu.lower(s, [A, B_host], simple_mode=True))
```
Then the lower phase fails with this error:
```
Traceback (most recent call last):
  File "test_batch_reduce.py", line 52, in <module>
    test()
  File "test_batch_reduce.py", line 28, in test
    print(nnpu.lower(s, [a, b_host], simple_mode=True))
  File "/home/jian/repositories/tvm/nnpu/python/nnpu/build_module.py", line 66, in lower
    return tvm.lower(*args, **kwargs)
  File "/home/jian/environments/tvm-env/local/lib/python2.7/site-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/build_module.py", line 341, in lower
    stmt = schedule.ScheduleOps(sch, bounds)
  File "/home/jian/environments/tvm-env/local/lib/python2.7/site-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/_ffi/_ctypes/function.py", line 185, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/home/jian/environments/tvm-env/local/lib/python2.7/site-packages/tvm-0.5.dev0-py2.7-linux-x86_64.egg/tvm/_ffi/base.py", line 68, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: [10:59:17] /home/jian/repositories/tvm/src/op/tensorize.cc:195: Check failed: inputs.size() == intrin->inputs.size() (1 vs. 2)
```
The IR just before tensorize is:
```
produce b_buf {
  for (i.init, 0, 16) {
    a_buf[(i.init + 256)] = (int16)0
  }
  for (k, 0, 16) {
    for (i, 0, 16) {
      a_buf[(i + 256)] = (a_buf[((k*16) + i)] + a_buf[(i + 256)])
    }
  }
}
```
We also tried making lower_func return a 3-tuple expressing (body, init, update), roughly as in the sketch below, but that fails too.
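For concreteness, the 3-tuple variant looked roughly like this; it is a sketch assuming the accelerator has some reset instruction for the init step (the name `NNPU_VctrMemset` below is purely illustrative, not a real instruction):

```python
def lower_func(ins, outs):
    din1, din2 = ins[0], ins[1]
    dout = outs[0]

    def make_stmt(call):
        # Wrap a single extern call in the coproc scope, as before.
        irb = tvm.ir_builder.create()
        irb.scope_attr(env.nnpu_axis, "coproc_scope", 0)
        irb.emit(call)
        return irb.get()

    # update: dout[i] = din1[i] + din2[i], the same NNPU_VAddV call as before
    update = make_stmt(tvm.call_extern("int32", 'NNPU_VAddV',
                                       dout.access_ptr('w', 'uint32'),
                                       din1.access_ptr('r', 'uint32'),
                                       din2.access_ptr('r', 'uint32'),
                                       shape[0]))
    # init: zero the accumulator; 'NNPU_VctrMemset' is an illustrative name
    init = make_stmt(tvm.call_extern("int32", 'NNPU_VctrMemset',
                                     dout.access_ptr('w', 'uint32'),
                                     0, shape[0]))
    # (body, reduce_init, reduce_update)
    return update, init, update
```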
I wonder: is it possible to do a tensorized reduction in some way with TVM? It seems like a pretty reasonable use case to me. Thanks sincerely!