I am generating C code using tensor expressions (TE). Currently, a lot of intermediate tensor buffers are generated along the way, and I want to know if I can replace them with, e.g., scalar temp variables instead.
Specifically, I am computing Conv, then batch norm (BN), then ReLU. For performance, I essentially want to inline BN and ReLU into the Conv loop (or the Conv and BN loops into the ReLU loop, or whatever arrangement produces a single loop nest). However, I cannot inline the Conv loop into the other loops, because it contains a reduction and TVM emits an error saying reductions are only allowed at the top level of a compute. So I currently inline the BN stage and use compute_at() to merge the Conv loop into the ReLU loop. For example, my tensor expression code looks something like this:
A = te.placeholder(...)
W = te.placeholder(...)
Conv = te.compute(...)  # my conv code from A, W
BN = te.compute(...)    # BatchNorm using Conv
Relu = te.compute(...)  # ReLU over BN
s = te.create_schedule([Relu.op])
s[BN].compute_inline()
s[Conv].compute_at(s[Relu], Relu.op.axis[3])
The generated C code looks something like this:
for (...)
  for (...)
    for (...) {
      Conv[...] = 0;
      for (...)
        for (...)
          for (...)
            Conv[...] += A[...] * W[...];  // Conv reduction
      Relu[...] = ReLU(BN(Conv[...]));
    }
However, I only need the Relu[…] array results, so Conv[…] would be better replaced by something like a scalar variable. I.e., I want the code to be:
for (...)
  for (...)
    for (...) {
      tmp = 0;
      for (...)
        for (...)
          for (...)
            tmp += A[...] * W[...];  // Conv reduction
      Relu[...] = ReLU(BN(tmp));
    }
Is there a simple way to do this with te? The explanation got a bit messy, so I hope this makes sense.
Thank you!