Removing intermediate tensors / replacing them with temporary variables

I am generating C code using tensor expressions (te). Currently, a lot of intermediate tensors are generated along the way, and I would like to know whether I can use, e.g., scalar temporary variables instead.

Specifically, I am computing Conv, then batch norm (BN), then ReLU. For performance, I want to fuse BN and ReLU into the Conv loop (or the Conv and BN loops into the ReLU loop; anything that leaves a single loop nest). However, I cannot inline the Conv loop into the other loops, because it contains a reduction and TVM throws an error saying that reductions are only allowed at the top level of a compute. Thus, I currently inline the BN stage and use compute_at() to merge the Conv loop into the Relu loop. For example, my tensor expression code looks something like this:

from tvm import te

A = te.placeholder(...)
W = te.placeholder(...)
Conv = te.compute(... my conv code from A, W ...)
BN = te.compute(... BatchNorm using Conv ...)
Relu = te.compute(... Relu over BN ...)
s = te.create_schedule([Relu.op])
s[BN].compute_inline()                        # fold BN into the Relu stage
s[Conv].compute_at(s[Relu], Relu.op.axis[3])  # compute Conv inside the Relu loop

The generated C code looks something like this:

for (...)
  for (...)
    for (...)
      Conv[...] = 0;
      for (...)
        for (...)
          for (...)
            Conv[...] += A[...] * W[...]  // Conv loop
      Relu[...] = ReLU(BN(Conv[...]))

However, I only need the Relu[…] array as the result, so Conv[…] would be better replaced by something like a scalar variable. I.e., I want the code to be:

for (...)
  for (...)
    for (...)
      tmp = 0;
      for (...)
        for (...)
          for (...)
            tmp += A[...] * W[...]  // Conv loop
      Relu[...] = ReLU(BN(tmp))
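To make the target concrete, here is a small hand-written C sketch of the loop nest I am after. It is not TVM output. The sizes, the function name conv_bn_relu, and the scale/shift arrays are all made up for illustration, and BN is assumed to be already folded into a per-channel scale and shift (inference form). The batch dimension is left out to keep it short.

```c
/* Hypothetical sizes, chosen only to keep the sketch small. */
enum { IC = 2, OC = 2, IH = 4, IW = 4, KH = 3, KW = 3,
       OH = IH - KH + 1, OW = IW - KW + 1 };

/* Fused Conv -> BN -> ReLU. The convolution result lives in the scalar
 * `tmp`, so no intermediate Conv[] buffer is written or read.
 * Assumption: BN is applied as x * scale[oc] + shift[oc]. */
void conv_bn_relu(const float *A,      /* [IC][IH][IW]     */
                  const float *W,      /* [OC][IC][KH][KW] */
                  const float *scale,  /* [OC]             */
                  const float *shift,  /* [OC]             */
                  float *Relu)         /* [OC][OH][OW]     */
{
    for (int oc = 0; oc < OC; ++oc)
        for (int oy = 0; oy < OH; ++oy)
            for (int ox = 0; ox < OW; ++ox) {
                float tmp = 0.0f;
                /* Conv reduction over input channel and kernel window */
                for (int ic = 0; ic < IC; ++ic)
                    for (int ky = 0; ky < KH; ++ky)
                        for (int kx = 0; kx < KW; ++kx)
                            tmp += A[(ic * IH + oy + ky) * IW + ox + kx]
                                 * W[((oc * IC + ic) * KH + ky) * KW + kx];
                /* BN and ReLU applied directly to the scalar */
                float bn = tmp * scale[oc] + shift[oc];
                Relu[(oc * OH + oy) * OW + ox] = bn > 0.0f ? bn : 0.0f;
            }
}
```

The only difference from the code I currently get is that the accumulator is a local scalar rather than an element of a Conv array.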

Is there a simple way to do this with te? My explanation is a bit messy, so I hope this makes sense.

Thank you!