[BUG] storage rewrite doesn't correctly handle local buffer allocation

In some cases local buffer is allocated in global memory as passed to the kernel as an argument.
When we use local buffer, the original IR is like

produce foo {
  // attr [tmp] storage_scope = "local"
  allocate total[float32 * 1]
  ...
}

After storage rewrite pass, the IR becomes

// attr [tmp] storage_scope = "local"
allocate total[float32 * 1]
produce foo {
 ...
}

This change is because the attach scope of the local buffer is nullptr if it is the first statement inside produce.
This bug only happens in ir builder or hybrid script.
In IR builder we have the workaround that we declare thread scope before allocation:

    ib.scope_attr(tx, "thread_extent", nthread_tx)
    tmp = ib.allocate(...)

But in hybrid we still have this issue.

Here is a reproducible hybrid script:

import tvm
@tvm.hybrid.script
def foo(a):
    c = output_tensor((a.shape[0],), a.dtype)
    total = allocate((1,), a.dtype, 'local')
    len_i = a.shape[0]
    len_j = a.shape[1]
    for i in bind('threadIdx.x', len_i):
        total[0] = 0.
        for k in const_range(len_j):
            total[0] += a[i, k]
        c[i] = total[0]

    return c

a_tvm = tvm.placeholder((8,4), 'float32')
c_tvm = foo(a_tvm)
s = tvm.create_schedule(c_tvm.op)
print(tvm.lower(s, [a_tvm, c_tvm], simple_mode=True))

@were could you take a look? I think this can be fixed in hybrid

Sorry for my late response.

OK. I think I can fix it. I used to suffer from a problem like this, and you gave me a formal toy example to replicate.