How to bind a thread axis after a tensorized cache_read?

Hi,
I want to tensorize a cache_read stage using the extern function “memcpy”. But after this operation, I can’t bind a thread axis any more.
The target device code looks like this:

nram A_nram[128];                  // on-chip NRAM buffers
nram B_nram[128];
memcpy(A_nram, A, 64);             // GDRAM -> NRAM copy
for (int32_t i = 0; i < 4; i++) {
    B_nram[i] = A_nram[i] * 2;
}
memcpy(B, B_nram, 64);             // NRAM -> GDRAM copy

In the schedule, I want to call s[B].bind(ni, threadx), but it raises:
Bind have a unmet assertion: (uint1)0, on argument Aa.shape[0]

Has anyone met this problem before? Please share a solution if you have any ideas. Thank you!
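For reference, the schedule side looks roughly like this. This is a minimal sketch against the TVM 0.x API, using the mlu_cache_read intrinsic defined below; the tensor names, sizes, split factor, and the "threadIdx.x" tag are illustrative, not my exact script:

import tvm

A = tvm.placeholder((128,), name='A', dtype='float32')
B = tvm.compute((128,), lambda i: A[i] * 2, name='B')
s = tvm.create_schedule(B.op)

# Stage that loads A into NRAM; this is the stage I tensorize with __memcpy.
AA = s.cache_read(A, 'nram', [B])
s[AA].tensorize(AA.op.axis[0], mlu_cache_read(128, 'float32'))

# Binding a thread axis afterwards is what triggers the assertion above.
no, ni = s[B].split(B.op.axis[0], factor=32)
s[B].bind(ni, tvm.thread_axis("threadIdx.x"))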

Besides, I wrote the tensorize function as follows:
import tvm

def mlu_cache_read(l, dtype):
    a = tvm.placeholder((l,), name='a', dtype=dtype)
    b = tvm.compute((l,), lambda i: a[i], name='b')
    Aa = tvm.decl_buffer(a.shape, a.dtype,
                         name="Aa",
                         offset_factor=1,
                         strides=[1])
    Bb = tvm.decl_buffer(b.shape, b.dtype,
                         name="Bb",
                         offset_factor=1,
                         strides=[1], scope='nram')

    def cache_read(ins, outs):
        ib = tvm.ir_builder.create()
        aa = ins[0]
        bb = outs[0]
        # __memcpy(dst, src, size_in_bytes, direction); l*4 assumes float32
        ib.emit(tvm.call_extern("int32", "__memcpy",
                bb.access_ptr('w'), aa.access_ptr('r'),
                l * 4, "enum(GDRAM2NRAM)"))
        return ib.get()

    with tvm.build_config(offset_factor=1):
        return tvm.decl_tensor_intrin(b.op, cache_read, binds={a: Aa, b: Bb})
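
One direction I am considering (untested) is declaring the intrinsic’s extent as a symbolic tvm.var instead of the constant l, so that Bind can bind the actual region extent rather than assert it equals a fixed constant. A minimal sketch, assuming a 4-byte element type such as float32:

import tvm

def mlu_cache_read_sym(dtype):
    # Symbolic extent: the intrinsic should now match any region size,
    # instead of asserting Aa.shape[0] against a compile-time constant.
    l = tvm.var('l')
    a = tvm.placeholder((l,), name='a', dtype=dtype)
    b = tvm.compute((l,), lambda i: a[i], name='b')
    Aa = tvm.decl_buffer(a.shape, a.dtype, name='Aa',
                         offset_factor=1, strides=[1])
    Bb = tvm.decl_buffer(b.shape, b.dtype, name='Bb',
                         offset_factor=1, strides=[1], scope='nram')

    def cache_read(ins, outs):
        ib = tvm.ir_builder.create()
        aa, bb = ins[0], outs[0]
        # Byte count comes from the bound buffer shape; the factor 4
        # assumes a 4-byte dtype (e.g. float32).
        ib.emit(tvm.call_extern('int32', '__memcpy',
                                bb.access_ptr('w'), aa.access_ptr('r'),
                                aa.shape[0] * 4, 'enum(GDRAM2NRAM)'))
        return ib.get()

    with tvm.build_config(offset_factor=1):
        return tvm.decl_tensor_intrin(b.op, cache_read, binds={a: Aa, b: Bb})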

What’s the progress on this problem?
Interesting!