How do I use prefetch with GPU codegen?

I want to prefetch data from global memory so that the memory operation can be overlapped with computation. I found the prefetch command in TVM, but haven’t seen any example using it.

I tried it with a simple piece of code. It lowers to TVM IR fine, but it errors out when generating CUDA code:

```python
import tvm
from tvm import te

N = 1024
A = te.placeholder((N,), name='A', dtype='float32')
B = te.compute((N,), lambda n: A[n] * 2 + 1, name='add')
s = te.create_schedule(B.op)
n = B.op.axis[0]
t, ti = s[B].split(n, factor=256)
ti, to = s[B].split(ti, factor=16)
s[B].bind(t, te.thread_axis('blockIdx.x'))
s[B].bind(to, te.thread_axis('threadIdx.x'))
s[B].prefetch(A, ti, 1)
print(tvm.lower(s, [A, B], simple_mode=True))
func = tvm.build(s, [A, B], 'cuda')
dev_module = func.imported_modules[0]
```


```
tvm.ffi.base.TVMError: Traceback (most recent call last):
  [bt] (8)–(0): frames in /home/yuankai.chen/opensource/tvm_2/tvm/build/ ...
  File "/home/yuankai.chen/opensource/tvm_2/tvm/src/target/source/", line 633
TVMError: Unresolved intrinsic prefetch with return type float32
```

I am facing the same problem when prefetching data from shared memory to registers.

Did you solve this?

prefetch is not supported on most GPUs. In most cases cache prefetching happens automatically, and you can instead use cache_read to stage the data into shared memory; see examples in