How do I use prefetch with GPU codegen?

I want to prefetch data from global memory so that the memory transfer can be overlapped with computation. I found the prefetch schedule primitive in TVM, but I haven't seen any example that uses it.

I tried it on a simple piece of code; it lowers to TVM IR fine, but it errors out when generating CUDA code.

import tvm
from tvm import te

N = 1024
A = te.placeholder((N,), name='A', dtype='float32')
B = te.compute((N,), lambda n: A[n] * 2 + 1, name='add')
s = te.create_schedule(B.op)
n = B.op.axis[0]
t, ti = s[B].split(n, factor=256)   # t: blocks, ti: 256 elements per block
ti, to = s[B].split(ti, factor=16)  # ti: outer chunk, to: 16 lanes
s[B].bind(t, te.thread_axis('blockIdx.x'))
s[B].bind(to, te.thread_axis('threadIdx.x'))
s[B].prefetch(A, ti, 1)             # prefetch A one iteration of ti ahead
with tvm.target.create("cuda"):
    print(tvm.lower(s, [A, B], simple_mode=True))
    func = tvm.build(s, [A, B])
    dev_module = func.imported_modules[0]
    print(dev_module.get_source())

Error:

tvm._ffi.base.TVMError: Traceback (most recent call last):
[bt] (8) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenC::VisitStmt(tvm::tir::ForNode const*)+0x1cf) [0x7fc2dd650f0f]
[bt] (7) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::NodeFunctor<void (tvm::runtime::ObjectRef const&, tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>)>::operator()(tvm::runtime::ObjectRef const&, tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>) const+0x3f) [0x7fc2dd2d65ff]
[bt] (6) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::EvaluateNode const*)+0x80) [0x7fc2dd662bb0]
[bt] (5) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::EvaluateNode const*)+0x7a) [0x7fc2dd6533da]
[bt] (4) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&)+0x36) [0x7fc2dd65ca16]
[bt] (3) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)+0x9d) [0x7fc2dd64b7cd]
[bt] (2) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)+0x390) [0x7fc2dd666df0]
[bt] (1) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)+0x70) [0x7fc2dd653a40]
[bt] (0) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x35) [0x7fc2dd1ed565]
File "/home/yuankai.chen/opensource/tvm_2/tvm/src/target/source/codegen_c.cc", line 633
TVMError: Unresolved intrinsic prefetch with return type float32

I am facing the same problem when prefetching data from shared memory to registers.

Did you solve this?

prefetch is not supported on most GPUs; in most cases cache prefetching happens automatically in hardware. Instead, you can use cache_read to stage the data into shared memory. See the examples in https://tvm.apache.org/docs/tutorials/optimize/opt_conv_cuda.html#sphx-glr-tutorials-optimize-opt-conv-cuda-py
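For the example above, a minimal sketch of that pattern might look like the following (untested; the cache stage AA, the cooperative-load split, and the thread counts are my own choices, and an elementwise op like this won't really benefit from shared memory, so this only illustrates the mechanics):

import tvm
from tvm import te

N = 1024
A = te.placeholder((N,), name='A', dtype='float32')
B = te.compute((N,), lambda n: A[n] * 2 + 1, name='add')
s = te.create_schedule(B.op)

# Stage A through shared memory instead of prefetching.
AA = s.cache_read(A, 'shared', [B])

t, ti = s[B].split(B.op.axis[0], factor=256)  # t: blocks, ti: 256 per block
ti, to = s[B].split(ti, factor=16)            # to: 16 threads per block
s[B].bind(t, te.thread_axis('blockIdx.x'))
s[B].bind(to, te.thread_axis('threadIdx.x'))

# Attach the shared-memory copy at the block loop and spread the
# 256-element tile load across the block's 16 threads.
s[AA].compute_at(s[B], t)
_, tx = s[AA].split(s[AA].op.axis[0], factor=16)
s[AA].bind(tx, te.thread_axis('threadIdx.x'))

with tvm.target.create('cuda'):
    print(tvm.lower(s, [A, B], simple_mode=True))
    func = tvm.build(s, [A, B])
    print(func.imported_modules[0].get_source())

The key difference from prefetch is that cache_read creates an explicit copy stage: compute_at attaches it at the block loop, and binding its inner axis to threadIdx.x makes all 16 threads of a block cooperate in filling the shared tile before the compute loop runs.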