How to I use prefetch with GPU codegen

Yuankai · July 14, 2020, 5:07am

I want to prefetch data from global memory so that the memory operation can be overlapped with computation. I found the prefetch command in TVM, but haven’t seen any example using it.

I tried with a simple piece of code, and it created TVM IR fine, but it errors out when generating CUDA code.

N = 1024
A = te.placeholder((N,), name='A', dtype='float')
B = te.compute((N,), **lambda** n: A[n]*2+1, name='add')
s = te.create_schedule(B.op)
n = B.op.axis[0]
t, ti = s[B].split(n, factor=256)
ti, to = s[B].split(ti, factor=16)
s[B].bind(t, te.thread_axis('blockIdx.x'))
s[B].bind(to, te.thread_axis('threadIdx.x'))
s[B].prefetch(A, ti, 1)
with tvm.target.create("cuda"):
     print(tvm.lower(s, [A, B], simple_mode=True))
     func = tvm.build(s, [A, B])
     dev_module = func.imported_modules[0]
     print(dev_module.get_source())

Error:

tvm.ffi.base.TVMError: Traceback (most recent call last): [bt] (8) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenC::VisitStmt(tvm::tir::ForNode const*)+0x1cf) [0x7fc2dd650f0f] [bt] (7) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::NodeFunctor<void (tvm::runtime::ObjectRef const&, tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>)>::operator()(tvm::runtime::ObjectRef const&, tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>) const+0x3f) [0x7fc2dd2d65ff] [bt] (6) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::EvaluateNode const*)+0x80) [0x7fc2dd662bb0] [bt] (5) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::EvaluateNode const*)+0x7a) [0x7fc2dd6533da] [bt] (4) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&)+0x36) [0x7fc2dd65ca16] [bt] (3) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)+0x9d) [0x7fc2dd64b7cd] [bt] (2) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)+0x390) [0x7fc2dd666df0] [bt] (1) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)+0x70) [0x7fc2dd653a40] [bt] (0) /home/yuankai.chen/opensource/tvm_2/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x35) [0x7fc2dd1ed565] File “/home/yuankai.chen/opensource/tvm_2/tvm/src/target/source/codegen_c.cc”, line 633 TVMError: Unresolved intrinsic prefetch with return type float32

Maximilianxu · October 13, 2020, 5:17am

I am facing the same problem with prefetching data from shared mem to registers.

Did you solve this?

tqchen · October 13, 2020, 1:18pm

prefetch is not supported in most GPUs. Most cases cache prefetching happens automatically and you can use cache_read to get the data into shared meory, see examples in https://tvm.apache.org/docs/tutorials/optimize/opt_conv_cuda.html#sphx-glr-tutorials-optimize-opt-conv-cuda-py