Matrix multiplication example for CUDA

Hi! I have been studying how TVM works and tried the tune_simple_template.py tutorial example from the website (https://github.com/apache/incubator-tvm/blob/master/tutorials/autotvm/tune_simple_template.py). Running this example with the cuda (or OpenCL) target produces errors like:

Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('tutorial/matmul', 512, 512, 512, 'float32'). A fallback configuration is used, which may bring great performance regression.
Traceback (most recent call last):
  File "tune_simple_template.py", line 321, in <module>
    func = tvm.build(s, arg_bufs)
  File "/root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/driver/build_module.py", line 413, in build
    mod_host, mdev = _build_for_device(input_mod, tar, target_host)
  File "/root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/driver/build_module.py", line 255, in _build_for_device
    mod_mixed = tvm.transform.Sequential(opt_mixed)(mod_mixed)
  File "/root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/ir/transform.py", line 127, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 321, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 256, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 245, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 160, in tvm._ffi._cy3.core.CALL
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (5) /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(TVMFuncCall+0x65) [0x7f0f613a6035]
  [bt] (4) /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(+0x6d4af6) [0x7f0f6097caf6]
  [bt] (3) /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x2c8) [0x7f0f6097b8f8]
  [bt] (2) /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x12f) [0x7f0f6097c5af]
  [bt] (1) /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(+0x8c352d) [0x7f0f60b6b52d]
  [bt] (0) /root/.local/lib/python3.6/site-packages/tvm-0.8.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(+0x8c00a2) [0x7f0f60b680a2]
  Did you forget to bind?
    Variable B is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable A is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable C is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable C is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable C is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
  File "/local/incubator-tvm/src/tir/analysis/verify_memory.cc", line 202
RuntimeError: Memory verification failed with the following errors:
PrimFunc([A, B, C]) attrs={"global_symbol": "default_function", "tir.noalias": (bool)1, "target": cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32} {
  for (i.outer, 0, 512) {
    for (j.outer, 0, 512) {
      C[((i.outer*512) + j.outer)] = 0f
      for (k, 0, 512) {
        C[((i.outer*512) + j.outer)] = (C[((i.outer*512) + j.outer)] + (A[((i.outer*512) + k)]*B[((k*512) + j.outer)]))
      }
    }
  }
}

Is there a quick fix I can make so that this example demonstrates GEMM optimization on GPUs? Any pointers are appreciated!

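Kernels running on the GPU require every memory access to happen inside a loop that is bound to a GPU block or thread index. The schedule in the file you are looking at does not do any thread binding, which is why memory verification fails with "Did you forget to bind?". I suggest looking at this tutorial, which binds its loop axes to CUDA blocks and threads: https://tvm.apache.org/docs/tutorials/optimize/opt_conv_cuda.html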
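To make "thread binding" concrete, here is a minimal pure-Python sketch of what a bound GEMM schedule does. It does not call TVM. It only emulates the idea that each output row/column loop is split into an outer part (playing the role of a block index) and an inner part (playing the role of a thread index), so that every access to A, B, and C happens inside a per-thread body. In TVM's `te` schedule API the corresponding step is to `split` the axes and `bind` them to `te.thread_axis("blockIdx.x")`, `te.thread_axis("threadIdx.x")`, and so on, as the linked tutorial does. The names `N`, `TILE`, `kernel`, and `launch` are illustrative assumptions, not part of TVM or of the tutorial.

```python
# Pure-Python emulation of a thread-bound GEMM. Nothing here is TVM API;
# N, TILE, kernel, and launch are hypothetical names chosen for illustration.

N = 8      # matrix size (the post uses 512; kept small so this runs quickly)
TILE = 4   # assumed number of threads per block along each axis


def matmul_reference(A, B):
    """Unbound loop nest, shaped like the one printed in the error message."""
    C = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]
    return C


def kernel(A, B, C, block_idx, thread_idx):
    """Work done by one emulated GPU thread: it owns exactly one element of C."""
    # The outer half of each split axis plays the role of a block index,
    # the inner half plays the role of a thread index.
    i = block_idx[0] * TILE + thread_idx[0]
    j = block_idx[1] * TILE + thread_idx[1]
    acc = 0.0
    for k in range(N):  # the reduction axis stays a serial loop inside the thread
        acc += A[i][k] * B[k][j]
    C[i][j] = acc


def launch(A, B):
    """Emulate a grid launch by visiting every (block, thread) pair serially."""
    C = [[0.0] * N for _ in range(N)]
    blocks = N // TILE
    for by in range(blocks):
        for bx in range(blocks):
            for ty in range(TILE):
                for tx in range(TILE):
                    kernel(A, B, C, (by, bx), (ty, tx))
    return C


A = [[float(i + j) for j in range(N)] for i in range(N)]
B = [[float((i * j) % 3) for j in range(N)] for i in range(N)]

# The bound version computes the same result as the plain loop nest.
assert launch(A, B) == matmul_reference(A, B)
```

On a real GPU the four loops in `launch` do not exist as loops: the hardware supplies the block and thread indices, and only the body of `kernel` is compiled for the device. The loop nest in the error message above has no such binding, so TVM sees plain host-side loops touching A, B, and C and rejects it.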