Matrix multiplication example for CUDA

tkonolige · October 5, 2020, 8:42pm

Kernels running on the GPU require every memory access to sit inside a loop that is bound to a thread or block index. The file you are looking at does not do any thread binding. I suggest looking at this tutorial, which shows how to bind loop axes for a CUDA target: https://tvm.apache.org/docs/tutorials/optimize/opt_conv_cuda.html
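
To make "thread binding" a bit more concrete, here is a rough plain-Python sketch of what it means for a matrix multiplication. It is an emulation only, not TVM code: the function name `matmul_bound`, the `tile` parameter, and the loop variable names are all made up for illustration. The idea is that the two output loops are each split into an outer and an inner part, and those parts take the role of CUDA's `blockIdx` and `threadIdx`. In a TVM schedule the equivalent step is done with calls like `s[C].bind(axis, te.thread_axis("blockIdx.x"))`, as in the tutorial linked above.

```python
# Plain-Python emulation of a 2-D GPU launch. All names here are
# illustrative assumptions, not part of the TVM or CUDA APIs.

def matmul_bound(A, B, n, tile=2):
    """Compute C = A * B for n x n nested lists, one C element per emulated thread."""
    # Assumption for this sketch: the matrix size divides evenly into tiles.
    assert n % tile == 0, "sketch assumes n is a multiple of tile"
    C = [[0.0] * n for _ in range(n)]
    grid = n // tile  # number of blocks along each dimension

    # On a real GPU these four loops are not executed sequentially:
    # every (block, thread) combination runs as its own parallel thread.
    for block_y in range(grid):              # plays the role of blockIdx.y
        for block_x in range(grid):          # plays the role of blockIdx.x
            for thread_y in range(tile):     # plays the role of threadIdx.y
                for thread_x in range(tile): # plays the role of threadIdx.x
                    # Each thread derives the output element it owns
                    # from its block and thread indices.
                    i = block_y * tile + thread_y
                    j = block_x * tile + thread_x
                    # The reduction over k stays a serial loop inside the thread.
                    acc = 0.0
                    for k in range(n):
                        acc += A[i][k] * B[k][j]
                    C[i][j] = acc
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
    identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
    # Multiplying by the identity should give A back unchanged.
    print(matmul_bound(A, identity, 4) == A)
```

Every read of `A` and `B` and every write to `C` happens inside the block and thread loops, which is the property the GPU backend checks for. A schedule with no binding has loops that belong to no thread or block, so there is nothing to map them onto.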