The following is a matrix multiplication where I only care about the diagonal band of the result. I have two questions:
1- Any suggestions how to optimize this code for GPU?
2- The formula in tvm.compute accesses locations that are outside the memory allocated for X and Y, and surprisingly, the code works fine and doesn’t crash (it just writes some garbage values in certain locations in the result tensor Z which is expected. Is this expected? Can I rely on this that the code won’t give segfault in the future? is there a better solution?
import tvm
W = 128
n = tvm.convert(16304)
m = tvm.convert(64)
c = tvm.convert(2 * W + 1)
X = tvm.placeholder((n, m), name='X')
Y = tvm.placeholder((n, m), name='Y')
k = tvm.reduce_axis((0, m), name='k')
Z = tvm.compute((n, c), lambda i, j : tvm.sum(X[i, k] * Y[ i + j - W, k], axis=k), name='Z')
s = tvm.create_schedule(Z.op)
s[Z].bind(Z.op.axis[0], tvm.thread_axis("blockIdx.x")) # not great binding
s[Z].bind(Z.op.axis[1], tvm.thread_axis("blockIdx.y")) # not great binding
fmm = tvm.build(s, [X, Y, Z], 'cuda', target_host='llvm', name='fmm')
The second question,:
In my practice, it will crash in most time if using target cpu, and it’s unpredictable using target cuda.
So I’m also wondering the interpretation about how tvm handle the boundary conditions?
Below is a gauss blur example, both input and output shape need to be (1080,1920), template shape is (3,3), the memory access of A will over the boundary and cause invalid memory error.
M = 1080
N = 1920
KH = 3
KW = 3
A = tvm.placeholder((M, N), name='A', dtype=dtype)
K = tvm.placeholder((KH, KW), name='K', dtype=dtype)
ry = tvm.reduce_axis((0, KH), name='ry')
rx = tvm.reduce_axis((0, KW), name='rx')
C = tvm.compute((M, N), lambda i, j: tvm.sum((A[i + ry - 1, j + rx - 1] *K[ry, rx]), axis=[ry, rx]), name='C')
If i want to avoid over boundary, i can only change the C shape to (1080-2,1920-2) like this:
C = tvm.compute((M-2, N-2), lambda i, j: tvm.sum((A[i + ry, j + rx] *K[ry, rx]), axis=[ry, rx]), name=‘C’)
And at last do a if_then_else transformation for C from shape (1080-2,1920-2) to (1080,1920),this always cause performance down on CPU.
Another way is do if_else_then in compute C within tvm.sum(if_then_else can’t be used outside the tvm.sum), this method also cause performance down on CPU.