CUTLASS Support

@masahi I tried running the CUTLASS code and compared its output against PyTorch's batched matrix multiplication, but the outputs don’t match.

The input is provided to the CUTLASS kernel as follows:

x_np = np.random.uniform(-1, 1, (bsz, d1, d3)).astype("float32")
y_np = np.random.uniform(-1, 1, (bsz, d2, d3)).astype("float32")

x = tvm.nd.array(x_np, device=dev)
y = tvm.nd.array(y_np, device=dev)

To provide the same input to PyTorch, I used the following code:

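# Tensor.to() returns a new tensor rather than moving it in place, so keep the result
a = torch.from_numpy(x_np).to("cuda")
b = torch.from_numpy(y_np).to("cuda")

# Batched x @ y^T, copied back to the CPU for comparison with the TVM output
res = torch.matmul(a, b.transpose(dim0=1, dim1=2)).cpu()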

The output of the CUTLASS kernel is a tvm.nd.NDArray, so I converted it to a torch tensor:

cutlass_res = torch.from_numpy(rt_mod.get_output(0).numpy())

When I compared the two outputs, the result was False, even though they have the same shape:

torch.equal(res, cutlass_res)

Out[]: False
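
As a side note on the comparison itself: torch.equal requires bit-identical values, and two float32 kernels that accumulate in a different order will usually differ in the last few bits even when both are correct. A tolerance-based check is a gentler test. The sketch below uses NumPy only; the helper name outputs_match, the shapes, and the rtol/atol values are illustrative assumptions, not values taken from this thread or from TVM.

```python
import numpy as np

def outputs_match(expected, actual, rtol=1e-5, atol=1e-5):
    """Return True when two arrays have the same shape and agree within tolerance."""
    expected = np.asarray(expected)
    actual = np.asarray(actual)
    if expected.shape != actual.shape:
        return False
    return bool(np.allclose(actual, expected, rtol=rtol, atol=atol))

# Illustrative shapes (bsz=2, d1=4, d2=3, d3=8), not the ones used in the post
rng = np.random.default_rng(0)
x_np = rng.uniform(-1, 1, (2, 4, 8)).astype("float32")
y_np = rng.uniform(-1, 1, (2, 3, 8)).astype("float32")

# Reference result: batched x @ y^T, shape (2, 4, 3)
ref = np.matmul(x_np, y_np.transpose(0, 2, 1))

# Stand-in for a second kernel's output: the same values with a tiny perturbation
perturbed = ref + np.float32(1e-6)

print("bit-identical:", np.array_equal(ref, perturbed))
print("within tolerance:", outputs_match(ref, perturbed))
```

With real outputs, the second argument would be the array from rt_mod.get_output(0).numpy(). A mismatch that survives a reasonable tolerance points to a real problem rather than rounding.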

Do you have any idea why the results don’t match?

Solved: the problem went away after I removed the temp directory that is created when the profile and build functions are invoked.
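
For anyone hitting the same thing, here is a minimal sketch of clearing that directory before re-running profiling and build, so that no files from an earlier run are reused. The helper name clear_stale_build_dir is hypothetical, and the directory path is an assumption: pass whatever temp directory your own profile and build calls are configured to use.

```python
import os
import shutil
import tempfile

def clear_stale_build_dir(tmp_dir):
    """Delete tmp_dir and everything in it. Return True if a directory was removed."""
    if os.path.isdir(tmp_dir):
        shutil.rmtree(tmp_dir)
        return True
    return False

# Demonstration on a throwaway directory standing in for the real temp directory
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "stale_kernel.cu"), "w") as f:
    f.write("// leftover from an earlier run\n")

removed = clear_stale_build_dir(demo_dir)
print("removed:", removed, "still exists:", os.path.exists(demo_dir))
```

Call it once before the profile and build step. The helper does nothing and returns False when the directory is absent, so it is safe to call on a clean checkout too.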