@masahi I tried running the CUTLASS code and compared it against PyTorch batched matrix multiplication, but for some reason the outputs don’t match.
The input to the CUTLASS kernel is provided as follows:
x_np = np.random.uniform(-1, 1, (bsz, d1, d3)).astype("float32")
y_np = np.random.uniform(-1, 1, (bsz, d2, d3)).astype("float32")
x = tvm.nd.array(x_np, device=dev)
y = tvm.nd.array(y_np, device=dev)
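For reference, the operation both kernels should compute is a batched x @ yᵀ. A minimal NumPy sketch (the dims here are made up, since the actual bsz/d1/d2/d3 values aren’t shown):

```python
import numpy as np

# Hypothetical small dims; the real bsz, d1, d2, d3 aren't shown above
bsz, d1, d2, d3 = 2, 4, 5, 3
x_np = np.random.uniform(-1, 1, (bsz, d1, d3)).astype("float32")
y_np = np.random.uniform(-1, 1, (bsz, d2, d3)).astype("float32")

# Batched x @ y^T: (bsz, d1, d3) @ (bsz, d3, d2) -> (bsz, d1, d2)
ref = np.matmul(x_np, y_np.transpose(0, 2, 1))
print(ref.shape)  # (2, 4, 5)
```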
To provide the same input to the PyTorch kernel, I used the following code:
a = torch.from_numpy(x_np)
b = torch.from_numpy(y_np)
a = a.to("cuda")  # Tensor.to is not in-place; rebind to keep the CUDA copy
b = b.to("cuda")
res = torch.matmul(a, b.transpose(dim0=1, dim1=2)).cpu()  # back to CPU for comparison
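One thing worth double-checking here: `Tensor.to` is not an in-place operation, so `a.to("cuda")` on its own returns a new tensor and leaves `a` where it was unless the result is rebound. The same behavior, demonstrated with a dtype move so it runs without a GPU:

```python
import torch

t = torch.ones(2, dtype=torch.float32)
t.to(torch.float64)      # returns a new tensor; t itself is untouched
print(t.dtype)           # torch.float32
t = t.to(torch.float64)  # rebinding is needed to keep the converted tensor
print(t.dtype)           # torch.float64
```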
Now, the output of the CUTLASS kernel is a tvm.nd.NDArray, so I converted it to a torch tensor:
cutlass_res = torch.from_numpy(rt_mod.get_output(0).numpy())
When I compare the two outputs, the check returns False, even though they have the same shape:
torch.equal(res, cutlass_res)
Out[]: False
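Note that torch.equal requires bitwise-identical values, so two floating-point implementations that differ only in accumulation/rounding order will compare unequal even when the results are numerically close; torch.allclose with a tolerance is the usual check. A small illustration of the difference:

```python
import torch

# Classic double-precision rounding: 0.1 + 0.2 is not bit-for-bit equal to 0.3
a = torch.tensor([0.1], dtype=torch.float64) + torch.tensor([0.2], dtype=torch.float64)
b = torch.tensor([0.3], dtype=torch.float64)
print(torch.equal(a, b))     # False: the values differ in the last bit
print(torch.allclose(a, b))  # True: within the default rtol/atol
```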
Do you have any idea why the results don’t match?