I need to write an fp16 matmul that accumulates in fp32.
The code is similar to cublasGemmEx when the compute type is set to CUDA_R_32F.
I wrote the compute like this (see the code below). But after tuning for 2000 trials, the performance is no better than cuBLAS. With m, n, k set to 15360, 1, 5120, cuBLAS takes 0.203 ms while TVM takes 0.222 ms.
Did I write the right implementation, or is it because I didn't use Tensor Cores?
How can I use Tensor Cores to achieve this? Thank you!
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def matmul_layer(m, n, k):
    A = te.placeholder((m, k), name="A", dtype="float16")
    B = te.placeholder((k, n), name="B", dtype="float16")
    rk = te.reduce_axis((0, k), name="rk")
    # Cast the fp16 inputs to fp32 so the reduction accumulates in fp32.
    C = te.compute(
        (m, n),
        lambda i, j: te.sum(A[i, rk].astype("float32") * B[rk, j].astype("float32"), axis=rk),
        name="C",
    )
    # Cast the fp32 accumulator back to fp16 for the output.
    D = te.compute((m, n), lambda i, j: C[i, j].astype("float16"), name="D")
    return [A, B, D]
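
For reference, this is roughly how I drive the tuning, following the standard auto_scheduler flow (the log file name is arbitrary and the target string is just "cuda" without any arch flags):

target = tvm.target.Target("cuda")
task = auto_scheduler.SearchTask(
    func=matmul_layer, args=(15360, 1, 5120), target=target
)

log_file = "matmul.json"  # arbitrary log file name
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=2000,  # the 2000 tuning trials mentioned above
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)
task.tune(tune_option)

# Apply the best schedule found during tuning and build the kernel.
sch, args = task.apply_best(log_file)
func = tvm.build(sch, args, target)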