I need to write an fp16 matmul that accumulates in fp32.
The code is similar to cublasGemmEx when the compute type is set to CUDA_R_32F.
I wrote the compute like this (see the code below). But after tuning for 2000 trials, the performance is no better than cuBLAS. With m, n, k set to 15360, 1, 5120, cuBLAS takes 0.203 ms while TVM takes 0.222 ms.
Did I write the right implementation, or is it because I didn't use Tensor Cores?
How can I use Tensor Cores to achieve this? Thank you!
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def matmul_layer(m, n, k):
    A = te.placeholder((m, k), name="A", dtype="float16")
    B = te.placeholder((k, n), name="B", dtype="float16")
    rk = te.reduce_axis((0, k), name="rk")
    # Cast the fp16 inputs to fp32 so the reduction accumulates in fp32.
    C = te.compute(
        (m, n),
        lambda i, j: te.sum(A[i, rk].astype("float32") * B[rk, j].astype("float32"), axis=rk),
        name="C",
    )
    # Cast the fp32 accumulator back to fp16 for the output.
    D = te.compute((m, n), lambda i, j: C[i, j].astype("float16"), name="D")
    return [A, B, D]
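
For reference, this is roughly how I drive the tuning, following the standard auto_scheduler flow (the log file name is arbitrary and the target string is just "cuda" without any arch flags):

target = tvm.target.Target("cuda")
task = auto_scheduler.SearchTask(
    func=matmul_layer, args=(15360, 1, 5120), target=target
)

log_file = "matmul.json"  # arbitrary log file name
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=2000,  # the 2000 tuning trials mentioned above
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)
task.tune(tune_option)

# Apply the best schedule found during tuning and build the kernel.
sch, args = task.apply_best(log_file)
func = tvm.build(sch, args, target)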