Hi everyone, I’d like to start a discussion about AMD Matrix Core. As we know, AMD Matrix Core is a mixed-precision matrix-compute hardware accelerator, similar to NVIDIA Tensor Core, and both bring significant speedups to machine learning workloads. Since TVM already supports a wide range of datatypes and platforms, I think it would be very interesting to explore Matrix Core integration, especially because AMD vendor libraries such as rocBLAS and Composable Kernel still have poor performance on Matrix Core computation. That is what I want to work on now.
Based on my observations of the TVM ROCm backend and some ROCm programming experience, I think there are two possible routes to choose from:
- Continue with the current LLVM IR codegen: LLVM IR now supports Matrix Core. However, I’m curious why TVM’s ROCm code generation uses LLVM IR as its backend.
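For context on the first route: LLVM’s AMDGPU backend exposes Matrix Core through the `llvm.amdgcn.mfma.*` intrinsics, so this option means having TVM’s LLVM codegen emit calls like the one below. This is a minimal hand-written sketch, not TVM output, and the exact intrinsic name and tile shape depend on the target gfx architecture (e.g. gfx908/gfx90a):

```llvm
; One MFMA step of a 16x16x16 fp16 multiply into an fp32 accumulator on
; CDNA GPUs. Each lane of the wavefront holds a <4 x half> slice of A and B
; and a <4 x float> slice of the accumulator.
declare <4 x float> @llvm.amdgcn.mfma.f32.16x16x16f16(
    <4 x half>, <4 x half>, <4 x float>, i32 immarg, i32 immarg, i32 immarg)

define <4 x float> @tile_mma(<4 x half> %a, <4 x half> %b, <4 x float> %c) {
  ; the trailing immediates are the cbsz/abid/blgp modifiers, all zero here
  %d = call <4 x float> @llvm.amdgcn.mfma.f32.16x16x16f16(
          <4 x half> %a, <4 x half> %b, <4 x float> %c, i32 0, i32 0, i32 0)
  ret <4 x float> %d
}
```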
- Reuse CUDA source code generation for HIP: In my experience, HIP and CUDA code overlap to a high degree: I can often just change `cuda_runtime.h` to `hip_runtime.h` and compile the code with hipcc, and it works correctly. For Matrix Core, I would mainly need to replace `nvcuda::wmma` with `rocwmma`.
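To illustrate the second route: rocWMMA deliberately mirrors the `nvcuda::wmma` fragment API, so a ported kernel is nearly a textual substitution. A minimal sketch of one wavefront computing a 16x16 output tile (shapes and the `float16_t` type follow rocWMMA’s documented API; launch code and error handling omitted):

```cpp
#include <hip/hip_runtime.h>    // was: #include <cuda_runtime.h>
#include <rocwmma/rocwmma.hpp>  // was: #include <mma.h>

using rocwmma::float16_t;       // was: __half

constexpr int M = 16, N = 16, K = 16;

// The only substantive change from the CUDA version is the namespace:
// nvcuda::wmma -> rocwmma.
__global__ void wmma_gemm_tile(const float16_t* a, const float16_t* b, float* c) {
    rocwmma::fragment<rocwmma::matrix_a, M, N, K, float16_t, rocwmma::row_major> fa;
    rocwmma::fragment<rocwmma::matrix_b, M, N, K, float16_t, rocwmma::col_major> fb;
    rocwmma::fragment<rocwmma::accumulator, M, N, K, float> fc;

    rocwmma::fill_fragment(fc, 0.0f);
    rocwmma::load_matrix_sync(fa, a, K);   // lda = K
    rocwmma::load_matrix_sync(fb, b, K);   // ldb = K
    rocwmma::mma_sync(fc, fa, fb, fc);     // c_tile += a_tile * b_tile
    rocwmma::store_matrix_sync(c, fc, N, rocwmma::mem_row_major);
}
```

The close API mapping is what makes a source-to-source route attractive: the existing CUDA tensor-core codegen in TVM could plausibly be parameterized over the namespace and header names rather than rewritten.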
Please let me know which route you prefer. Thanks!