Hi, NVIDIA released CUDA 10.2 yesterday. Here are a few things TVM can be excited about:
- Warp-wide mma operations for fp16 (mma.sync.m16n8k8)
- Warp-wide mma operations for s8 (mma.sync.m8n8k16)
- Warp-wide mma operations for s4 (mma.sync.m8n8k32)
- Loading matrices straight from shared memory into registers, ready to be consumed by the mma instructions (ldmatrix)
These are good additions to the mma.sync.m8n8k4 instruction from CUDA 10.1.
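
To get a feel for what the three new shapes cost per thread, here is a rough back-of-the-envelope sketch in Python. It assumes the usual reading of the mMnNkK name (A is MxK, B is KxN, the accumulator is MxN) and that the A and B operands are spread evenly over the 32 threads of a warp in 32-bit registers. The helper names (`per_thread_cost`, `summarize`) are made up for illustration and are not part of TVM or CUDA.

```python
# Back-of-the-envelope arithmetic for the warp-wide mma shapes listed above.
# Assumption: mMnNkK means A is MxK, B is KxN, accumulator is MxN, and the
# A/B operands are split evenly across the 32 threads of a warp.
WARP_SIZE = 32  # threads per warp
REG_BITS = 32   # bits per register

# (instruction shape, M, N, K, bits per A/B element)
SHAPES = [
    ("m16n8k8", 16, 8, 8, 16),  # fp16 inputs
    ("m8n8k16", 8, 8, 16, 8),   # s8 inputs
    ("m8n8k32", 8, 8, 32, 4),   # s4 inputs
]


def per_thread_cost(m, n, k, elem_bits):
    """Return 32-bit registers per thread for A and B, and MACs per instruction."""
    a_regs = m * k * elem_bits // (WARP_SIZE * REG_BITS)
    b_regs = k * n * elem_bits // (WARP_SIZE * REG_BITS)
    macs = m * n * k  # multiply-accumulates done by one warp-wide instruction
    return {"a_regs": a_regs, "b_regs": b_regs, "macs": macs}


def summarize():
    """Map each shape name to its per-thread register counts and MAC count."""
    return {name: per_thread_cost(m, n, k, bits) for name, m, n, k, bits in SHAPES}


if __name__ == "__main__":
    for name, cost in summarize().items():
        print(name, cost)
```

Under these assumptions the fp16 shape needs two registers of A and one of B per thread for 1024 MACs, while the s8 and s4 shapes each need a single register of A and of B. Since s4 packs twice as many values into a register as s8, m8n8k32 does 2048 MACs per instruction against 1024 for m8n8k16.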