Thanks for the wonderful library. It’s a pleasure to work with it.
I am a bit puzzled by the code generated for a matrix multiplication: I see neither 256-bit SIMD vectors nor FMA instructions, even though I specify the target as `llvm -mcpu=alderlake -mattr=+avx2 -num-cores=16`. What could be the reason?
My code is:
```python
import tvm
from tvm.script import tir as T
import numpy as np
from tvm import meta_schedule as ms


@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(
        A: T.Buffer[(1024, 1024), "float32"],
        B: T.Buffer[(1024, 1024), "float32"],
        C: T.Buffer[(1024, 1024), "float32"],
    ):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i, j, k in T.grid(1024, 1024, 1024):
            with T.block("C"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = 0.0
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]


dtype = "float32"
a_np = np.random.rand(1024, 1024).astype(dtype)
b_np = np.random.rand(1024, 1024).astype(dtype)
a_nd = tvm.nd.array(a_np)
b_nd = tvm.nd.array(b_np)
c_nd = tvm.nd.empty((1024, 1024), dtype="float32")

target = "llvm -mcpu=alderlake -mattr=+avx2 -num-cores=16"
database = ms.tune_tir(
    mod=MyModule,
    target=target,
    max_trials_global=64,
    num_trials_per_iter=64,
    work_dir="./tune_tmp",
)
sch_tuned = ms.tir_integration.compile_tir(database, MyModule, target=target)
print(sch_tuned.mod.script())

lib = tvm.build(sch_tuned.mod, target="llvm")
with open("/tmp/my_module.S", "w") as f:
    f.write(lib.get_source("asm"))
```
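For reference, the `main` prim_func above is just a plain row-by-column matmul; this is the NumPy oracle I compare results against (the names `matmul_ref`, `a`, and `b` are mine, not part of the script above):

```python
import numpy as np

def matmul_ref(a, b):
    """NumPy reference for the TIR kernel: C[i, j] = sum_k A[i, k] * B[k, j]."""
    return a @ b

# Small sanity check on random float32 inputs.
a = np.random.rand(8, 8).astype("float32")
b = np.random.rand(8, 8).astype("float32")
assert np.allclose(matmul_ref(a, b), np.einsum("ik,kj->ij", a, b), rtol=1e-5)
```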
Looking at the assembly, I see only 128-bit `xmm` vectors being used, along with `mulps` and `addps` instructions rather than fused `vfmadd` ones. What can I do to improve the codegen?