Why are neither 256-bit SIMD vectors nor FMA instructions used by autotuning?

Thanks for the wonderful library. It’s a pleasure to work with it.

I am a bit puzzled by the generated code for a matrix multiplication: I see neither 256-bit SIMD vectors nor FMA instructions, although I specify the target as llvm -mcpu=alderlake -mattr=+avx2 -num-cores=16. What could be the reason?

My code is:

import tvm
from tvm.script import tir as T
import numpy as np
from tvm import meta_schedule as ms


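# A naive 1024 x 1024 x 1024 float32 matmul expressed in TVMScript.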
@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(
        A: T.Buffer[(1024, 1024), "float32"],
        B: T.Buffer[(1024, 1024), "float32"],
        C: T.Buffer[(1024, 1024), "float32"],
    ):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i, j, k in T.grid(1024, 1024, 1024):
            with T.block("C"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = 0.0
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]


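# Random float32 inputs and an empty output buffer.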
dtype = "float32"
a_np = np.random.rand(1024, 1024).astype(dtype)
b_np = np.random.rand(1024, 1024).astype(dtype)

a_nd = tvm.nd.array(a_np)
b_nd = tvm.nd.array(b_np)
c_nd = tvm.nd.empty((1024, 1024), dtype="float32")

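# Autotune the matmul with MetaSchedule for the given target (64 trials).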
target = "llvm -mcpu=alderlake -mattr=+avx2 -num-cores=16"
database = ms.tune_tir(
    mod=MyModule,
    target=target,
    max_trials_global=64,
    num_trials_per_iter=64,
    work_dir="./tune_tmp",
)


sch_tuned = ms.tir_integration.compile_tir(database, MyModule, target=target)
print(sch_tuned.mod.script())

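# Build the tuned module and dump the generated x86 assembly.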
lib = tvm.build(sch_tuned.mod, target="llvm")
with open('/tmp/my_module.S', 'w') as f:
    f.write(lib.get_source('asm'))
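
For completeness, I also run the compiled function and sanity-check the result against numpy (the prim_func's global_symbol is "main", so the module can be indexed by that name):

# Run the compiled matmul and compare against a numpy reference.
lib["main"](a_nd, b_nd, c_nd)
np.testing.assert_allclose(c_nd.numpy(), a_np @ b_np, rtol=1e-4)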

Looking at the assembly, I see 128-bit xmm vectors being used, along with mulps and addps instructions. What can I do to improve the codegen?
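
To quantify this, here is a quick sketch that counts register classes and FMA mnemonics in the dump:

import re

with open('/tmp/my_module.S') as f:
    asm = f.read()

# xmm = 128-bit SSE registers, ymm = 256-bit AVX registers,
# vfmadd... = fused multiply-add mnemonics.
for pat in ("xmm", "ymm", "vfmadd"):
    print(pat, len(re.findall(pat, asm)))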

I realized that the docs for tvm.target.Target state that the mcpu option only serves as an annotation. Thus the mattr string becomes all the more important, and I need to pass the +fma flag too.
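
A sketch of the adjusted setup (the comma-separated -mattr list is my assumption about the target-string syntax), also passing the same target to the final tvm.build so the flags reach the LLVM backend of the exported module:

# +fma added; comma-separated -mattr list assumed to be accepted by the parser.
target = "llvm -mcpu=alderlake -mattr=+avx2,+fma -num-cores=16"

database = ms.tune_tir(
    mod=MyModule,
    target=target,
    max_trials_global=64,
    num_trials_per_iter=64,
    work_dir="./tune_tmp",
)
sch_tuned = ms.tir_integration.compile_tir(database, MyModule, target=target)

# Build with the same target string (not plain "llvm") so that
# -mcpu/-mattr also apply when generating the final assembly.
lib = tvm.build(sch_tuned.mod, target=target)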