@junrushao Yeah, I see, but it seems we're not yet able to lower & build a TIR module on the master branch?
(Maybe I can give it a try on the TensorIR private branch…)
@FrozenGene I agree. I think this is a limitation of all high-level abstractions; other implementations may end up with the same problem. So it looks like it will be harder to achieve our goal of using techniques like Ansor to solve most performance problems across different devices…
Another thing I found while investigating is a code snippet in ACL like this:
```
......
"ldr q6, [x15, #0x0]\n"
"fmla v8.4s, v6.4s, v0.s[0]\n"
......
"ldr q6, [x15, #0x40]\n"
"fmla v8.4s, v6.4s, v0.s[1]\n"
......
"ldr q6, [x15, #0x80]\n"
"fmla v8.4s, v6.4s, v0.s[2]\n"
......
"ldr q6, [x15, #0xc0]\n"
"fmla v8.4s, v6.4s, v0.s[3]\n"
......
```
Each pair loads four floats into q6 (v6) and then issues a by-element fmla: all four lanes of v6 are multiplied by a single lane of v0 (v0.s[0] through v0.s[3]) and accumulated into v8.
But the asm generated by TVM never seems to use lane-indexed operands like v*.s[1], v*.s[2], v*.s[3]. I think this is a simpler problem than the register buffer control one.
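To make the semantics of that ACL sequence concrete, here is a minimal pure-Python sketch of what the four `ldr`/`fmla` pairs compute. It is only a scalar model for discussion, not ACL or TVM code: the names `fmla_by_element`, `micro_kernel`, `b_panel` and `a_vec` are made up for illustration, and the stride of 16 floats is an assumption that follows from the 0x40-byte offsets with float32 data.

```python
def fmla_by_element(acc, vec, scalars, lane):
    """Model of `fmla vD.4s, vN.4s, vM.s[lane]`:
    every lane of `vec` is multiplied by the single element
    scalars[lane] and added to the matching lane of `acc`."""
    return [a + v * scalars[lane] for a, v in zip(acc, vec)]


def micro_kernel(b_panel, a_vec, stride_floats=16):
    """Model of the four ldr/fmla pairs in the snippet.

    b_panel       -- flat list of floats, the memory x15 points to (hypothetical name)
    a_vec         -- the four floats held in v0 (hypothetical name)
    stride_floats -- 0x40 bytes / 4 bytes per float32 = 16 (assumed float32 data)
    """
    acc = [0.0, 0.0, 0.0, 0.0]              # v8, the accumulator
    for lane in range(4):                   # v0.s[0] .. v0.s[3]
        off = lane * stride_floats          # #0x0, #0x40, #0x80, #0xc0
        v6 = b_panel[off:off + 4]           # ldr q6, [x15, #off]
        acc = fmla_by_element(acc, v6, a_vec, lane)
    return acc


if __name__ == "__main__":
    b_panel = [float(i) for i in range(64)]
    a_vec = [1.0, 2.0, 3.0, 4.0]
    print(micro_kernel(b_panel, a_vec))
```

The point is that v0 is loaded once and each of its four lanes is consumed directly by a separate `fmla`, so no extra broadcast (`dup`) into a full vector register is needed per multiply.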