I ran opt_gemm.py, building for a Cortex-A7 platform (ARMv7-A), using: target='llvm -device=arm_cpu -model=xxx -target=armv7a-linux-eabi -mattr=+neon,+thumb2 -mfloat-abi=soft'
I also tried setting the target triple to arm-linux-gnueabi, matching the output of my cross gcc -v. The LLVM version is 8.0.
The performance is poor, so I exported the .so library and ran objdump on it: there are no vfma.f32 instructions, only less efficient sequences of vmul and vadd.
In contrast, when I write a matmul in C and compile it with the cross gcc at -O3, I do see vfma.f32 in the objdump output.
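For reference, here is a sketch of the kind of C-style kernel I mean (not my exact test code, and the cross-compile flags in the comment are an assumption about a typical Cortex-A7 toolchain):

// Naive matmul kernel; with an FMA-capable FPU enabled, gcc -O3 contracts
// the multiply-add into vfma.f32 (floating-point contraction is on by default in gcc).
// Assumed compile command, adjust to your toolchain:
//   arm-linux-gnueabi-g++ -O3 -march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=softfp -c matmul.cpp
// (add -funsafe-math-optimizations if you also want NEON auto-vectorization)
void matmul(const float *A, const float *B, float *C, int N) {
  for (int i = 0; i < N; ++i) {
    for (int k = 0; k < N; ++k) {
      float a = A[i * N + k];
      for (int j = 0; j < N; ++j)
        C[i * N + j] += a * B[k * N + j];   // contracted to vfma.f32
    }
  }
}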
TVM doesn't emit the fma intrinsic directly; it emits llvm.fmuladd, e.g.:

%165 = tail call <32 x float> @llvm.fmuladd.v32f32(<32 x float> %161, <32 x float> %152, <32 x float> %164)

LLVM then checks this target hook:

/// Return true if an FMA operation is faster than a pair of fmul and fadd
/// instructions. fmuladd intrinsics will be expanded to FMAs when this method
/// returns true, otherwise fmuladd is expanded to fmul + fadd.
bool isFMAFasterThanFMulAndFAdd(EVT VT) const override;

So it seems LLVM considers FMA on the Cortex-A7 no faster than fmul + fadd.
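A small C++ illustration of that distinction (hypothetical example, assuming clang with -ffp-contract=on; inspect the IR with -emit-llvm -S):

// a * b + c is emitted as @llvm.fmuladd.f32: the backend selects vfma.f32
// only if isFMAFasterThanFMulAndFAdd() returns true, otherwise vmul + vadd.
float mla_contracted(float a, float b, float c) {
  return a * b + c;
}

// __builtin_fmaf is emitted as @llvm.fma.f32, which must stay fused; on a
// VFPv4-capable target this selects vfma.f32 regardless of the hook.
float mla_forced(float a, float b, float c) {
  return __builtin_fmaf(a, b, c);
}

The vector llvm.fmuladd that TVM emits goes through the same decision.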
On Cortex-A73 and Cortex-A7, we compared TVM GEMM performance with a hand-optimized assembly GEMM.
On Cortex-A73, TVM always gets close to the hand-optimized GEMM, while on Cortex-A7 it is far slower than the hand-optimized GEMM.
Of course, you might suspect that our Cortex-A7 hand-optimization is simply better, but even relative to a non-optimized C matmul (gcc -O3 only), TVM's speedup on Cortex-A7 is smaller than on other ARM architectures.
Thank you! I will try it.
Maybe that will give better performance on Cortex-A7; both the cross gcc -O3 build and the hand-optimized GEMM use vfma and get good performance.
Referring to llvm-8.0.0.src/lib/Target/ARM/ARMISelLowering.h as in your post: why does LLVM set it to false for all 32-bit ARM platforms?
I remember someone once told me that on some low-end ARM cores the vfma instruction isn't a truly fused operation; the CPU actually splits it into vmul and vadd when executing it.
/// isFMAFasterThanFMulAndFAdd - Return true if an FMA operation is faster
/// than a pair of fmul and fadd instructions. fmuladd intrinsics will be
/// expanded to FMAs when this method returns true, otherwise fmuladd is
/// expanded to fmul + fadd.
///
/// ARM supports both fused and unfused multiply-add operations; we already
/// lower a pair of fmul and fadd to the latter so it's not clear that there
/// would be a gain or that the gain would be worthwhile enough to risk
/// correctness bugs.
bool isFMAFasterThanFMulAndFAdd(EVT VT) const override { return false; }
The LLVM commit message that introduced this explicit override reads:
Explicitly define ARMISelLowering::isFMAFasterThanFMulAndFAdd. No functionality change.
Currently ARM is the only backend that supports FMA instructions (for at least some subtargets) but does not implement this virtual, so FMAs are never generated except from explicit fma intrinsic calls. Apparently this is due to the fact that it supports both fused (one rounding step) and unfused (two rounding step) multiply + add instructions. This patch clarifies that this is the case without changing behavior by implementing the virtual function to simply return false, as the default TargetLoweringBase version does.
It is possible that some cpus perform the fused version faster than the unfused version and vice-versa, so the function implementation should be revisited if hard data is found.
In my view that's an LLVM bug and needs to be fixed there. It would be good if you could submit that upstream, especially with a reference to the LLVM commit message about the default possibly being revisited on a per-CPU basis.
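For what it's worth, a rough sketch of what a per-subtarget version in ARMISelLowering.cpp might look like (not an actual patch: hasVFP4()/hasNEON() exist in ARMSubtarget, but preferFusedMAC() is a hypothetical tuning feature that would have to be added per CPU in ARM.td):

// Sketch only: make the answer depend on the subtarget instead of
// unconditionally returning false.
bool ARMTargetLowering::isFMAFasterThanFMulAndFAdd(EVT VT) const {
  // No fused multiply-add hardware at all, so nothing to gain.
  if (!Subtarget->hasVFP4() && !Subtarget->hasNEON())
    return false;
  // Hypothetical per-CPU tuning flag: set only on cores whose VFMA is a
  // genuinely fused pipeline operation, not cracked into vmul + vadd.
  if (!Subtarget->preferFusedMAC())
    return false;
  VT = VT.getScalarType();
  if (!VT.isSimple())
    return false;
  switch (VT.getSimpleVT().SimpleTy) {
  case MVT::f32:
  case MVT::f64:
    return true;
  default:
    return false;
  }
}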