I see. So, what I want to do seems difficult then.
I am profiling the requantize op on a Raspberry Pi 4. Requantize performs an int64 multiplication, an add, and a right shift. I observed that these are quite expensive, sometimes accounting for 30% of the total runtime of MobileNet-style networks. TFLite solves this problem with hand-written assembly, where VQRDMULH takes care of the int64 handling internally. The scalar reference for that instruction is:
// This function implements the same computation as the ARMv7 NEON VQRDMULH
// instruction.
template <>
inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
                                                      std::int32_t b) {
  bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  std::int64_t a_64(a);
  std::int64_t b_64(b);
  std::int64_t ab_64 = a_64 * b_64;
  std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}
Here, the pattern is probably too complex for LLVM to match and lower to that instruction on its own. I will take a deeper look and see whether I can use tensorize instead.
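To make the multiply, add, and shift sequence concrete, here is a minimal scalar sketch of a requantize step built on the same doubling-high-multiply. The names `sat_rounding_doubling_high_mul`, `rounding_right_shift`, and `requantize` are hypothetical, and the encoding of the scale as a Q31 `multiplier` plus a right `shift` is an assumption for illustration, not necessarily what TFLite or TVM actually use.

```cpp
#include <cstdint>
#include <limits>

// Scalar model of VQRDMULH on one 32-bit lane:
// saturate(round(2 * a * b / 2^32)). Hypothetical helper name.
inline std::int32_t sat_rounding_doubling_high_mul(std::int32_t a,
                                                   std::int32_t b) {
  // The only overflowing case is INT32_MIN * INT32_MIN.
  if (a == b && a == std::numeric_limits<std::int32_t>::min()) {
    return std::numeric_limits<std::int32_t>::max();
  }
  const std::int64_t ab = static_cast<std::int64_t>(a) * b;  // int64 multiply
  const std::int64_t nudge = ab >= 0 ? (1ll << 30) : (1 - (1ll << 30));
  return static_cast<std::int32_t>((ab + nudge) / (1ll << 31));
}

// Rounding right shift for 0 <= shift <= 31. Assumes an arithmetic shift
// for negative values (guaranteed from C++20, common practice before).
inline std::int32_t rounding_right_shift(std::int32_t x, int shift) {
  if (shift == 0) return x;
  const std::int64_t rounded =
      static_cast<std::int64_t>(x) + (1ll << (shift - 1));  // the "add"
  return static_cast<std::int32_t>(rounded >> shift);       // the shift
}

// Assumed encoding: real_scale ~= multiplier * 2^-31 * 2^-shift.
inline std::int32_t requantize(std::int32_t acc, std::int32_t multiplier,
                               int shift) {
  return rounding_right_shift(
      sat_rounding_doubling_high_mul(acc, multiplier), shift);
}
```

With `multiplier = 1 << 30` (0.5 in Q31) and `shift = 2`, `requantize(1000, 1 << 30, 2)` scales the accumulator by 0.125. On NEON the first helper maps to a single VQRDMULH per vector, which is why getting the compiler or tensorize to emit it matters.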
Thanks for the comments.