Different input and output dtypes for multiplication

anijain2305 · April 7, 2020, 8:46pm

Is it possible to have a Relay multiply operator that takes two int32 tensors as inputs but produces an int64 output?

This type of instruction is common in ARM HW. The issue is that if I make the inputs int64 by casting, it is not semantically equivalent to the assembly instruction, which does the int64 upcasting internally.

@tqchen @FrozenGene

anijain2305 · April 7, 2020, 8:48pm

Even if the HW instruction is not present, I can play around with hi/lo 16 bits of the input 32-bit tensors and perform 4 int32 multiplications instead of 1 int64 multiplication (which might be faster).
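A minimal sketch of that hi/lo decomposition, for illustration only. The helper name `MulInt32ViaHalves` is hypothetical, and the final recombination is done in int64 here purely so the result can be checked against a direct 64-bit product; a real kernel would keep the partial products in 32-bit lanes. Note that the lo*lo partial product needs an unsigned 32-bit multiply, since it can exceed the int32 range.

```cpp
#include <cstdint>

// Hypothetical helper: computes the full 64-bit product of two int32 values
// from four 32-bit partial products of their 16-bit halves.
inline std::int64_t MulInt32ViaHalves(std::int32_t a, std::int32_t b) {
  // Signed high halves. Right-shifting a negative value is arithmetic on
  // mainstream compilers (and guaranteed from C++20 on).
  const std::int32_t a_hi = a >> 16;
  const std::int32_t b_hi = b >> 16;
  // Unsigned low halves, each in [0, 65535].
  const std::uint32_t a_lo = static_cast<std::uint32_t>(a) & 0xFFFFu;
  const std::uint32_t b_lo = static_cast<std::uint32_t>(b) & 0xFFFFu;

  // Four 32-bit multiplications. The three signed ones fit in int32;
  // lo*lo can reach 65535^2, so it is done in uint32.
  const std::int32_t hi_hi = a_hi * b_hi;
  const std::int32_t hi_lo = a_hi * static_cast<std::int32_t>(b_lo);
  const std::int32_t lo_hi = static_cast<std::int32_t>(a_lo) * b_hi;
  const std::uint32_t lo_lo = a_lo * b_lo;

  // a*b = hi_hi * 2^32 + (hi_lo + lo_hi) * 2^16 + lo_lo.
  // Scaling by powers of two is written as a multiply to avoid
  // left-shifting negative values.
  return static_cast<std::int64_t>(hi_hi) * 4294967296LL +
         (static_cast<std::int64_t>(hi_lo) + lo_hi) * 65536LL +
         static_cast<std::int64_t>(lo_lo);
}
```

Whether this is actually faster than a single int64 multiply depends on the target and on how the recombination is vectorized, so it would need to be measured.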

FrozenGene · April 8, 2020, 4:55am

I don't think Relay can do this on its own, but maybe we could handle it in the codegen part. To generate a platform-specific instruction, codegen would match the pattern we want and then emit that instruction (using an LLVM intrinsic, or otherwise IR replacement). Maybe this is one way we could consider.

tqchen · April 8, 2020, 5:26am

Usually this is done by explicitly casting to i64 and then multiplying. After fusion, the LLVM instruction selector will be able to pattern-match and find the right instruction.
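The cast-then-multiply pattern looks like the following in plain C++. This is only a sketch of the source-level pattern, with a hypothetical function name `WideningMultiply`; which instruction the compiler actually selects (for example a widening multiply on ARM) depends on the target and optimization settings and should be confirmed in the generated assembly.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: element-wise int32 x int32 -> int64 multiply.
// Both operands are sign-extended to int64 before the multiply, which is
// the shape an instruction selector can match to a widening multiply.
inline void WideningMultiply(const std::int32_t* a, const std::int32_t* b,
                             std::int64_t* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = static_cast<std::int64_t>(a[i]) * static_cast<std::int64_t>(b[i]);
  }
}
```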

anijain2305 · April 8, 2020, 6:24am

I see. So, what I want to do seems difficult then.

I am profiling the requantize op on a Pi 4. Requantize has int64 multiplication, add, and right shift. I observed that they are quite expensive, sometimes accounting for 30% of the total runtime of MobileNet-type networks. TFLite solves this problem, obviously, by hand-writing assembly, where VQRDMULH internally takes care of the int64 handling.

// This function implements the same computation as the ARMv7 NEON VQRDMULH
// instruction.

template <>
inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
                                                      std::int32_t b) {
  bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  std::int64_t a_64(a);
  std::int64_t b_64(b);
  std::int64_t ab_64 = a_64 * b_64;
  std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}

Here, the pattern matching might be too difficult for LLVM to directly use that intrinsic. I will take a deeper look and see if I can use tensorize.
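For context, the int64 multiply, add, and right shift mentioned above can be sketched as a scalar fixed-point rescale. This is an illustrative assumption about the arithmetic, not TVM's actual requantize lowering: the name `RequantizeSketch`, the Q31 multiplier format, the round-half-up rule, and the lack of saturation are all choices made for this sketch.

```cpp
#include <cstdint>

// Hypothetical helper: scales x by (multiplier_q31 / 2^31) / 2^right_shift.
// Assumes 0 <= right_shift <= 31 and that the result fits in int32
// (no saturation is applied in this sketch).
inline std::int32_t RequantizeSketch(std::int32_t x,
                                     std::int32_t multiplier_q31,
                                     int right_shift) {
  const int total_shift = 31 + right_shift;
  // int64 multiplication: the product of two int32 values needs 64 bits.
  const std::int64_t prod = static_cast<std::int64_t>(x) * multiplier_q31;
  // int64 add: rounding term, half of the final divisor (rounds half up).
  const std::int64_t rounding = std::int64_t{1} << (total_shift - 1);
  // int64 right shift: arithmetic on mainstream compilers for negative
  // values (guaranteed from C++20 on).
  return static_cast<std::int32_t>((prod + rounding) >> total_shift);
}
```

For example, with `multiplier_q31 = 1 << 30` (0.5 in Q31) and `right_shift = 0`, an input of 100 is scaled to 50.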

Thanks for the comments.