Intuition on why this int8 algorithm is slower?

I dropped a print statement into the default AVX x86 conv2d schedule, so I know that this is the schedule that is being run.

To check if there is an int16 fallback, I can look at the code generated at each stage. However, wouldn’t int16 still be faster than float32, unless there is a large casting overhead?

It doesn’t look like an int16 fallback is happening; I explain how I checked below:

After quantization, before compilation

This is the same regardless of the backend I use, since we haven’t actually compiled at this point.
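
The quantize call itself is roughly the standard relay.quantize flow; a minimal sketch (the qconfig values below are placeholders for whatever mode selects, not my exact settings):

from tvm import relay

def quantize(mod, params, mode="int8"):
    # placeholder qconfig: global-scale calibration; by default the first conv layer is left in float32
    with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
        return relay.quantize.quantize(mod, params)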

I get the following output from running mod = quantize(mod, params, mode=mode); print(mod):

def @main(%data: Tensor[(1, 3, 64, 64), float32]) -> Tensor[(1, 16, 64, 64), float32] {
  %0 = nn.conv2d(%data, meta[relay.Constant][0] /* ty=Tensor[(32, 3, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=32, kernel_size=[3, 3]) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %1 = nn.relu(%0) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %2 = annotation.stop_fusion(%1) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %3 = multiply(%2, 16f /* ty=float32 */) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %4 = round(%3) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %5 = clip(%4, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %6 = cast(%5, dtype="int8") /* ty=Tensor[(1, 32, 64, 64), int8] */;
  %7 = nn.conv2d(%6, meta[relay.Constant][1] /* ty=Tensor[(16, 32, 3, 3), int8] */, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3], out_dtype="int32") /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %8 = nn.relu(%7) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %9 = add(%8, 1024 /* ty=int32 */) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %10 = right_shift(%9, 11 /* ty=int32 */) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %11 = clip(%10, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %12 = cast(%11, dtype="int8") /* ty=Tensor[(1, 16, 64, 64), int8] */;
  %13 = annotation.stop_fusion(%12) /* ty=Tensor[(1, 16, 64, 64), int8] */;
  %14 = cast(%13, dtype="float32") /* ty=Tensor[(1, 16, 64, 64), float32] */;
  multiply(%14, 0.0625f /* ty=float32 */) /* ty=Tensor[(1, 16, 64, 64), float32] */
}
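
As a sanity check on what the int32-to-int8 requantize in %9 through %12 is doing: adding 1024 and shifting right by 11 is just a rounding divide by 2048, followed by a clamp to the int8 range. In plain Python:

def requantize(x):
    # rounding divide by 2**11 = 2048 (the +1024 is the "add half" step), then clip to int8 range
    y = (x + 1024) >> 11
    return max(-127, min(127, y))

print(requantize(3071))  # 1  (3071 / 2048 is about 1.4997, rounds down)
print(requantize(3072))  # 2  (3072 / 2048 = 1.5, rounds up)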

After compilation

Instead of creating a GraphModule, I compile using relay.build, i.e.:

with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, target_host=target)
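
(For timing, the compiled artifacts can then be loaded into the graph runtime in the usual way; a sketch, with the input name and shape taken from the IR above:)

import numpy as np
import tvm
from tvm.contrib import graph_runtime

ctx = tvm.cpu(0)
module = graph_runtime.create(graph, lib, ctx)
module.set_input(**params)
module.set_input("data", np.random.rand(1, 3, 64, 64).astype("float32"))

# average over several runs to smooth out noise
ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=3)
print("mean inference time (s):", np.mean(ftimer().results))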

Print graph

If I run print(graph), I see that the types look fine:

  "attrs": {
    "dltype": [
      "list_str",
      [
        "float32",
        "float32",
        "float32",
        "float32",
        "uint8",
        "int8",
        "int32",
        "int8",
        "float32"
      ]
    ],
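
Rather than eyeballing the JSON, the per-node dtype list can also be pulled out programmatically; a small sketch, assuming graph is the JSON string returned by relay.build:

import json

g = json.loads(str(graph))
# "dltype" is stored as ["list_str", [one dtype per node]]
dtypes = g["attrs"]["dltype"][1]
print(dtypes)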

LLVM source

The only way I know to look at the generated code directly is by dumping the LLVM IR with lib.get_source(). The output is of course very verbose, and I see lots of i16 and i8 instructions.
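
One crude way to make that dump easier to scan is to count how often each element type appears in the IR text, for example:

import re

llvm_ir = lib.get_source()
for ty in ["i8", "i16", "i32", "float"]:
    # rough count of occurrences of each element type in the LLVM IR
    print(ty, len(re.findall(r"\b{}\b".format(ty), llvm_ir)))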