Intuition on why this int8 algorithm is slower?

I’ve been exploring quantization in TVM, and one thing I found is that there is a special compute/schedule for running int8 conv2d on the CPU (see here). From what I can tell, it seems to be pretty much the same as the standard CPU spatial pack convolution.

To explore this, I tried disabling this special compute/schedule and letting the quantized model use the standard spatial pack algorithm (just running on quantized data). When I do this, I see the expected slowdown compared to the specialized version; however, I also see an unexpected slowdown compared to the float32 version of the same algorithm.

For a simple example I get the following results:

default int8: 7.529054908081889
modified int8: 23.42591354623437
normal float32: 11.465726513415575

(Disabling the algorithm is very simple: just comment out the if block that checks for int8 here).
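
To illustrate the shape of that dispatch, here is a sketch; the names below are placeholders, not the actual TVM identifiers:

def pick_conv2d_impl(data_dtype, kernel_dtype):
    # Illustrative only: the real check lives in the x86 topi conv2d code.
    if data_dtype in ("int8", "uint8") and kernel_dtype == "int8":
        # This is the branch I comment out.
        return "specialized int8 compute/schedule"
    return "generic spatial-pack compute/schedule"

print(pick_conv2d_impl("int8", "int8"))        # default: specialized path
print(pick_conv2d_impl("float32", "float32"))  # spatial pack; with the check removed,
                                               # int8 inputs also end up here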

My main question is why am I seeing a slowdown using the standard convolution approach?

Surely the operations would be the same, just using integers? And on most CPUs, those integer operations would take fewer clock cycles. So where is the overhead coming from?

I would assume that the specialised compute/schedule better exploits the quantization (e.g. the fact that you can fit more values into a SIMD register: a 256-bit AVX2 register holds 32 int8 lanes versus 8 float32 lanes). However, that still doesn’t explain why the modified int8 version is slower than the normal float32 version.

One suggestion I received was that the slowdown might be due to an int16 fallback, or that, since I modified the compute, the “right” schedule might not be getting called.

I dropped a print statement into the default AVX x86 conv2d schedule, so I know that this is the schedule that is being run.

To check if there is an int16 fallback, I can look at the code generated at each stage. However, wouldn’t int16 still be faster than float32, unless there is a big casting overhead?

It doesn’t look like there is an int16 fallback happening; I explain how I checked below:

After quantization, before compilation

This is the same regardless of the backend I use, since we haven’t actually compiled at this point.
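
For context, quantize in the snippet below is a thin wrapper of my own around relay.quantize. Roughly, it looks like this (the qconfig values shown are illustrative rather than my exact settings, and the mode handling is omitted):

from tvm import relay

def quantize(mod, params, mode="int8"):
    # Sketch of the wrapper (mode handling omitted); the qconfig values are
    # illustrative. relay.quantize.quantize does the annotation/realization.
    with relay.quantize.qconfig(global_scale=8.0):
        return relay.quantize.quantize(mod, params=params)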

I get the following output from running mod = quantize(mod, params, mode=mode) followed by print(mod):

def @main(%data: Tensor[(1, 3, 64, 64), float32]) -> Tensor[(1, 16, 64, 64), float32] {
  %0 = nn.conv2d(%data, meta[relay.Constant][0] /* ty=Tensor[(32, 3, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=32, kernel_size=[3, 3]) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %1 = nn.relu(%0) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %2 = annotation.stop_fusion(%1) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %3 = multiply(%2, 16f /* ty=float32 */) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %4 = round(%3) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %5 = clip(%4, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 32, 64, 64), float32] */;
  %6 = cast(%5, dtype="int8") /* ty=Tensor[(1, 32, 64, 64), int8] */;
  %7 = nn.conv2d(%6, meta[relay.Constant][1] /* ty=Tensor[(16, 32, 3, 3), int8] */, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3], out_dtype="int32") /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %8 = nn.relu(%7) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %9 = add(%8, 1024 /* ty=int32 */) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %10 = right_shift(%9, 11 /* ty=int32 */) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %11 = clip(%10, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 16, 64, 64), int32] */;
  %12 = cast(%11, dtype="int8") /* ty=Tensor[(1, 16, 64, 64), int8] */;
  %13 = annotation.stop_fusion(%12) /* ty=Tensor[(1, 16, 64, 64), int8] */;
  %14 = cast(%13, dtype="float32") /* ty=Tensor[(1, 16, 64, 64), float32] */;
  multiply(%14, 0.0625f /* ty=float32 */) /* ty=Tensor[(1, 16, 64, 64), float32] */
}

After compilation

Instead of creating a GraphModule, I compile using relay.build, i.e.:

with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, target_host=target)
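
For completeness, I time the compiled module with the graph runtime’s time_evaluator, roughly along these lines (the evaluator settings here are not my exact ones):

import numpy as np
import tvm
from tvm.contrib import graph_runtime

ctx = tvm.cpu(0)
module = graph_runtime.create(graph, lib, ctx)
module.set_input(**params)
module.set_input("data", np.random.uniform(size=(1, 3, 64, 64)).astype("float32"))

# Average over repeated runs so the comparison between variants is stable.
ftimer = module.module.time_evaluator("run", ctx, number=100, repeat=3)
print(ftimer().mean * 1e3, "ms")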

Print graph

If I run print(graph), I can see that the types look fine:

  "attrs": {
    "dltype": [
      "list_str",
      [
        "float32",
        "float32",
        "float32",
        "float32",
        "uint8",
        "int8",
        "int32",
        "int8",
        "float32"
      ]
    ],
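
The same check can be done programmatically on the graph JSON, e.g.:

import json

dltypes = json.loads(graph)["attrs"]["dltype"][1]  # index 1 skips the "list_str" tag
print(dltypes)  # no "int16" entries show up here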

LLVM source

The only way I know of to look at the generated code directly is to dump the LLVM IR using lib.get_source(). The output is of course very verbose, and I see lots of i16 and i8 instructions.
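
To make that a bit more tractable, I can just tally the element types that show up in the IR (a quick sketch):

import re
from collections import Counter

llvm_ir = lib.get_source()  # LLVM IR as text; lib.get_source("asm") should give assembly instead
print(Counter(re.findall(r"\b(i8|i16|i32|float|double)\b", llvm_ir)))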