I dropped a print statement into the default AVX x86 conv2d schedule, so I know that this is the schedule being run.
To check whether there is an int16 fallback, I can look at the code generated at each stage. However, wouldn't int16 still be faster than float32, unless there is a large casting overhead?
It doesn't look like an int16 fallback is happening; I explain how I checked below:
After quantization, before compilation
This is the same regardless of the backend I use, since nothing has actually been compiled at this point. Running mod = quantize(mod, params, mode=mode); print(mod) gives the following output:
def @main(%data: Tensor[(1, 3, 64, 64), float32]) -> Tensor[(1, 16, 64, 64), float32] {
%0 = nn.conv2d(%data, meta[relay.Constant][0] /* ty=Tensor[(32, 3, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=32, kernel_size=[3, 3]) /* ty=Tensor[(1, 32, 64, 64), float32] */;
%1 = nn.relu(%0) /* ty=Tensor[(1, 32, 64, 64), float32] */;
%2 = annotation.stop_fusion(%1) /* ty=Tensor[(1, 32, 64, 64), float32] */;
%3 = multiply(%2, 16f /* ty=float32 */) /* ty=Tensor[(1, 32, 64, 64), float32] */;
%4 = round(%3) /* ty=Tensor[(1, 32, 64, 64), float32] */;
%5 = clip(%4, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 32, 64, 64), float32] */;
%6 = cast(%5, dtype="int8") /* ty=Tensor[(1, 32, 64, 64), int8] */;
%7 = nn.conv2d(%6, meta[relay.Constant][1] /* ty=Tensor[(16, 32, 3, 3), int8] */, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3], out_dtype="int32") /* ty=Tensor[(1, 16, 64, 64), int32] */;
%8 = nn.relu(%7) /* ty=Tensor[(1, 16, 64, 64), int32] */;
%9 = add(%8, 1024 /* ty=int32 */) /* ty=Tensor[(1, 16, 64, 64), int32] */;
%10 = right_shift(%9, 11 /* ty=int32 */) /* ty=Tensor[(1, 16, 64, 64), int32] */;
%11 = clip(%10, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 16, 64, 64), int32] */;
%12 = cast(%11, dtype="int8") /* ty=Tensor[(1, 16, 64, 64), int8] */;
%13 = annotation.stop_fusion(%12) /* ty=Tensor[(1, 16, 64, 64), int8] */;
%14 = cast(%13, dtype="float32") /* ty=Tensor[(1, 16, 64, 64), float32] */;
multiply(%14, 0.0625f /* ty=float32 */) /* ty=Tensor[(1, 16, 64, 64), float32] */
}
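To double-check this without eyeballing the dump, the module can also be walked programmatically. Below is a minimal sketch that collects the output dtype of every nn.conv2d call; it assumes a TVM version where relay.analysis.post_order_visit is available, and relies on type inference having already run (the type annotations in the printout above show it has):

import tvm
from tvm import relay

def conv2d_out_dtypes(mod):
    # Collect the checked output dtype of every nn.conv2d call in main.
    found = []
    def visit(expr):
        if isinstance(expr, relay.Call) and getattr(expr.op, "name", None) == "nn.conv2d":
            found.append(expr.checked_type.dtype)
    relay.analysis.post_order_visit(mod["main"], visit)
    return found

print(conv2d_out_dtypes(mod))  # expect ['float32', 'int32'] for the module above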
After compilation
Instead of creating a GraphModule, I compile using relay.build
, i.e.:
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, target_host=target)
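For completeness, the resulting triple can then be loaded into a runtime module by hand. A minimal sketch using the graph_runtime API (renamed graph_executor in newer TVM), with a random input matching the data tensor from the IR above:

import numpy as np
import tvm
from tvm.contrib import graph_runtime

ctx = tvm.cpu(0)
m = graph_runtime.create(graph, lib, ctx)
m.set_input(**params)
m.set_input("data", np.random.uniform(size=(1, 3, 64, 64)).astype("float32"))
m.run()
out = m.get_output(0).asnumpy()  # shape (1, 16, 64, 64), float32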
Print graph
If I run print(graph), I see that the types look fine:
"attrs": {
"dltype": [
"list_str",
[
"float32",
"float32",
"float32",
"float32",
"uint8",
"int8",
"int32",
"int8",
"float32"
]
],
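Since graph is just a JSON string here, the same dtype list can be pulled out programmatically instead of read off the dump; a small sketch:

import json

g = json.loads(graph)
dltypes = g["attrs"]["dltype"][1]  # index 0 is the "list_str" tag
print(dltypes)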
LLVM source
The only way I know to look at the generated code directly is by dumping the LLVM IR with lib.get_source(). The output is of course very verbose, and I see lots of i16 and i8 instructions in it.
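One way to make the dump manageable is to filter it for the integer types of interest, e.g.:

llvm_ir = lib.get_source()  # the full LLVM IR as one string
for i, line in enumerate(llvm_ir.splitlines()):
    if "i16" in line:  # crude substring match, also catches vector types like <16 x i16>
        print(i, line)

Passing "asm" to get_source gives the assembly instead, which can be easier to scan for the vectorized integer instructions.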