I am working with an int8-quantised model (a super simple 3-layer TFLite one, gist here, using tensorflow==2.15, not 2.16). If you load the model in Netron, you can see that the main (depth)conv2d weights are int8.
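In case it is useful, the same dtypes can be checked without Netron via the TF Lite interpreter's tensor details. A minimal sketch, assuming the gist model has been saved locally (the model_path is just a placeholder):

import tensorflow as tf  # 2.15, as above

interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder path
interpreter.allocate_tensors()

# Print every tensor's name, shape and dtype; the (depth)conv2d weight
# tensors should report int8 here, matching what Netron shows.
for detail in interpreter.get_tensor_details():
    print(detail["name"], detail["shape"], detail["dtype"])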
However, if I load the model in TVM (and export the weights using the debugger, so I can look at them in JSON), you can see that they are using a type which is too large.
The weights of the depthwise conv (of shape 1x3x3x3) are called p1 and are of type int16. Similarly, the conv2d weights (of shape 1x1x3x64), p8, are also int16.
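For what it's worth, a quick way to check whether the weights are already widened straight out of the frontend (rather than later during build) is to print the dtypes in the params dict that from_tflite returns (using relay_mod/params from the loading code at the end of this post). Note the keys here are the frontend's parameter names, not the p1/p8 names from the debug dump:

# relay_mod / params come from the from_tflite call shown below
for name, arr in params.items():
    # each value is a tvm.nd.NDArray, so .shape and .dtype show
    # exactly what the frontend produced for each weight
    print(name, arr.shape, arr.dtype)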
I’ve seen this for other models I’ve been working with, but this simple example is easier to discuss.
Is this intended behavior? Is there a way to disable it?
I can imagine a justification that int16 is faster than int8 on some architectures. However, in my case memory is very constrained, so doubling the size of the weights really hurts.
I’m loading my model into TVM with:
from tvm import relay

input_dtype = "int8"
relay_mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={input_name: input_shape},
    dtype_dict={input_name: input_dtype},
)
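Printing the imported module also shows the dtype annotation on every weight variable before any target-specific lowering has run, which might help pin down where the int16 first appears:

# The Relay text form annotates each parameter with its shape and dtype,
# e.g. Tensor[(1, 3, 3, 3), int8] vs Tensor[(1, 3, 3, 3), int16]
print(relay_mod["main"])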