TF Lite quantized conv2d operator conversion

In fact, on ARM CPU, if we want to produce VMLAL, the normal way cannot achieve it, i.e. neither Option 1 nor Option 2 works, because ARM CPU only supports int16->int32 VMLAL. So we have to tensorize: load 8 uint8 elements and convert them to int16, subtract the zero_point, then compute; finally we can leverage VMLAL (see the sketch below). Option 1 may be a better fit for the VNNI instruction on Intel CPU.
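For concreteness, here is a minimal sketch of that inner step using NEON intrinsics in C. The helper name and signature are my own for illustration, not from TVM; it assumes uint8 inputs with per-tensor zero points:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical inner step of the tensorized kernel: widen 8 uint8 lanes
   to int16, subtract the zero points, then accumulate with VMLAL
   (int16 x int16 -> int32 multiply-accumulate). */
static inline int32x4_t qconv_step(int32x4_t acc,
                                   const uint8_t *a, int16_t a_zp,
                                   const uint8_t *b, int16_t b_zp) {
    /* Load 8 uint8 elements and widen to int16 (values <= 255 fit safely). */
    int16x8_t va = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(a)));
    int16x8_t vb = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(b)));
    /* Subtract the zero points while still in int16. */
    va = vsubq_s16(va, vdupq_n_s16(a_zp));
    vb = vsubq_s16(vb, vdupq_n_s16(b_zp));
    /* vmlal_s16 is the VMLAL the text refers to: it multiplies int16 lanes
       and accumulates into int32, here over the low and high halves. */
    acc = vmlal_s16(acc, vget_low_s16(va), vget_low_s16(vb));
    acc = vmlal_s16(acc, vget_high_s16(va), vget_high_s16(vb));
    return acc;
}
```

The compiler lowers `vmlal_s16` to VMLAL.S16 on AArch32 (SMLAL on AArch64), which is exactly why the uint8 data must be widened to int16 first.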

Maybe I prefer Option 2: we provide the standard spatial_pack schedule on ARM CPU / Intel CPU, and on the generic target (i.e. nn.py) we provide q_conv2d's compute in a naive way (a reference sketch follows). According to our tests, the spatial pack schedule performs well and is very easy to implement. But if you want to improve performance further, you can write a tensorized version for your target, for example to produce ARM CPU's VMLAL or Intel's VNNI.
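As a reference for what the naive generic compute expresses, here is a sketch in C (names and layout are my assumptions: CHW, single image, unit stride, no padding), doing the same widen, subtract zero point, multiply, accumulate arithmetic that the nn.py compute would describe:

```c
#include <stdint.h>

/* Naive quantized conv2d reference (hypothetical; CHW layout, stride 1,
   no padding). Every product is taken on zero-point-corrected int32 values,
   so the result matches the quantized semantics exactly, just slowly. */
void q_conv2d_naive(const uint8_t *data, const uint8_t *kernel, int32_t *out,
                    int in_c, int in_h, int in_w, int out_c, int k_h, int k_w,
                    int32_t data_zp, int32_t kernel_zp) {
    int out_h = in_h - k_h + 1;
    int out_w = in_w - k_w + 1;
    for (int f = 0; f < out_c; ++f)
        for (int y = 0; y < out_h; ++y)
            for (int x = 0; x < out_w; ++x) {
                int32_t acc = 0;
                for (int c = 0; c < in_c; ++c)
                    for (int ry = 0; ry < k_h; ++ry)
                        for (int rx = 0; rx < k_w; ++rx) {
                            /* Widen to int32 and subtract zero points first. */
                            int32_t a = (int32_t)data[(c * in_h + (y + ry)) * in_w + (x + rx)] - data_zp;
                            int32_t w = (int32_t)kernel[((f * in_c + c) * k_h + ry) * k_w + rx] - kernel_zp;
                            acc += a * w;
                        }
                out[(f * out_h + y) * out_w + x] = acc;
            }
}
```

A target-specific schedule then only has to rearrange and vectorize these loops (spatial pack, or tensorize the innermost reduction into VMLAL/VNNI) without touching the arithmetic.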