Just want to share the performance numbers for TFLite and TVM on TFLite pre-quantized models (models that have already been quantized using TensorFlow/TFLite). TFLite is faster than TVM for now. This thread can be used to discuss possible improvements, and we can keep updating the numbers as we come up with better schedules or Relay optimizations.
Setup - Rasp4b - Both TVM and TFLite are running with 4 threads. TVM kernels have already been tuned.
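For reference, this is roughly how such numbers can be collected. A minimal sketch, assuming a TVM module already exported as a .so and the matching quantized .tflite file; the file names, input name, and shapes are placeholders, and API names may differ slightly across TVM/TF versions:

```python
import os, time
import numpy as np
import tvm
from tvm.contrib import graph_executor
import tensorflow as tf

os.environ["TVM_NUM_THREADS"] = "4"          # match the 4-thread setup (read lazily by the runtime)
data = np.random.randint(0, 255, size=(1, 224, 224, 3)).astype("uint8")

# --- TVM: load the pre-built, pre-tuned module and time its "run" function ---
dev = tvm.cpu(0)
lib = tvm.runtime.load_module("mobilenet_v2_quant.so")      # placeholder artifact
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", data)                              # input name is model-specific
ftimer = module.module.time_evaluator("run", dev, number=10, repeat=3)
print("TVM mean latency: %.2f ms" % (np.mean(ftimer().results) * 1000))

# --- TFLite: run the same model through the interpreter with 4 threads ---
interpreter = tf.lite.Interpreter(model_path="mobilenet_v2_quant.tflite", num_threads=4)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], data)
interpreter.invoke()                                          # warm-up
start = time.time()
for _ in range(30):
    interpreter.invoke()
print("TFLite mean latency: %.2f ms" % ((time.time() - start) / 30 * 1000))
```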
If we are slower on 4 threads, I think we will be even slower relative to TFLite on 1 thread.
I think the reason has been discussed before, especially in @jackwish’s answer here: TF Lite quantized conv2d operator conversion - #20 by jackwish - Questions - Apache TVM Discuss. Your performance numbers are almost the same as our initial quantized performance (although we only recorded MobileNet V1 / V2). @jackwish’s post shares our development experience of improving performance. I am almost sure the reason is the intermediate memory access. As @jackwish shared:
If we break these two steps into two operators, the first reads INT8 and writes INT32 to memory, and the second reads INT32 and writes INT8. In our tests, this approach showed a significant performance drop. (To be clear, we have tensorized step 1, which may prevent step 2 from being fused.) As soon as we merged them into one tensorized micro kernel, we got basically the same performance as QNNPACK. The difference is whether there is INT32 intermediate memory access in the operator; if the computation is merged, the INT32 intermediate result (the accumulated result) can stay in registers.
So, to summarize, regarding @janimesh 's proposals, I think option 1 may get performance similar to TFLite, while option 2 is more capable of enabling a powerful tensorize design.
This is why we created one operator named q_conv2d to do all the work.
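To make the "INT32 stays in registers" point concrete, here is a toy NumPy illustration (not our real kernels; the shapes, scale, and zero point are made up, and the requantize here is simplified) contrasting the two-operator version with the merged one:

```python
import numpy as np

def requantize(acc_int32, scale, zero_point):
    # Simplified: the real implementation uses a fixed-point multiplier and
    # a specific rounding mode instead of a float multiply.
    return np.clip(np.round(acc_int32 * scale) + zero_point, 0, 255).astype(np.uint8)

x = np.random.randint(0, 255, size=(64, 32), dtype=np.uint8)
w = np.random.randint(0, 255, size=(32, 16), dtype=np.uint8)

# Two-operator version: the whole INT32 result is written to memory (op 1),
# then read back and requantized to INT8 (op 2).
acc = x.astype(np.int32) @ w.astype(np.int32)
out_two_ops = requantize(acc, 0.0002, 3)

# Merged version: each INT32 accumulator exists only inside the inner loop
# (i.e. in registers in the tensorized micro kernel); only INT8 is stored.
out_merged = np.empty((64, 16), dtype=np.uint8)
for i in range(64):
    for j in range(16):
        acc_ij = np.int32(0)
        for k in range(32):
            acc_ij += np.int32(x[i, k]) * np.int32(w[k, j])
        out_merged[i, j] = requantize(acc_ij, 0.0002, 3)

assert (out_two_ops == out_merged).all()   # same math, different memory traffic
```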
However, I haven’t upstreamed our implementation yet because I hope our upcoming auto scheduler (coming very soon) can help us complete some of the work (like layout), and then we can contribute our implementation on top of it. I don’t want you to think that we don’t want to contribute, so I wanted to explain.
Thanks @FrozenGene, I agree. We need better schedules. Currently I am using NCHWc, which is better than NCHW, but might be slower than NHWC. Another major improvement should come from tensorization; currently we are relying on LLVM.
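On the layout point, Relay's ConvertLayout pass can be used to experiment with different conv2d layouts. A minimal sketch with a plain nn.conv2d and made-up shapes (in a pre-quantized graph the op would typically be qnn.conv2d, and whether it is covered depends on the TVM version):

```python
import tvm
from tvm import relay

# Tiny stand-in graph; a real module would come from relay.frontend.from_tflite.
data = relay.var("data", shape=(1, 3, 56, 56), dtype="float32")
weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
conv = relay.nn.conv2d(data, weight, kernel_size=(3, 3), channels=16,
                       data_layout="NCHW", kernel_layout="OIHW")
mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

# Ask ConvertLayout to rewrite conv2d to NHWC ("default" picks the matching
# kernel layout); the pass inserts layout_transform ops where needed.
desired_layouts = {"nn.conv2d": ["NHWC", "default"]}
seq = tvm.transform.Sequential([
    relay.transform.ConvertLayout(desired_layouts),
    relay.transform.InferType(),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
print(mod)
```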
For the Int8/Int32 memory bandwidth issue, this is already handled by Relay fusion. Currently conv2d is fused with the 7-8 ops that follow it, essentially fusing conv2d + requantize. I think we can further micro-optimize this, but the structure is already there. We would not need any new Relay/TVM feature.
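For reference, the pattern that gets fused looks roughly like this in Relay. A minimal sketch with made-up shapes and quantization parameters (depending on the TVM version, the scales and zero points are plain scalars or relay constants):

```python
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 3, 56, 56), dtype="int8")
kernel = relay.var("kernel", shape=(16, 3, 3, 3), dtype="int8")

# qnn.conv2d accumulates in INT32 ...
conv = relay.qnn.op.conv2d(
    data, kernel,
    input_zero_point=relay.const(0, "int32"),
    kernel_zero_point=relay.const(0, "int32"),
    input_scale=relay.const(0.05, "float32"),
    kernel_scale=relay.const(0.02, "float32"),
    kernel_size=(3, 3), channels=16, padding=(1, 1),
    out_dtype="int32")

# ... and requantize scales the accumulator back down to INT8.
out = relay.qnn.op.requantize(
    conv,
    input_scale=relay.const(0.001, "float32"),    # input_scale * kernel_scale
    input_zero_point=relay.const(0, "int32"),
    output_scale=relay.const(0.1, "float32"),
    output_zero_point=relay.const(0, "int32"),
    out_dtype="int8")

mod = tvm.IRModule.from_expr(relay.Function([data, kernel], out))
print(mod)
# After QNN legalization and FuseOps, the conv2d and the requantize arithmetic
# land in one fused primitive function, so the INT32 tensor does not have to be
# written back to memory between them.
```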
Looking forward to your auto-scheduler work. And I hope it can help int8 schedules as well.
@anijain2305 As we have delivered our work to OSDI and are rebasing our code onto the latest master (I hope we can start to bring it in this month), I want to share one quick data point with you. On a Rasp 3B+, with the MobileNet V2 quantized model, TFLite 2.1 takes 53.839 ms (a big improvement over TFLite 1.14), AutoTVM takes 76.08 ms, but the auto scheduler takes 43.53 ms, i.e. about 1.2x faster than TFLite. In fact, we still have room to improve (reducing load instructions), but I think it is a good start.
@henry099 Yes, we have seen TFLite and TVM outputs differ slightly. This is due to differences between TFLite and TVM compute (rounding and maybe some other differences). However, we have observed that these minor differences have minimal effect on application accuracy (Top1/Top5).
@anijain2305 Thanks for replying. As a TVM beginner, I think a quantized model should be computed entirely in integer types, so where does the rounding error come from? I tried replacing TVM’s multiplier with the same one TFLite uses, but the results still differ. Any other clues to try?
I tried models like mobilenet_0.25_128/96, and the Top1/Top5 accuracy is affected more there.
I have not tried mobilenet_0.25. I tried the original MobileNet V1 and V2 and got good results.
Yes, the quantized convolutions use integer datatypes, but we have to call the requantize operator frequently to adjust the quantization parameters. Requantize requires a fixed-point multiplication, and hence a rounding.
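To illustrate where the rounding enters: the real-valued requantize multiplier is represented as a fixed-point value plus a right shift, and the rounding convention used in that shift is where TVM and TFLite can differ by one bit. A toy Python sketch (not TVM's actual code; the helper names and constants are made up):

```python
import math

def quantize_multiplier(real_multiplier):
    # Express M as M0 * 2^-shift, with M0 a 31-bit fixed-point value in [0.5, 1).
    mantissa, exponent = math.frexp(real_multiplier)
    m0 = int(round(mantissa * (1 << 31)))
    return m0, -exponent

def requantize_scalar(acc, real_multiplier, zero_point):
    m0, shift = quantize_multiplier(real_multiplier)
    total_shift = 31 + shift
    # Add half of the shifted-out range for round-to-nearest; other rounding
    # modes (round-half-away-from-zero, round-to-even) flip borderline values.
    rounded = (acc * m0 + (1 << (total_shift - 1))) >> total_shift
    return max(0, min(255, rounded + zero_point))

# acc = 12345, M = 0.0072: the exact result is 88.884, so a one-ULP difference
# in the rounding step can move the INT8 output by 1.
print(requantize_scalar(12345, 0.0072, 3))
```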
Hi folks, has the performance diff been addressed? OctoML has the capability to compile models and measure latency with both TVM and TFLite. Do we have a similar effort/capability?