Benchmarking Quantization on Intel CPU

Which parameters are you using to run the evaluate.py script?

Do you mean the model I used?

It is my own MobileNet SSD based on GluonCV. I have merged the BatchNorm layers into the convolutions manually (verified, this step is correct), and I use MultiBoxPrior and MultiBoxDetection instead of the original implementation of the SSD detector head.

The configs of the script in my case (a rough sketch follows the list):

  • replaced args.target = 'llvm' with args.target = 'llvm -mcpu=core-avx2'
  • used my own .rec dataset and generated the dataloader object named eval_data, shown in my picture above.
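
Roughly, the two changes look like this (a minimal sketch; the .rec path is a placeholder, and the 300x300 shape matches my SSD input):

```python
import mxnet as mx

# Change 1: target the AVX2-capable CPU instead of plain 'llvm'.
args.target = 'llvm -mcpu=core-avx2'

# Change 2: build eval_data from a custom .rec file.
eval_data = mx.io.ImageRecordIter(
    path_imgrec='my_val.rec',    # placeholder path to the custom dataset
    batch_size=args.batch_size,
    data_shape=(3, 300, 300),    # SSD input shape
    rand_crop=False,
    rand_mirror=False,
)
```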

It could be that there are no tuned schedule configs for your model. Do you get any warnings about fallback configs being used?

Yes, there are many warnings, one of them is listed below:

WARNING:autotvm:Cannot find config for target=llvm -mcpu=core-avx2, workload=('conv2d', (1, 3, 300, 300, 'int8'), (32, 3, 3, 3, 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW', 'int32'). A fallback configuration is used, which may bring great performance regression.

When I change target = llvm -mcpu=core-avx2 to target = llvm, the warnings still exist, and the quantized model shows an even greater performance regression. For FP32 weights, though, the performance is the best (still slower than the original MXNet implementation).

Or, how can I add the configs for target=llvm -mcpu=core-avx2 to avoid the performance regression?
Thanks

You should try tuning your model with the highest level of AVX extensions that your CPU supports (avx2 or avx512). Tuning tutorial: https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_x86.html
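
For reference, a minimal AutoTVM loop in the shape of that tutorial is sketched below; here `tasks` is whatever `autotvm.task.extract_from_graph` returns for your network, and the log file name is just an example:

```python
from tvm import autotvm

# How each candidate config is built and timed on the local machine.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1),
)

for task in tasks:  # tasks extracted from the network's conv2d workloads
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=min(1000, len(task.config_space)),
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("mobilenet_ssd.log")],
    )

# At compile time, apply the best configs found during tuning so the
# fallback warnings (and their performance regression) go away.
with autotvm.apply_history_best("mobilenet_ssd.log"):
    pass  # build the graph here, as evaluate.py normally does
```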

Yes, I have tuned my model; it is a little slow. I will try it out anyway. Thanks for your advice.

However, is it normal that the CPU usage is only ~2.0% during tuning? I have set TVM_NUM_THREADS to "1", and OMP_NUM_THREADS to "1" as well. I want to test the latency of the model using only one thread when deploying it.

If you have tuned your model, you should only see warnings for untuned operators, such as dense. 2.0% CPU usage would be expected if you have > 32 threads on your system. Note that in this case you should also set the environment variables before tuning.
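
A minimal sketch of that, assuming tuning is driven from a Python script:

```python
import os

# Set the thread limits before TVM creates its runtime thread pool, so
# the tuning measurements match the single-threaded deployment scenario.
os.environ["TVM_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import tvm  # import after the environment is configured
```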

Ok, thank you ~

Hi TriLoon,

Thanks for the details.

I meant the arguments to configure the script. By default the configuration is the following:

INFO:root:Namespace(batch_size=1, dtype_input='int8', dtype_output='int32', global_scale=8.0, log_interval=100, model='resnet18_v1', nbit_input=8, nbit_output=32, num_classes=1000, original=False, rec_val='~/.mxnet/datasets/imagenet/rec/val.rec', simulated=False, target='llvm')

Now I see that you changed mainly the llvm target config.

It would be nice if you could share your modified evaluate.py and the other files you are using, so I can reproduce your setup and test it on my side.

Thanks!

Sure, how can I share my files with you? How about email?

FYI, we have enabled MobileNet V2 and updated the data on a new VNNI-enabled machine (C5.12xlarge). Please refer to the link below.

Several SSD-based models are available in the GluonCV repo:

https://gluon-cv.mxnet.io/build/examples_deployment/int8_inference.html
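
For example, a pre-quantized model from that page can be loaded roughly like this (the model name below is illustrative; check the linked page for the exact int8 names in the zoo):

```python
import mxnet as mx
from gluoncv import model_zoo

# Illustrative int8 model name; see the linked page for the actual list.
net = model_zoo.get_model('ssd_300_vgg16_atrous_voc_int8', pretrained=True)
x = mx.nd.random.uniform(shape=(1, 3, 300, 300))
class_ids, scores, bboxes = net(x)  # standard GluonCV SSD outputs
```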

Hi, does TVM support int8 quantization on ARM?

Hi TriLoon,

Did you manage to get the quantized version on avx2 to run faster than the non-quantized version?

Thanks!

Sorry, I eventually turned to OpenVINO to speed up my models ~

I am not sure whether TVM can speed up detection models on CPU.