Benchmarking Quantization on Intel CPU

Which parameters are you using to run the evaluate.py script?

Do you mean the model I used?

It is my own MobileNet SSD based on GluonCV. I have merged the BatchNorm layers into the convolutions manually (verified, this step is correct), and I use MultiBoxPrior and MultiBoxDetection instead of the original implementation of the SSD detector head.

The configs of the script in my case (a rough sketch follows the list):

  • replaced args.target = 'llvm' with args.target = 'llvm -mcpu=core-avx2'
  • used my own .rec dataset and generated the dataloader object named eval_data, shown in my picture above.
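
Roughly, the two changes look like this (a minimal sketch; the .rec path is a placeholder, and the 300x300 shape matches my SSD input):

```python
import mxnet as mx

# Change 1: target the AVX2-capable CPU instead of plain 'llvm'.
args.target = 'llvm -mcpu=core-avx2'

# Change 2: build eval_data from a custom .rec file.
eval_data = mx.io.ImageRecordIter(
    path_imgrec='my_val.rec',    # placeholder path to the custom dataset
    batch_size=args.batch_size,
    data_shape=(3, 300, 300),    # SSD input shape
    rand_crop=False,
    rand_mirror=False,
)
```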

It could be that there are no tuned schedule configs for your model. Do you get any warnings about fallback configs being used?

Yes, there are many warnings, one of them is listed below:

WARNING:autotvm:Cannot find config for target=llvm -mcpu=core-avx2, workload=('conv2d', (1, 3, 300, 300, 'int8'), (32, 3, 3, 3, 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW', 'int32'). A fallback configuration is used, which may bring great performance regression.

When I change target = llvm -mcpu=core-avx2 to target = llvm, the warnings still exist, and the quantized model shows an even greater performance regression. For FP32 weights, though, the performance is the best (still slower than the original MXNet implementation).

Or, how can I add the configs for target=llvm -mcpu=core-avx2 to avoid the performance regression?
Thanks

You should try tuning your model with the highest level of AVX extensions that your CPU supports (avx2 or avx512). Tuning tutorial: https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_x86.html
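
For reference, a minimal AutoTVM loop in the shape of that tutorial is sketched below; here `tasks` is whatever `autotvm.task.extract_from_graph` returns for your network, and the log file name is just an example:

```python
from tvm import autotvm

# How each candidate config is built and timed on the local machine.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1),
)

for task in tasks:  # tasks extracted from the network's conv2d workloads
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=min(1000, len(task.config_space)),
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("mobilenet_ssd.log")],
    )

# At compile time, apply the best configs found during tuning so the
# fallback warnings (and their performance regression) go away.
with autotvm.apply_history_best("mobilenet_ssd.log"):
    pass  # build the graph here, as evaluate.py normally does
```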

Yes, I have tuned my model; it is a little slow. I will try it out anyway. Thanks for your advice.

However, is it normal that the CPU usage is only ~2.0% during tuning? I have set TVM_NUM_THREADS to "1", and OMP_NUM_THREADS to "1" as well. I want to test the latency of the model using only one thread when deploying it.

If you have tuned your model, you should only see warnings for untuned operators, such as dense. 2.0% CPU usage would be expected if you have > 32 threads on your system. Note that in this case you should also set the environment variables before tuning.
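
A minimal sketch of that, assuming tuning is driven from a Python script:

```python
import os

# Set the thread limits before TVM creates its runtime thread pool, so
# the tuning measurements match the single-threaded deployment scenario.
os.environ["TVM_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import tvm  # import after the environment is configured
```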

Ok, thank you ~

Hi TriLoon,

Thanks for the details.

I meant the arguments to configure the script. By default the configuration is the following:

INFO:root:Namespace(batch_size=1, dtype_input='int8', dtype_output='int32', global_scale=8.0, log_interval=100, model='resnet18_v1', nbit_input=8, nbit_output=32, num_classes=1000, original=False, rec_val='~/.mxnet/datasets/imagenet/rec/val.rec', simulated=False, target='llvm')

Now I see that you changed mainly the llvm target config.

It would be nice if you could share your modified evaluate.py and the other files you are using, so I can reproduce your setup and test it on my side.

Thanks!

Sure, how can I share my files with you? How about email?

FYI, we have enabled MobileNet V2 and updated the data on a new VNNI-enabled machine (C5.12xlarge). Please refer to the link below.

Several SSD-based models are available in the GluonCV repo:

https://gluon-cv.mxnet.io/build/examples_deployment/int8_inference.html
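
For example, a pre-quantized model from that page can be loaded roughly like this (the model name below is illustrative; check the linked page for the exact int8 names in the zoo):

```python
import mxnet as mx
from gluoncv import model_zoo

# Illustrative int8 model name; see the linked page for the actual list.
net = model_zoo.get_model('ssd_300_vgg16_atrous_voc_int8', pretrained=True)
x = mx.nd.random.uniform(shape=(1, 3, 300, 300))
class_ids, scores, bboxes = net(x)  # standard GluonCV SSD outputs
```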

Hi, does TVM support int8 quantization on ARM?

Hi TriLoon,

Did you manage to get the quantized version on avx2 to run faster than the non-quantized version?

Thanks!

Sorry, I eventually turned to OpenVINO to speed up my models ~

I am not sure whether TVM can speed up detection models on CPU.