I’m comparing the same network in two different implementations: the first is an executable written directly in MKLDNN C++ code; the second is vanilla Python PyTorch, exported to ONNX, run through AutoTVM, and executed as a generated C++ binary.
The network has several convolutional layers and the input is a standard image size. In both cases I run on only a single core of an x86 server.
Does it make sense that my MKLDNN implementation is running 70x faster?
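For reference, the TVM side follows the usual ONNX-import-and-build flow, roughly like the sketch below (the model file name, input tensor name, shape, and `-mcpu` flag are placeholders rather than my exact setup; TVM is pinned to a single thread to match the MKLDNN run):

```python
import os
os.environ["TVM_NUM_THREADS"] = "1"   # single-core run; set before TVM starts its thread pool

import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor   # named graph_runtime in older TVM releases

onnx_model = onnx.load("model.onnx")                  # placeholder path
shape_dict = {"input": (1, 3, 224, 224)}              # must match the ONNX graph's input name/shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

target = "llvm -mcpu=core-avx2"                       # placeholder; match the actual server CPU
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))

# Time the whole graph with TVM's built-in evaluator
ftimer = module.module.time_evaluator("run", dev, number=100, repeat=3)
print("mean inference time: %.3f ms" % (ftimer().mean * 1e3))
```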
Hi Gushu, I think 70x is definitely not a reasonable result. I compared ResNet-50 fp32 performance between a heavily MKLDNN-optimized TF build and TVM, and MKLDNN was only about 1.5x faster.
And I still see lots of warnings in my AutoTVM log, such as:
WARNING:autotvm:Cannot find config for target=llvm -device=tracing, workload=('conv2d_NCHWc.x86', ('TENSOR', (128, 512, 7, 7), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
So I think my AutoTVM tuning can be further improved, but I currently don't know how to do it.
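In case it helps anyone spot what I'm missing: my understanding is that the standard x86 flow extracts the conv2d tasks, tunes each one, and then rebuilds inside `apply_history_best` so the fallback warnings go away. A rough sketch of that flow (the target string, log file name, and trial budget are illustrative placeholders, not my exact settings):

```python
import tvm
from tvm import relay, autotvm
from tvm.autotvm.tuner import XGBTuner

# mod, params come from relay.frontend.from_onnx(...) as in the original post
target = "llvm -mcpu=core-avx2"     # placeholder; must match the target used at build time
log_file = "conv2d_tuning.log"      # placeholder

# Extract the tunable conv2d tasks from the Relay program
tasks = autotvm.task.extract_from_program(
    mod["main"], target=target, params=params,
    ops=(relay.op.get("nn.conv2d"),))

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1, min_repeat_ms=100))

for i, task in enumerate(tasks):
    print("Tuning task %d/%d: %s" % (i + 1, len(tasks), task.workload))
    tuner = XGBTuner(task, loss_type="rank")
    n_trial = min(1000, len(task.config_space))   # placeholder trial budget
    tuner.tune(n_trial=n_trial,
               measure_option=measure_option,
               callbacks=[autotvm.callback.progress_bar(n_trial),
                          autotvm.callback.log_to_file(log_file)])

# Rebuild inside apply_history_best so the tuned configs are actually picked up;
# otherwise the "Cannot find config ... fallback" warnings reappear.
with autotvm.apply_history_best(log_file):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```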