Hello! I’m reading the following paper published at ATC '19.
I have a few questions here: is the speedup of NeoCPU over other frameworks like MXNet, TF, etc. mainly due to kernel tuning or graph tuning? If the speedup mainly comes from graph tuning, how does the performance of the auto-tuned CONV layers compare to the proprietary libraries behind the other frameworks, say, MKL-DNN?
Thanks in advance! @yzhliu @kevinthesun
Any thoughts here? Thanks!
@moderato The performance advantage comes from joint optimization at both the kernel and the graph level. For conv2d kernels, MKL-DNN outperforms TVM in many cases. However, with graph tuning we can achieve better e2e performance by carefully arranging the data layouts.
I see. Graph tuning inserts data layout transformations into some or all layers, which adds to TVM's e2e latency. Given that TVM is slower than MKL-DNN for many conv2d kernels, how can TVM still achieve a better e2e runtime? Am I missing some other kind of latency that MKL-DNN incurs here?
Graph tuning balances the trade-off between data layout transformation overhead and picking the kernel with the fastest schedule. Also, MKL-DNN has to be integrated into a deep learning framework to execute a whole NN, and that integration introduces its own overheads, such as data layout transformations. Another major overhead comes from breaking fusion.
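To make the trade-off concrete, here is a toy sketch in plain Python (not TVM's actual graph tuner; the layouts, per-layer kernel times, and transform costs are all made-up numbers). It shows why the layer-wise fastest kernel is not always the end-to-end best choice once layout transform costs are counted.

```python
# Toy illustration of the graph-tuning trade-off (not TVM's graph tuner).
# Each conv layer has candidate data layouts; the fastest kernel for a layer
# may require a layout that forces an expensive transform from the previous
# layer, so the best choice is found over the whole chain, not per layer.

# Hypothetical per-layer kernel times (ms) for each candidate layout.
kernel_time = [
    {"NCHW16c": 0.50, "NCHW8c": 0.42},   # conv1: NCHW8c kernel is locally faster
    {"NCHW16c": 0.60, "NCHW8c": 0.65},   # conv2: NCHW16c kernel is locally faster
    {"NCHW16c": 0.55, "NCHW8c": 0.62},   # conv3
]

# Hypothetical cost (ms) of transforming data between two layouts.
def transform_cost(src, dst):
    return 0.0 if src == dst else 0.20

def best_schedule(kernel_time):
    """Dynamic programming over the layer chain: for each layer and layout,
    keep the cheapest cumulative cost including layout transforms."""
    best = dict(kernel_time[0])
    choice = {l: [l] for l in kernel_time[0]}
    for layer in kernel_time[1:]:
        new_best, new_choice = {}, {}
        for layout, t in layer.items():
            prev = min(best, key=lambda p: best[p] + transform_cost(p, layout))
            new_best[layout] = best[prev] + transform_cost(prev, layout) + t
            new_choice[layout] = choice[prev] + [layout]
        best, choice = new_best, new_choice
    end = min(best, key=best.get)
    return best[end], choice[end]

total, layouts = best_schedule(kernel_time)
print(f"e2e time: {total:.2f} ms, layouts: {layouts}")
# Greedily picking the locally fastest kernel per layer would insert a layout
# transform between conv1 and conv2 (0.42 + 0.20 + 0.60 + 0.55 = 1.77 ms);
# the global optimum keeps NCHW16c end-to-end for 1.65 ms.
```

With these made-up numbers, a slightly slower conv1 kernel wins overall because it avoids a layout transform, which is exactly the kind of decision graph tuning makes across the whole network.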
Can you explain what you mean by “breaking fusion”?
Integrating MKL-DNN into a deep learning framework requires special handling for the fusion of those operators that will be accelerated by MKL-DNN kernels, sometimes even pattern matching whole NN blocks. It’s quite difficult to develop a general fusion rule in this situation.
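To illustrate what breaking fusion means, here is a minimal sketch in plain NumPy (not MKL-DNN's actual API; the shapes and the `conv1x1` stand-in are made up). When conv is offloaded to a black-box library call, the bias-add and ReLU that a compiler would normally fuse into the conv loop have to run as separate passes over memory.

```python
import numpy as np

# Toy illustration of how offloading conv to an external library can break
# fusion. conv1x1 is a made-up stand-in for a library conv call.

def conv1x1(x, w):
    # Black-box "library" conv (a 1x1 convolution via einsum here): it only
    # returns the conv result, which must be materialized in memory before
    # any follow-up op can run.
    return np.einsum("nchw,kc->nkhw", x, w)

x = np.random.rand(1, 64, 56, 56).astype("float32")
w = np.random.rand(64, 64).astype("float32")
b = np.random.rand(64).astype("float32")

# Library-offload style: conv -> bias -> ReLU run as three separate passes,
# each reading and writing the whole feature map from/to memory.
y = conv1x1(x, w)
y = y + b.reshape(1, -1, 1, 1)
y = np.maximum(y, 0.0)

# A compiler that generates the conv itself can instead apply bias and ReLU
# while the conv output tile is still in registers/cache, so the intermediate
# tensor never hits memory. Fusing across a library call is only possible if
# the library happens to expose a matching fused kernel (e.g. conv+relu),
# which is why integration needs the special-case pattern matching above.
```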
I see. Thanks for the reply!