I compared two similar BERT models running on CPU with TVM: one is a PyTorch model, the other an MXNet model. Because of the large performance difference, I did some profiling. The results show that the run time of the same operation (matmul) with the same workload differs greatly between the two models.
ENV:
- TVM: built with MKL.
- Intel CPU
- OpenMP:
KMP_AFFINITY=compact,1,0 OMP_NUM_THREADS=24
Model inference time:
# MXNet model
TVM Mean inference time: 5.53 ms
# PyTorch model
TVM Mean inference time: 23.05 ms
Profiling result:
# MXNet model
Node Name              Ops                   Time(us)  Time(%)  Shape      Inputs  Outputs
---------              ---                   --------  -------  -----      ------  -------
fused_nn_dense_add_15  fused_nn_dense_add_1  308.926   5.58     (32, 768)  3       1
fused_nn_dense_add_11  fused_nn_dense_add_1  307.277   5.551    (32, 768)  3       1
# PyTorch Model
Node Name              Ops                   Time(us)  Time(%)  Shape      Inputs  Outputs
---------              ---                   --------  -------  -----      ------  -------
fused_nn_dense_add_3   fused_nn_dense_add_3  1783.75   7.631    (32, 768)  3       1
fused_nn_dense_add_31  fused_nn_dense_add_3  1593.08   6.815    (32, 768)  3       1
IR code (the same for the PyTorch model and the MXNet model):
attr [0] "compute_scope" = "fused_nn_dense_add_3_compute_";
attr [C: handle] "storage_scope" = "global";
allocate(C, float32, [24576]) {
  attr [0] "extern_scope" = 0;
  @tir.tvm_call_packed("tvm.contrib.cblas.matmul", @tir.tvm_stack_make_array(placeholder, @tir.tvm_stack_make_shape(32, 3072, dtype=handle), 0, 2, 0f32, 0, dtype=handle), @tir.tvm_stack_make_array(placeholder_1, @tir.tvm_stack_make_shape(768, 3072, dtype=handle), 0, 2, 0f32, 0, dtype=handle), @tir.tvm_stack_make_array(C, @tir.tvm_stack_make_shape(32, 768, dtype=handle), 0, 2, 0f32, 0, dtype=handle), False, True, dtype=int32)
  for (ax0: int32, 0, 32) "parallel" {
    for (ax1: int32, 0, 768) {
      T_add[((ax0*768) + ax1)] = ((float32*)C[((ax0*768) + ax1)] + (float32*)placeholder_2[ax1])
    }
  }
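To put the two timings on a common scale, here is a rough sketch that converts them into effective throughput. It assumes that both profiled nodes run exactly the workload in the IR above, (32, 3072) x (768, 3072)^T, and it counts only the matmul FLOPs. The measured time also includes the bias-add loop, so the real matmul throughput is somewhat higher than these figures.

```python
# Shapes read from the tvm.contrib.cblas.matmul call in the IR:
# A is (32, 3072), B is (768, 3072) and is transposed, C is (32, 768).
M, K, N = 32, 3072, 768

# One multiply and one add per (m, n, k) triple.
flops = 2 * M * N * K

# Per-node times in microseconds, copied from the profiling tables.
timings_us = {
    "MXNet fused_nn_dense_add_15": 308.926,
    "PyTorch fused_nn_dense_add_3": 1783.75,
}


def gflops(time_us):
    """Effective throughput in GFLOP/s for one call taking time_us microseconds."""
    return flops / (time_us * 1e-6) / 1e9


for name, time_us in timings_us.items():
    print(f"{name}: {gflops(time_us):.1f} GFLOP/s")

# Ratio of the two per-node times for the same workload.
slowdown = (
    timings_us["PyTorch fused_nn_dense_add_3"]
    / timings_us["MXNet fused_nn_dense_add_15"]
)
print(f"per-node slowdown: {slowdown:.2f}x")
```

Under those assumptions, the same BLAS call reaches several times lower throughput in the PyTorch-derived module, which is hard to explain by the generated code alone since the IR is identical.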
With OMP_NUM_THREADS=1, the inference time of the two models is the same, so the problem seems to be related to multithreading.
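One way to narrow this down is to repeat the measurement for several thread counts rather than only 1 and 24. The sketch below is a hypothetical harness, not part of TVM: `run_with_threads` and `sweep` are names made up for illustration, and the child command is a placeholder that only echoes the variable. In practice it would be replaced by whatever script loads the compiled module and prints its mean inference time.

```python
import os
import subprocess
import sys
import time


def run_with_threads(cmd, num_threads, affinity="compact,1,0"):
    """Run cmd once with the given OpenMP settings; return (seconds, stdout)."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(num_threads)
    env["KMP_AFFINITY"] = affinity  # same affinity setting as in the post
    start = time.perf_counter()
    result = subprocess.run(
        cmd, env=env, capture_output=True, text=True, check=True
    )
    return time.perf_counter() - start, result.stdout.strip()


def sweep(cmd, thread_counts):
    """Run cmd once per thread count; map thread count -> (seconds, stdout)."""
    return {n: run_with_threads(cmd, n) for n in thread_counts}


if __name__ == "__main__":
    # Placeholder child process (assumption): swap in the real benchmark command.
    cmd = [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
    for n, (seconds, out) in sweep(cmd, [1, 2, 4, 8, 12, 24]).items():
        print(f"OMP_NUM_THREADS={n}: child printed {out!r}, wall time {seconds:.3f} s")
```

The wall time measured here includes process startup, so the number to compare is the mean inference time printed by the benchmark itself. Running each thread count in a fresh process matters because OpenMP runtimes typically read these variables once at startup.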
What may cause the difference?
Refer to: https://github.com/apache/incubator-tvm/issues/6354