Performance of the same op and workload varies between models

nolanliou · August 31, 2020, 2:33am

I compared two similar BERT models running on CPU with TVM: one is a PyTorch model, the other an MXNet model. Because the performance gap was large, I did some profiling. The results show that the run time of the same operation (matmul) with the same workload differs greatly between the two models.

ENV:

TVM: built with MKL.
Intel CPU
OpenMP: KMP_AFFINITY=compact,1,0 OMP_NUM_THREADS=24

Model inference time:

# mxnet model
TVM Mean inference time: 5.53 ms
# pytorch model
TVM Mean inference time: 23.05 ms

Profiling result:

# MXNet model
Node Name              Ops                   Time(us)  Time(%)  Shape      Inputs  Outputs
---------              ---                   --------  -------  -----      ------  -------
fused_nn_dense_add_15  fused_nn_dense_add_1  308.926   5.58     (32, 768)  3       1
fused_nn_dense_add_11  fused_nn_dense_add_1  307.277   5.551    (32, 768)  3       1

# PyTorch Model
Node Name              Ops                   Time(us)  Time(%)  Shape      Inputs  Outputs
---------              ---                   --------  -------  -----      ------  -------
fused_nn_dense_add_3   fused_nn_dense_add_3  1783.75   7.631    (32, 768)  3       1
fused_nn_dense_add_31  fused_nn_dense_add_3  1593.08   6.815    (32, 768)  3       1
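
To put a number on the gap, here is a small sketch that compares the per-op times from the two tables with the end-to-end times reported earlier. The values are copied from this post; averaging the two listed rows is just a convenience, not how the profiler aggregates.

```python
# Per-op times in microseconds, copied from the profiling tables above.
mxnet_us = [308.926, 307.277]
pytorch_us = [1783.75, 1593.08]

# End-to-end mean inference times in milliseconds, from the earlier listing.
mxnet_total_ms = 5.53
pytorch_total_ms = 23.05


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# How much slower the PyTorch-derived model is, per op and end to end.
op_slowdown = mean(pytorch_us) / mean(mxnet_us)
total_slowdown = pytorch_total_ms / mxnet_total_ms

print(f"dense+add slowdown:  {op_slowdown:.2f}x")
print(f"end-to-end slowdown: {total_slowdown:.2f}x")
```

The per-op slowdown comes out somewhat larger than the end-to-end one, which suggests these dense ops account for a big share of the gap.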

IR code (identical for the PyTorch and MXNet models):

  attr [0] "compute_scope" = "fused_nn_dense_add_3_compute_";
  attr [C: handle] "storage_scope" = "global";
  allocate(C, float32, [24576]) {
    attr [0] "extern_scope" = 0;
    @tir.tvm_call_packed("tvm.contrib.cblas.matmul", @tir.tvm_stack_make_array(placeholder, @tir.tvm_stack_make_shape(32, 3072, dtype=handle), 0, 2, 0f32, 0, dtype=handle), @tir.tvm_stack_make_array(placeholder_1, @tir.tvm_stack_make_shape(768, 3072, dtype=handle), 0, 2, 0f32, 0, dtype=handle), @tir.tvm_stack_make_array(C, @tir.tvm_stack_make_shape(32, 768, dtype=handle), 0, 2, 0f32, 0, dtype=handle), False, True, dtype=int32)
    for (ax0: int32, 0, 32) "parallel" {
      for (ax1: int32, 0, 768) {
        T_add[((ax0*768) + ax1)] = ((float32*)C[((ax0*768) + ax1)] + (float32*)placeholder_2[ax1])
      }
    }
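
For reference, this is what I understand the fused op to compute, as a plain-Python sketch. I am reading the trailing `False, True` in the packed call as the transa/transb flags, so the second operand is used transposed. The helper names `dense_add` and `gflops` are made up for illustration. The shapes are the ones in the IR, and the FLOP count assumes one multiply and one add per inner-product term.

```python
def dense_add(a, w, bias):
    """Reference for matmul(a, w^T) + bias using plain Python lists.

    a is M x K, w is N x K (used transposed, like transb=True),
    bias has length N and is broadcast over the rows of the result.
    """
    return [
        [sum(x * y for x, y in zip(row, w_row)) + b for w_row, b in zip(w, bias)]
        for row in a
    ]


# Tiny sanity check with M=2, K=3, N=2.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [10.0, 20.0]
out = dense_add(a, w, bias)

# Shapes from the IR: (32, 3072) x (768, 3072)^T -> (32, 768).
M, K, N = 32, 3072, 768
flops = 2 * M * K * N  # multiply + add per inner-product term


def gflops(time_us):
    """Effective matmul throughput for a run time given in microseconds."""
    return flops / (time_us * 1e-6) / 1e9


print(f"output elements: {M * N}")  # should match the allocate size in the IR
print(f"MXNet-derived op:   {gflops(308.926):.1f} GFLOP/s")
print(f"PyTorch-derived op: {gflops(1783.75):.1f} GFLOP/s")
```

Since the work is identical, the difference in effective throughput has to come from how the call executes, not from the computation itself.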

However, with OMP_NUM_THREADS=1 the inference time of the two models is the same, so this looks like a multithreading problem.
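
One way to narrow this down is to sweep the thread count and see where the two models start to diverge. A minimal sketch, assuming the benchmark can be launched as a separate command: each run happens in a fresh child process so that the OpenMP runtime picks up the setting at startup. The command below is only a stand-in that echoes the variable, and `bench_bert.py` is a hypothetical script name.

```python
import os
import subprocess
import sys


def run_with_threads(cmd, num_threads):
    """Run cmd in a child process with OMP_NUM_THREADS set to num_threads."""
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads))
    result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Stand-in command that only echoes the setting. Replace it with the real
# benchmark invocation, e.g. [sys.executable, "bench_bert.py"] (hypothetical).
echo_cmd = [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]

# Thread counts up to the 24 used in the original runs.
results = {n: run_with_threads(echo_cmd, n) for n in (1, 2, 4, 8, 12, 24)}
for n, out in results.items():
    print(f"OMP_NUM_THREADS={n}: {out}")
```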

What may cause the difference?

Refer to: https://github.com/apache/incubator-tvm/issues/6354

jcf94 · August 31, 2020, 6:12am

However, with OMP_NUM_THREADS=1 the inference time of the two models is the same, so this looks like a multithreading problem.

Is it possible that there is a thread-related limitation in your PyTorch script?

nolanliou · August 31, 2020, 6:31am

There are no thread-related ops. Besides, running with multiple threads is faster than running with a single thread.