Performance of the same op and workload varies greatly between models

I compared two similar BERT models running on CPU with TVM: one converted from PyTorch, the other from MXNet. Because of the large performance gap, I did some profiling. The result shows that the run time of the same operation (matmul) with the same workload varies a lot between the two models.

ENV:

  1. TVM: built with MKL.
  2. Intel CPU
  3. OpenMP: KMP_AFFINITY=compact,1,0 OMP_NUM_THREADS=24
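
For reference, a minimal sketch (Python, names are illustrative) of how these variables are applied in my runs: they are set in the environment before the OpenMP/MKL runtime is initialized, i.e. before importing TVM.

import os

# Set before importing TVM / MKL so the OpenMP runtime picks them up.
os.environ["KMP_AFFINITY"] = "compact,1,0"
os.environ["OMP_NUM_THREADS"] = "24"

import tvm  # imported after the environment is configured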

Model inference time:

# MXNet model
TVM Mean inference time: 5.53 ms
# PyTorch model
TVM Mean inference time: 23.05 ms
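
A minimal sketch of how these numbers are measured, using the graph runtime API of the TVM version from around this time (graph, lib, params come from relay.build; inputs is a placeholder dict of BERT input arrays; newer TVM versions use tvm.contrib.graph_executor instead):

import tvm
from tvm.contrib import graph_runtime

ctx = tvm.cpu(0)
m = graph_runtime.create(graph, lib, ctx)   # graph/lib/params from relay.build
m.set_input(**params)
for name, value in inputs.items():          # inputs: placeholder dict of BERT inputs
    m.set_input(name, value)

ftimer = m.module.time_evaluator("run", ctx, repeat=3, number=10)
prof_res = ftimer()
print("TVM Mean inference time: %.2f ms" % (prof_res.mean * 1000))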

Profiling result:

# MXNet model
Node Name              Ops                   Time(us)  Time(%)  Shape      Inputs  Outputs
---------------------  --------------------  --------  -------  ---------  ------  -------
fused_nn_dense_add_15  fused_nn_dense_add_1  308.926   5.58     (32, 768)  3       1
fused_nn_dense_add_11  fused_nn_dense_add_1  307.277   5.551    (32, 768)  3       1

# PyTorch model
Node Name              Ops                   Time(us)  Time(%)  Shape      Inputs  Outputs
---------------------  --------------------  --------  -------  ---------  ------  -------
fused_nn_dense_add_3   fused_nn_dense_add_3  1783.75   7.631    (32, 768)  3       1
fused_nn_dense_add_31  fused_nn_dense_add_3  1593.08   6.815    (32, 768)  3       1
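
The per-node breakdown above is produced with the debug runtime; a sketch under the same assumptions as before (graph, lib, params and inputs are placeholders; newer TVM versions renamed this module to debug_executor):

from tvm.contrib.debugger import debug_runtime

import tvm

ctx = tvm.cpu(0)
dbg = debug_runtime.create(graph, lib, ctx)
dbg.set_input(**params)
for name, value in inputs.items():
    dbg.set_input(name, value)
dbg.run()   # prints the per-node table (Ops, Time(us), Time(%), Shape, ...)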

IR code (identical for the PyTorch and MXNet models):

  attr [0] "compute_scope" = "fused_nn_dense_add_3_compute_";
  attr [C: handle] "storage_scope" = "global";
  allocate(C, float32, [24576]) {
    attr [0] "extern_scope" = 0;
    @tir.tvm_call_packed("tvm.contrib.cblas.matmul", @tir.tvm_stack_make_array(placeholder, @tir.tvm_stack_make_shape(32, 3072, dtype=handle), 0, 2, 0f32, 0, dtype=handle), @tir.tvm_stack_make_array(placeholder_1, @tir.tvm_stack_make_shape(768, 3072, dtype=handle), 0, 2, 0f32, 0, dtype=handle), @tir.tvm_stack_make_array(C, @tir.tvm_stack_make_shape(32, 768, dtype=handle), 0, 2, 0f32, 0, dtype=handle), False, True, dtype=int32)
    for (ax0: int32, 0, 32) "parallel" {
      for (ax1: int32, 0, 768) {
        T_add[((ax0*768) + ax1)] = ((float32*)C[((ax0*768) + ax1)] + (float32*)placeholder_2[ax1])
      }
    }
  }
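
The packed call to tvm.contrib.cblas.matmul in the IR shows that nn.dense is offloaded to MKL through the cblas contrib. A sketch of the build that produces this (the -mcpu value is an assumption, and the return value of relay.build differs across TVM versions; mod and params are the imported BERT model):

import tvm
from tvm import relay

# -libs=cblas routes nn.dense to tvm.contrib.cblas.matmul when TVM is built with MKL/CBLAS
target = "llvm -mcpu=skylake-avx512 -libs=cblas"
with tvm.transform.PassContext(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)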

However, when setting OMP_NUM_THREADS=1, the inference times of the two models are the same, so it seems to be a problem with multiple threads.

What might be causing the difference?

Refer to: https://github.com/apache/incubator-tvm/issues/6354


However, when setting OMP_NUM_THREADS=1, the inference times of the two models are the same, so it seems to be a problem with multiple threads.

Could there be any thread-related limitation in your PyTorch script?
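
For example (an assumption about what such a limitation could look like): if the script that traces the PyTorch model runs in the same process and caps the intra-op thread count, that setting may also affect MKL's OpenMP threads used by TVM in that process.

import torch

print(torch.get_num_threads())   # intra-op threads; a low value here may also limit MKL in-process
# torch.set_num_threads(1)       # a call like this in the export script could cause such a cap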

There are no thread-related ops. Besides, running with multiple threads is still faster than with a single thread.