Good use of the C++ TVM Runtime

I am having trouble profiling my model by following the PAPI tutorial. As soon as I get it working, I’ll post the results here.

Any help is welcome.

EDIT: I opened a new thread here for this profiling-specific problem.

EDIT 2:

I managed to get the profiler working. This is the first output for my NN:

Name                                                                                     Duration (us)  Percent  Device  Count                                                              Argument Shapes              Hash  VM::Argument Shapes  perf::CACHE-MISSES  perf::CYCLES  perf::INSTRUCTIONS  perf::STALLED-CYCLES-BACKEND  perf::STALLED-CYCLES-FRONTEND  weight_layout  
vm_mod_fused_reshape_add_exp_subtract_nn_relu_multiply_nn_relu_add_multiply_add_reshape       1,579.71    50.61    cpu0      5  float32[3000, 20], float32[20], float32[20], float32[20], float32[3000, 20]  bb23e7419c2e3451                                    3,330     5,278,160          18,092,178                     2,486,365                         11,025                 
vm_mod_fused_nn_contrib_dense_pack_1                                                            408.29    13.08    cpu0      4                     float32[3000, 20], float32[1, 20, 20], float32[3000, 20]  c6382300161e7b61                                    4,120     1,379,177           5,025,882                       866,665                          5,082          NC20n  
vm_mod_fused_nn_contrib_dense_pack_2                                                             42.32     1.36    cpu0      1                       float32[3000, 20], float32[1, 20, 8], float32[3000, 8]  939e1bd5242e3085                                      764       145,090             501,736                        92,334                          2,607           NC8n  
VM::AllocStorage                                                                                 13.74     0.44    cpu0     12                                                                                                                                     841        85,172              69,685                         8,946                         26,549                 
vm_mod_fused_nn_contrib_dense_pack                                                               12.02     0.39    cpu0      1                       float32[3000, 2], float32[1, 2, 20], float32[3000, 20]  9dc8805547517187                                      598        43,390             133,645                        23,230                            561          NC20n  
VM::AllocTensor                                                                                   9.14     0.29    cpu0     10                                                            float32[3000, 20]                                                        988        57,520              63,576                         4,745                          9,389                 
VM::UnknownOp                                                                                     8.71     0.28    cpu0     25                                                                                                                                   2,988        89,378              98,315                         5,586                          9,086                 
vm_mod_fused_reshape_add                                                                          4.47     0.14    cpu0      1                            float32[3000, 8], float32[8], float32[3000, 1, 8]  69c860389db3a274                                      512        17,349              35,091                         8,236                            329                 
VM::ReshapeTensor                                                                                 4.11     0.13    cpu0      1                                                                                                                                     329        27,531               5,332                        10,114                          8,889                 
VM::AllocTensor                                                                                   1.51     0.05    cpu0      1                                                             float32[3000, 8]                                                         84         9,099               7,139                         1,798                          2,546                 
VM::AllocTensor                                                                                   0.81     0.03    cpu0      1                                                          float32[3000, 1, 8]                                                         77         5,173               7,516                           192                            200                 
----------                                                                                                                                                                                                                                                                                                                                                                            
Sum                                                                                           2,084.83    66.79             62                                                                                                                                  14,631     7,137,039          24,040,095                     3,508,211                         76,263                 
Total                                                                                         3,121.32             cpu0      1                                                                                                                                  40,550     9,801,638          25,860,685                     4,808,036                        439,378                 

Configuration
-------------
Number of threads: 1
Executor: VM
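
The report above has 11 rows, but they are not all the same kind of thing. Rows prefixed with VM:: are virtual machine instructions (allocations, reshapes), while the vm_mod_fused_* rows are compiled kernels. As far as I understand, only tunable operators such as dense become AutoTVM tasks, so the row count and the task count need not match. Here is a minimal sketch that sorts the rows into those groups; the row names are copied from the report, and the helper name classify_rows and the group labels are made up for illustration.

```python
# Row names copied from the first column of the profiler report above.
ROW_NAMES = [
    "vm_mod_fused_reshape_add_exp_subtract_nn_relu_multiply_nn_relu_add_multiply_add_reshape",
    "vm_mod_fused_nn_contrib_dense_pack_1",
    "vm_mod_fused_nn_contrib_dense_pack_2",
    "VM::AllocStorage",
    "vm_mod_fused_nn_contrib_dense_pack",
    "VM::AllocTensor",
    "VM::UnknownOp",
    "vm_mod_fused_reshape_add",
    "VM::ReshapeTensor",
    "VM::AllocTensor",
    "VM::AllocTensor",
]


def classify_rows(names):
    """Group report rows into VM instructions, dense kernels and other kernels.

    Hypothetical helper: the grouping rule (name prefix / substring) is an
    assumption based on how the rows are named in this particular report.
    """
    groups = {"vm_ops": [], "dense_kernels": [], "other_kernels": []}
    for name in names:
        if name.startswith("VM::"):
            # Bookkeeping done by the VM executor itself, not a compiled kernel.
            groups["vm_ops"].append(name)
        elif "dense" in name:
            # Fused kernels built around a dense operator.
            groups["dense_kernels"].append(name)
        else:
            # Remaining fused kernels (elementwise / reshape chains here).
            groups["other_kernels"].append(name)
    return groups


groups = classify_rows(ROW_NAMES)
for label, rows in groups.items():
    print(f"{label}: {len(rows)}")
```

On this report, that gives 6 VM instruction rows, 3 dense kernels and 2 other fused kernels.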

A few important things:

  • I could not profile on the CPU I am supposed to run on (the Intel Xeon Platinum) because of a PAPI error: PAPIError: -7 Event does not exist: perf::STALLED-CYCLES-FRONTEND. So I ran the profiler on another CPU to get a rough idea of the performance.

  • I am not sure I fully understand the task counts, though: when building with MKL, AutoTVM detects 12 tasks to optimize; without MKL, it finds 6. In the profiler output above, I count 11 tasks. Maybe I am misunderstanding something.

  • To profile only my optimized model, I used with autotvm.apply_history_best, then built my lib inside it with lib = relay.build(mod, target=target, params=params) so that I get the optimized version. Then I update my params with params = lib.get_params(), taken from the optimized lib. Does this flow seem correct?
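
For reference, the build flow described in the last bullet can be sketched as follows. This is only a sketch of that flow, not a confirmed answer to the question. The function name build_with_tuning_log and the log_file argument are hypothetical, and the PassContext with opt_level=3 is an assumption borrowed from the usual AutoTVM tutorials rather than something stated above.

```python
def build_with_tuning_log(mod, params, target, log_file):
    """Compile a Relay module using the best schedules from an AutoTVM log.

    Hypothetical helper: `log_file` is the path to the tuning records
    written during the AutoTVM tuning run.
    """
    # Imported inside the function so this sketch can be loaded
    # even on a machine where TVM is not installed.
    import tvm
    from tvm import autotvm, relay

    # The tuned schedules are looked up while the model is compiled,
    # so relay.build has to run inside the apply_history_best context.
    with autotvm.apply_history_best(log_file):
        # Assumption: opt_level=3, as in the standard tuning tutorials.
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(mod, target=target, params=params)

    # Parameters as stored in the compiled library.
    tuned_params = lib.get_params()
    return lib, tuned_params
```

It would be called with the Relay module and params from the frontend, the same target string used during tuning, and the path to the tuning log.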

EDIT 3: I set TVM_NUM_THREADS=8 to see if there was any difference. I get a 4x speedup over the poor performance I had before. That is much better, but still very slow compared to cppflow. I am going to keep exploring this lead.