I am having trouble profiling my model with the PAPI tutorial. As soon as I get it working, I'll post the results here.
Any help is welcome.
EDIT: I am opening a new thread here for this profiling-specific problem.
EDIT 2:
I managed to get the profiler working. This is the first output for my NN:

| Name | Duration (us) | Percent | Device | Count | Argument Shapes | Hash | VM::Argument Shapes | perf::CACHE-MISSES | perf::CYCLES | perf::INSTRUCTIONS | perf::STALLED-CYCLES-BACKEND | perf::STALLED-CYCLES-FRONTEND | weight_layout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vm_mod_fused_reshape_add_exp_subtract_nn_relu_multiply_nn_relu_add_multiply_add_reshape | 1,579.71 | 50.61 | cpu0 | 5 | float32[3000, 20], float32[20], float32[20], float32[20], float32[3000, 20] | bb23e7419c2e3451 | | 3,330 | 5,278,160 | 18,092,178 | 2,486,365 | 11,025 | |
| vm_mod_fused_nn_contrib_dense_pack_1 | 408.29 | 13.08 | cpu0 | 4 | float32[3000, 20], float32[1, 20, 20], float32[3000, 20] | c6382300161e7b61 | | 4,120 | 1,379,177 | 5,025,882 | 866,665 | 5,082 | NC20n |
| vm_mod_fused_nn_contrib_dense_pack_2 | 42.32 | 1.36 | cpu0 | 1 | float32[3000, 20], float32[1, 20, 8], float32[3000, 8] | 939e1bd5242e3085 | | 764 | 145,090 | 501,736 | 92,334 | 2,607 | NC8n |
| VM::AllocStorage | 13.74 | 0.44 | cpu0 | 12 | | | | 841 | 85,172 | 69,685 | 8,946 | 26,549 | |
| vm_mod_fused_nn_contrib_dense_pack | 12.02 | 0.39 | cpu0 | 1 | float32[3000, 2], float32[1, 2, 20], float32[3000, 20] | 9dc8805547517187 | | 598 | 43,390 | 133,645 | 23,230 | 561 | NC20n |
| VM::AllocTensor | 9.14 | 0.29 | cpu0 | 10 | | | float32[3000, 20] | 988 | 57,520 | 63,576 | 4,745 | 9,389 | |
| VM::UnknownOp | 8.71 | 0.28 | cpu0 | 25 | | | | 2,988 | 89,378 | 98,315 | 5,586 | 9,086 | |
| vm_mod_fused_reshape_add | 4.47 | 0.14 | cpu0 | 1 | float32[3000, 8], float32[8], float32[3000, 1, 8] | 69c860389db3a274 | | 512 | 17,349 | 35,091 | 8,236 | 329 | |
| VM::ReshapeTensor | 4.11 | 0.13 | cpu0 | 1 | | | | 329 | 27,531 | 5,332 | 10,114 | 8,889 | |
| VM::AllocTensor | 1.51 | 0.05 | cpu0 | 1 | | | float32[3000, 8] | 84 | 9,099 | 7,139 | 1,798 | 2,546 | |
| VM::AllocTensor | 0.81 | 0.03 | cpu0 | 1 | | | float32[3000, 1, 8] | 77 | 5,173 | 7,516 | 192 | 200 | |
| Sum | 2,084.83 | 66.79 | | 62 | | | | 14,631 | 7,137,039 | 24,040,095 | 3,508,211 | 76,263 | |
| Total | 3,121.32 | | cpu0 | 1 | | | | 40,550 | 9,801,638 | 25,860,685 | 4,808,036 | 439,378 | |

Configuration
-------------
Number of threads: 1
Executor: VM
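
To make the raw counters a bit easier to read, here is a minimal sketch that derives instructions per cycle and stall fractions for the heaviest kernel (the first row of the report). The numbers are copied from the report; the function name `derived_metrics` is only illustrative, and how the stalled-cycle events are counted depends on the CPU, so treat the ratios as a rough indication rather than a precise measurement.

```python
# Counter values copied from the first row of the profiler report
# (the fused reshape/add/exp/... kernel).
cycles = 5_278_160
instructions = 18_092_178
stalled_backend = 2_486_365
stalled_frontend = 11_025


def derived_metrics(cycles, instructions, stalled_backend, stalled_frontend):
    """Turn raw perf counters into ratios that are easier to compare across ops."""
    return {
        "ipc": instructions / cycles,
        "backend_stall_fraction": stalled_backend / cycles,
        "frontend_stall_fraction": stalled_frontend / cycles,
    }


metrics = derived_metrics(cycles, instructions, stalled_backend, stalled_frontend)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Share of the total run time that the report attributes to individual ops
# (the "Sum" row divided by the "Total" row, both in microseconds).
profiled_share = 2_084.83 / 3_121.32
print(f"profiled share: {profiled_share:.1%}")
```

The last ratio matches the 66.79 percent shown in the Sum row, so roughly a third of the total time is not attributed to any listed op.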
A few important things:
- I could not profile on the CPU I will actually be running on (the Intel Xeon Platinum) because of a PAPI error: `PAPIError: -7 Event does not exist: perf::STALLED-CYCLES-FRONTEND`. So I ran it on another CPU to get an intuition of the performance.
- There is something I am not sure I fully understand: when building with MKL, AutoTVM detects 12 tasks to optimize; without MKL, it finds 6. In the profiler output above, I count 11 tasks. Maybe I am misunderstanding something.
- In order to profile only my optimized model, I used `with autotvm.apply_history_best`, then built my lib with `lib = relay.build(mod, target=target, params=params)` so that I have the optimized one. Then I update my `params` with `params = lib.get_params()` using the optimized lib. Does this flow seem correct?
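
For reference, the flow described in the last point can be sketched roughly as follows. This is only a sketch under assumptions: the function name `build_with_tuning_log` and the `log_file` argument (the path to the AutoTVM tuning log) are hypothetical, and `opt_level=3` is a common choice rather than something taken from my actual script.

```python
def build_with_tuning_log(mod, params, target, log_file):
    """Build a Relay module while applying the best AutoTVM records.

    `log_file` is assumed to be the tuning log written during AutoTVM tuning.
    """
    # Imported lazily so the sketch can be loaded without TVM installed.
    import tvm
    from tvm import autotvm, relay

    # The build has to happen inside the apply_history_best context,
    # otherwise the tuned schedules are not picked up.
    with autotvm.apply_history_best(log_file):
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(mod, target=target, params=params)

    # Parameters as stored in the built library (after optimization).
    return lib, lib.get_params()
```

The important detail is that `relay.build` is called inside the `with` block; building outside of it would silently fall back to untuned schedules.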
EDIT 3: I set TVM_NUM_THREADS=8 to see if there was any difference. I get a 4x improvement over the poor performance I had before. That is much better, but still very slow compared to cppflow. I am going to keep exploring this lead.
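
A minimal sketch of how the thread count can be set from Python and how the two runtimes could be timed the same way. The helper `median_latency_ms` and its parameters are hypothetical; setting the variable before TVM is imported is my assumption about the safest order, since the thread pool has to see it when it starts.

```python
import os
import statistics
import time

# Set before importing tvm (assumption: the runtime reads it when the
# thread pool is created, so setting it early is the safe option).
os.environ["TVM_NUM_THREADS"] = "8"


def median_latency_ms(run_once, repeat=50, warmup=5):
    """Median wall-clock latency of a zero-argument callable, in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm-up runs are not measured
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)
```

Passing one callable that wraps the TVM inference and another that wraps the cppflow call keeps the comparison on equal footing (same input, same number of repetitions).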