Issue on profiling models with PAPI

I ran into an issue while following the how-to guide “Getting Started With PAPI”. I have PAPI installed and built TVM with PAPI support. However, the code from the guide does not produce the expected output; it prints the following instead.

Name                                  Duration (us)  Percent  Device  Count                                                    Argument Shapes              Hash  VM::Argument Shapes  
fused_nn_dense_nn_bias_add_nn_relu            24.12    23.20    cpu0      1  float32[1, 784], float32[128, 784], float32[128], float32[1, 128]  35ac6d50e6e03a62                       
fused_nn_dense_nn_bias_add_nn_relu_1           6.52     6.27    cpu0      1     float32[1, 128], float32[64, 128], float32[64], float32[1, 64]  7c89e1efbba1ce3b                       
fused_nn_dense_nn_bias_add                     4.91     4.72    cpu0      1       float32[1, 64], float32[10, 64], float32[10], float32[1, 10]  8a679957c4723fed                       
VM::AllocStorage                               3.82     3.68    cpu0      5                                                                                             float32[3136]  
fused_nn_batch_flatten                         1.39     1.33    cpu0      1                             float32[1, 1, 28, 28], float32[1, 784]  cafe14d2106368be                       
VM::AllocTensor                                0.99     0.96    cpu0      2                                                     float32[1, 10]                                         
fused_nn_softmax                               0.74     0.71    cpu0      1                                     float32[1, 10], float32[1, 10]  0cc19816e7a3c070                       
VM::AllocTensor                                0.65     0.63    cpu0      1                                                    float32[1, 784]                                         
VM::AllocTensor                                0.58     0.56    cpu0      1                                                     float32[1, 64]                                         
VM::AllocTensor                                0.49     0.47    cpu0      1                                                    float32[1, 128]                                         
----------                                                                                                                                                                             
Sum                                           44.21    42.54             15                                                                                                            
Total                                        103.94             cpu0      1 

Also, changing the metrics to be collected didn’t have any effect on the output; it stayed the same. I tried the papi_avail command, and it showed

Of 108 possible events, 0 are available, of which 0 are derived.

No events detected!  Check papi_component_avail to find out why.

And the papi_component_avail command showed

Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
   \-> Disabled: Unknown libpfm4 related error
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
   \-> Disabled: No uncore PMUs or events found
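
To make the listing above easier to check at a glance, here is a minimal sketch that pulls out each disabled component and its reason. It assumes the "Name:" / "Disabled:" line layout shown above; the helper name disabled_components is hypothetical and not part of PAPI or TVM.

```python
import re


def disabled_components(text):
    """Return {component_name: reason} for every component reported as disabled.

    Assumes the layout printed by papi_component_avail above: a "Name:" line
    for each component, optionally followed by a "Disabled:" line.
    """
    disabled = {}
    current = None
    for line in text.splitlines():
        name_match = re.match(r"\s*Name:\s+(\S+)", line)
        if name_match:
            current = name_match.group(1)
            continue
        reason_match = re.search(r"Disabled:\s*(.+)", line)
        if reason_match and current is not None:
            disabled[current] = reason_match.group(1).strip()
    return disabled


# The exact text reported above, kept as a raw string because of the backslashes.
sample = r"""Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
   \-> Disabled: Unknown libpfm4 related error
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
   \-> Disabled: No uncore PMUs or events found
"""

for name, reason in disabled_components(sample).items():
    print(f"{name}: {reason}")
```

On a live system the text could presumably be captured with subprocess.run(["papi_component_avail"], capture_output=True, text=True).stdout instead of the pasted sample.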

I reinstalled perf with the command apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`, and perf -v now shows perf version 6.5.13. But the issue above still exists. Does anyone know how to solve it?

I got some clues from these threads: “PAPI on Alder Lake?”, “Linux support for various PMUs”, and “large number of unsupported papi counters on 13th Gen Intel Core i7-13800H”. It seems that libpfm4 does not support newer CPUs, and thus PAPI fails to get any CPU events. So I turned to an older computer with an Intel Core i7-6700 CPU. There, the papi_avail command finally showed some available events: “Of 108 possible events, 59 are available, of which 18 are derived.” I also finally got PAPI output once I explicitly set the metrics to be collected, choosing ones that are tagged “avail - Yes” in papi_avail, such as

report = vm.profile(
    data,
    func_name="main",
    collectors=[tvm.runtime.profiling.PAPIMetricCollector({dev: ["PAPI_SP_OPS"]})],
)

And here’s the output.

Name                                  Duration (us)  Percent  Device  Count                                                    Argument Shapes              Hash  PAPI_SP_OPS  VM::Argument Shapes  
fused_nn_dense_nn_bias_add_nn_relu            38.01     5.35    cpu0      1  float32[1, 784], float32[128, 784], float32[128], float32[1, 128]  35ac6d50e6e03a62      200,960                       
fused_nn_dense_nn_bias_add                    14.35     2.02    cpu0      1       float32[1, 64], float32[10, 64], float32[10], float32[1, 10]  8a679957c4723fed        1,290                       
fused_nn_dense_nn_bias_add_nn_relu_1          14.28     2.01    cpu0      1     float32[1, 128], float32[64, 128], float32[64], float32[1, 64]  7c89e1efbba1ce3b       16,512                       
VM::AllocStorage                               5.62     0.79    cpu0      5                                                                                                 0        float32[3136]  
VM::AllocTensor                                1.99     0.28    cpu0      2                                                     float32[1, 10]                              0                       
VM::AllocTensor                                1.52     0.21    cpu0      1                                                    float32[1, 784]                              0                       
fused_nn_softmax                               1.35     0.19    cpu0      1                                     float32[1, 10], float32[1, 10]  0cc19816e7a3c070           40                       
fused_nn_batch_flatten                         0.99     0.14    cpu0      1                             float32[1, 1, 28, 28], float32[1, 784]  cafe14d2106368be            0                       
VM::AllocTensor                                0.80     0.11    cpu0      1                                                    float32[1, 128]                              0                       
VM::AllocTensor                                0.75     0.11    cpu0      1                                                     float32[1, 64]                              0                       
----------                                                                                                                                                                                          
Sum                                           79.65    11.20             15                                                                                           218,802                       
Total                                        710.99             cpu0      1                                                                                           218,802                       
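
Picking only the events that papi_avail marks as available could be scripted. This is a rough sketch, assuming papi_avail prints one row per preset event with Name, Code, Avail and Deriv columns. The sample rows are written by hand to illustrate that layout and are not output from either machine, and the helper name available_presets is hypothetical.

```python
import re

# One preset row: name, hex code, Avail (Yes/No), Deriv (Yes/No), description.
ROW = re.compile(r"^\s*(PAPI_\w+)\s+0x[0-9a-fA-F]+\s+(Yes|No)\s+(Yes|No)\b")


def available_presets(text):
    """Return the names of preset events whose Avail column reads "Yes"."""
    events = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match and match.group(2) == "Yes":
            events.append(match.group(1))
    return events


# Hand-written rows in the assumed papi_avail layout (illustration only).
sample = """\
    Name        Code    Avail Deriv Description (Note)
PAPI_L1_DCM  0x80000000  Yes   No   Level 1 data cache misses
PAPI_L3_DCM  0x80000004  No    No   Level 3 data cache misses
PAPI_SP_OPS  0x80000067  Yes   Yes  Floating point operations; optimized to count scaled single precision vector operations
"""

print(available_presets(sample))
```

Names returned this way could then go into the list passed to PAPIMetricCollector({dev: [...]}), as in the call above. The hardware probably cannot count every available event at the same time, so it seems safer to request only a few per run.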

For some reason, the output is still different from the one in the guide. And if I don’t set any metrics to be collected, a perf-related error occurs:

Traceback (most recent call last):
  File "/home/augustine/yyz_workspace/test_papi.py", line 15, in <module>
    report = vm.profile(
  File "/home/augustine/tvm/python/tvm/runtime/profiler_vm.py", line 91, in profile
    return self._profile(func_name, collectors)
  File "/home/augustine/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  3: TVMFuncCall
  2: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::runtime::profiling::Report (tvm::runtime::String, tvm::runtime::Array<tvm::runtime::profiling::MetricCollector, void>)>::AssignTypedLambda<tvm::runtime::vm::VirtualMachineDebug::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::String, tvm::runtime::Array<tvm::runtime::profiling::MetricCollector, void>)#1}>(tvm::runtime::vm::VirtualMachineDebug::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::String, tvm::runtime::Array<tvm::runtime::profiling::MetricCollector, void>)#1})::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  1: tvm::runtime::profiling::Profiler::Profiler(std::vector<DLDevice, std::allocator<DLDevice> >, std::vector<tvm::runtime::profiling::MetricCollector, std::allocator<tvm::runtime::profiling::MetricCollector> >)
  0: tvm::runtime::profiling::PAPIMetricCollectorNode::Init(tvm::runtime::Array<tvm::runtime::profiling::DeviceWrapper, void>)
  File "/home/augustine/tvm/src/runtime/contrib/papi/papi.cc", line 199
PAPIError: -7 Event does not exist: perf::STALLED-CYCLES-FRONTEND.

It seems that PAPI still cannot collect information from perf. I’m still trying to figure out why.