This is exactly what I needed, thanks!
I’m now able to extract the PAPI counters from standalone functions by running the function exported as an .so library in C++, with the above PAPI code!
I’ll use this method to get the data I need.
Now, looking forward, I’m thinking about how best to expose a Python interface to this, to make it more usable for others in the short-to-medium term.
Within my C++ module, I benchmark using a PackedFunc. On the Python side, I can get the PackedFunc from mod (i.e. the output of tvm.build) by accessing mod.entry_func.
I guess what I would need is a Python-exposed C++ interface that takes a tvm.module, the input tensors, the target device, and the PAPI counters to collect.
Then it can just return the JSON from the tvm::runtime::profiling::Report.
I’ll need to think about the best place to build this. Should it be a method of Module, or would it be better to keep it separate somehow?
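To make the question concrete, here is a rough sketch of the "keep it separate" option as a free function. It is deliberately TVM-free so it runs on its own: a plain Python callable stands in for the PackedFunc, and a wall-clock timer stands in for the PAPI counters. Every name here (MetricCollector, WallClockCollector, profile_callable) is made up for illustration and is not part of TVM's API.

```python
import json
import time
from typing import Any, Callable, Dict, Protocol, Sequence


class MetricCollector(Protocol):
    """Hypothetical collector interface (an assumption, not TVM's API).

    A PAPI-backed collector would start and read hardware counters here.
    """

    def start(self) -> None: ...

    def stop(self) -> Dict[str, float]: ...


class WallClockCollector:
    """Stand-in collector: reports elapsed nanoseconds instead of PAPI counters."""

    def start(self) -> None:
        self._t0 = time.perf_counter_ns()

    def stop(self) -> Dict[str, float]:
        return {"elapsed_ns": float(time.perf_counter_ns() - self._t0)}


def profile_callable(
    func: Callable[..., Any],
    args: Sequence[Any],
    collectors: Sequence[MetricCollector],
    warmup_iters: int = 1,
) -> str:
    """Run func(*args) once under every collector and return the metrics as JSON."""
    # Warm-up runs are not measured.
    for _ in range(warmup_iters):
        func(*args)

    for collector in collectors:
        collector.start()
    func(*args)

    # Stop in reverse order so the collectors nest like a stack.
    metrics: Dict[str, float] = {}
    for collector in reversed(collectors):
        metrics.update(collector.stop())
    return json.dumps(metrics)


if __name__ == "__main__":
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    print(profile_callable(dot, ([1, 2, 3], [4, 5, 6]), [WallClockCollector()]))
```

In the real thing the callable would be mod.entry_func, the arguments would be the input tensors, and the returned string would be the JSON from the Report. The free-function shape has the advantage of not touching Module at all; a method on Module could simply forward to it.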
EDIT
I have shown in my example that the TVM profiling system, as well as the PAPI profiler, can work without running in the Relay VM (a system I have only just learned about - fascinating idea, though I wonder what sorts of overheads we can expect).
I’m looking to see if there is a standard way of using the profiler outside of the VM that I could hook the PAPI profiler into.
However, the only usages of the profiler I can find are in the PAPI tests themselves.
We can see a very simple function being profiled in the VM in this test, but it requires a Relay Function, and compilation in the VM.
Not really what I need, given I already have a tvm.runtime.packed_func.PackedFunc.
But perhaps I can take some design cues from VirtualMachineProfiler.