PAPI counters with basic matmul Relay function

I’ve been looking at the tutorial for running a model with PAPI performance counters.

However, I’m having difficulty running the code with a basic Relay function, e.g. the matmul_add example seen throughout the docs.

So far I’ve got this:

A, B, C, out = matmul_add(N, L, M, "float32")
s = te.create_schedule(out.op)
mod = tvm.lower(s, [A, B, C, out], name="main")
exe = relay.vm.compile(mod, target)
vm = profiler_vm.VirtualMachineProfiler(exe, dev)

report = vm.profile(
    *tvm_args,
    func_name="main",
    collectors=[tvm.runtime.profiling.PAPIMetricCollector()],
)

However, it fails with:

Check failed: (can_dispatch(n)) is false: NodeFunctor calls un-registered function on type tir.PrimFunc

All of my simple code is available here as a gist.

Does anyone have any pointers on how to get PAPI counters for basic standalone functions like this?

The problem you are having is that the profiler and PAPI are meant to be run on an entire relay program, not on a single operator. The error you are getting is basically saying that you are passing the wrong inputs to relay.vm.compile. It expects a relay program.

You have two options here: 1. you can create a single operator relay program and run it with the profiler or 2. you can write some code that uses the PAPIMetricCollector directly (you can look at src/runtime/contrib/papi.cc).

Many thanks.

Option 2 certainly seems like the best choice in terms of longer-term value. My knowledge of the TVM C++ side of things is patchy at best, but I’d like to learn more.

Right now however, it seems like Option 1 could be the easiest way to get the info I need? Are there any resources you can point me to that could help me understand how to make a Relay program from a single operator? Would I need to register things in a complex way? Or is it just a matter of passing the output of tvm.lower to the right function so that it’s encapsulated as a relay program?

Option 2 would definitely be useful. I’d like to write it eventually, but I’m pretty busy.

For option 1, I think you’ll need to define your own operator and then define an implementation for it. See Adding an Operator to Relay — tvm 0.8.dev0 documentation. I’m not sure you can pass the output of tvm.lower straight to an executor. tests/python/unittest/test_runtime_heterogeneous.py appears to do something like that, but I don’t fully understand how it works.
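For a computation like matmul_add that can be written with built-in Relay operators, a single-operator Relay program may not need a custom operator at all. The following is a minimal sketch under that assumption, not a confirmed recipe from this thread: it rebuilds out = A @ B + C from relay.nn.dense and relay.add instead of reusing the output of tvm.lower. The function names build_matmul_add_relay and profile_with_papi are hypothetical, the target is assumed to be "llvm" on CPU, and TVM is assumed to be built with PAPI support. TVM is imported inside the functions so the file can still be loaded where TVM is not installed.

```python
def build_matmul_add_relay(n, l, m, dtype="float32"):
    """Return a Relay IRModule whose "main" computes A @ B + C.

    Hypothetical helper: it expresses matmul_add with built-in Relay ops
    rather than wrapping the TE schedule from the original post.
    """
    import tvm
    from tvm import relay

    a = relay.var("A", shape=(n, l), dtype=dtype)
    b = relay.var("B", shape=(l, m), dtype=dtype)
    c = relay.var("C", shape=(n, m), dtype=dtype)
    # nn.dense computes data @ weight^T, so B is transposed to shape (m, l).
    matmul = relay.nn.dense(a, relay.transpose(b))
    out = relay.add(matmul, c)
    return tvm.IRModule.from_expr(relay.Function([a, b, c], out))


def profile_with_papi(mod, args, target="llvm"):
    """Compile a Relay module for the VM and profile "main" with PAPI.

    Assumes a CPU target and a TVM build with PAPI enabled; `args` are the
    input tensors (for example tvm.nd.array values) in parameter order.
    """
    import tvm
    from tvm import relay
    from tvm.runtime import profiler_vm, profiling

    dev = tvm.cpu(0)
    exe = relay.vm.compile(mod, target)
    vm = profiler_vm.VirtualMachineProfiler(exe, dev)
    return vm.profile(
        *args,
        func_name="main",
        collectors=[profiling.PAPIMetricCollector()],
    )
```

With inputs a_nd, b_nd and c_nd created via tvm.nd.array, a call such as print(profile_with_papi(build_matmul_add_relay(N, L, M), [a_nd, b_nd, c_nd])) would be expected to print the profiling report. Note that the generated kernel may differ from the hand-written TE schedule, so the counters describe Relay's lowering of the operator rather than the original schedule.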

Many thanks tkonolige, I think this is a good excuse for me to learn more about the internals of the TVM runtime and profiling.

I’ve started by making a simple C++ deployment of the matmul_add, with the goal of using it to implement Option 2.

I am following the basic structure of apps/howto_deploy for my example.

Basically, I want to get it working in C++ before I try to build a nice Python wrapper and work through all the layers of abstraction that would involve.

I’ve been reading through the PAPI and Profiler code, and have already learned a lot. I see in the definition of the Profiler the example usage:

Device cpu, gpu;
Profiler prof({cpu, gpu});
my_gpu_kernel(); // do a warmup iteration
prof.Start();
prof.StartCall("my_gpu_kernel", gpu);
my_gpu_kernel();
prof.StopCall();
prof.StartCall("my_cpu_function", cpu);
my_cpu_function();
prof.StopCall();
prof.Stop();
std::cout << prof.Report() << std::endl; // print profiling report

I am trying something similar, which might be the right way to go, using the PAPI collector as the metric collector:

tvm::Device dev = {kDLCPU, 0};
tvm::Map<tvm::runtime::profiling::DeviceWrapper, tvm::Array<tvm::String>> metrics({
   {kDLCPU,
    {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
     "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
   {kDLCUDA, {"cuda:::event:elapsed_cycles_sm:device=0"}}});


tvm::runtime::profiling::MetricCollector papi_collector = tvm::runtime::profiling::CreatePAPIMetricCollector(metrics);

std::cout << "papi_collector created" << std::endl;

tvm::runtime::profiling::Profiler prof = tvm::runtime::profiling::Profiler({dev}, {papi_collector});
std::cout << "Profiler created" << std::endl;
f(A, B, C, out); // warmup
std::cout << "Warmup performed" << std::endl;
prof.Start();
prof.StartCall("matmul_add_dyn", dev);
f(A, B, C, out);
prof.StopCall();
prof.Stop(); // stop the collectors before requesting a report

My main issue right now is the initialiser of metrics, which CreatePAPIMetricCollector requires. It’s not clear to me how to get the typing right.

I can’t find anywhere else in the codebase that uses Map<DeviceWrapper, Array<String>>.

I have my code here, which can be cloned into tvm/apps, and run with ./run_example.sh. Compiling the PAPI example is make papi.

Any pointers on that line?

You need to manually construct DeviceWrapper inside the initializer list.

 tvm::Map<tvm::runtime::profiling::DeviceWrapper, tvm::Array<tvm::String>> metrics({
     {tvm::runtime::profiling::DeviceWrapper({kDLCPU, 0}), {"perf::Cycles"}}
     });

This is exactly what I needed, thanks!

I’m now able to extract the PAPI counters from standalone functions by running the function exported as an .so library in C++, with the above PAPI code!

I’ll use this method to get the data I need.

Now, looking forward, I’m thinking about how best to expose a Python interface to this, to make it more usable for others in the short-to-medium term.

Within my C++ module, I benchmark using a PackedFunc. I can get the PackedFunc from the Python side mod (i.e. output of tvm.build) by running mod.entry_func.

I guess what I would need is a Python exposed C++ interface that takes a tvm.module, the input tensors, and the target device + PAPI counters.

Then it can just return the JSON from the tvm::runtime::profiling::Report.
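Once the JSON reaches Python, the counters can be aggregated with the standard library alone. The sketch below assumes the report JSON has a top-level "calls" list in which each entry maps a metric name to a single-key object such as {"count": N} or {"string": "..."}; that layout should be checked against the TVM version in use. The sample string is hand-written and hypothetical, not real profiler output, and sum_counters is an illustrative name.

```python
import json
from collections import defaultdict

# Hypothetical, hand-written sample in the assumed layout (not real output).
SAMPLE_REPORT_JSON = """
{
  "calls": [
    {"Name": {"string": "matmul_add"},
     "perf::CYCLES": {"count": 1000},
     "perf::INSTRUCTIONS": {"count": 2500}},
    {"Name": {"string": "matmul_add"},
     "perf::CYCLES": {"count": 1200},
     "perf::INSTRUCTIONS": {"count": 2500}}
  ],
  "device_metrics": {}
}
"""


def sum_counters(report_json):
    """Sum every {"count": N} metric per call name across all calls."""
    totals = defaultdict(lambda: defaultdict(int))
    for call in json.loads(report_json).get("calls", []):
        name = call.get("Name", {}).get("string", "<unnamed>")
        for metric, value in call.items():
            # Only counter-style metrics are summed; strings, durations
            # and percentages are skipped.
            if isinstance(value, dict) and "count" in value:
                totals[name][metric] += value["count"]
    return {name: dict(metrics) for name, metrics in totals.items()}


totals = sum_counters(SAMPLE_REPORT_JSON)
print(totals)
```

Keeping the aggregation on the Python side means the C++ interface only has to return a string, which keeps the exposed surface small.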

I’ll need to think about the best place to build this. Should it be a method of Module, or would it be better to keep it separate somehow?

EDIT

I have shown in my example that the TVM profiling system, as well as the PAPI profiler, can work without running in the Relay VM (a system I have only just learned about - fascinating idea, though I wonder what sorts of overheads we can expect).

I’m looking to see if there is a standard way of using the profiler outside of the VM, that I could hook the PAPI profiler into.

However, the only usages of the profiler I can find are in the PAPI tests themselves.

We can see a very simple function being profiled in the VM in this test, but it requires a Relay Function, and compilation in the VM.

Not really what I need, given I already have a tvm.runtime.packed_func.PackedFunc.

But perhaps I can take some design cues from VirtualMachineProfiler.

Here is a PR doing what you want: https://github.com/apache/tvm/pull/9553 (though it takes an IRModule instead of a PackedFunc).