PAPI counters with basic matmul Relay function

Wheest · October 18, 2021, 9:09am

I’ve been looking at the tutorial for running a model with PAPI performance counters.

However, I’m having difficulty running the code with a basic Relay function, e.g. the matmul_add example seen throughout the docs.

So far I’ve got this:

A, B, C, out = matmul_add(N, L, M, "float32")
s = te.create_schedule(out.op)
mod = tvm.lower(s, [A, B, C, out], name="main")
exe = relay.vm.compile(mod, target)
vm = profiler_vm.VirtualMachineProfiler(exe, dev)

report = vm.profile(
    *tvm_args,
    func_name="main",
    collectors=[tvm.runtime.profiling.PAPIMetricCollector()],
)

However, it fails with:

Check failed: (can_dispatch(n)) is false: NodeFunctor calls un-registered function on type tir.PrimFunc

All of my simple code is available here as a gist.

Does anyone have any pointers on how to get PAPI counters for basic standalone functions like this?

tkonolige · October 18, 2021, 4:30pm

The problem is that the VM profiler and PAPI are meant to be run on an entire Relay program, not on a single operator. The error is basically saying that you are passing the wrong input to relay.vm.compile: it expects a Relay program, but the output of tvm.lower contains a TIR function (hence the tir.PrimFunc in the message).

You have two options here: 1. you can create a single operator relay program and run it with the profiler or 2. you can write some code that uses the PAPIMetricCollector directly (you can look at src/runtime/contrib/papi.cc).
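A minimal sketch of what option 1 could look like in Python, assuming a TVM 0.8-era build with Relay and PAPI enabled, and assuming the computation can be written with built-in Relay ops (nn.dense, transpose, add) rather than a newly registered operator. The helper names (matmul_add_shapes, build_matmul_add, profile_matmul_add) are hypothetical and not part of TVM.

```python
def matmul_add_shapes(n, l, m):
    """Shapes of A, B and C for out = A @ B + C."""
    return ((n, l), (l, m), (n, m))


def build_matmul_add(n, l, m, dtype="float32"):
    """Return a Relay IRModule computing A @ B + C (hypothetical helper)."""
    import tvm
    from tvm import relay

    shapes = matmul_add_shapes(n, l, m)
    a, b, c = (
        relay.var(name, shape=shape, dtype=dtype)
        for name, shape in zip("ABC", shapes)
    )
    # nn.dense(x, w) computes x @ w.T, so B is transposed to obtain A @ B.
    out = relay.add(relay.nn.dense(a, relay.transpose(b)), c)
    return tvm.IRModule.from_expr(relay.Function([a, b, c], out))


def profile_matmul_add(n=64, l=64, m=64, target="llvm"):
    """Compile the module for the Relay VM and profile it with PAPI."""
    import numpy as np
    import tvm
    from tvm import relay
    from tvm.runtime import profiler_vm

    dev = tvm.cpu()
    exe = relay.vm.compile(build_matmul_add(n, l, m), target)
    vm = profiler_vm.VirtualMachineProfiler(exe, dev)
    args = [
        tvm.nd.array(np.random.rand(*shape).astype("float32"), device=dev)
        for shape in matmul_add_shapes(n, l, m)
    ]
    # Requires TVM built with PAPI support (USE_PAPI).
    return vm.profile(
        *args,
        func_name="main",
        collectors=[tvm.runtime.profiling.PAPIMetricCollector()],
    )
```

Calling print(profile_matmul_add()) should then print the profiling report, with whatever default counters the PAPI collector picks for the device. TVM is imported inside the functions so the definitions load even on a machine without it.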

Wheest · October 18, 2021, 4:49pm

Many thanks.

Option 2 certainly seems best in terms of longer-term value. My knowledge of the TVM C++ side of things is patchy at best, but I’d like to learn more.

Right now however, it seems like Option 1 could be the easiest way to get the info I need? Are there any resources you can point me to that could help me understand how to make a Relay program from a single operator? Would I need to register things in a complex way? Or is it just a matter of passing the output of tvm.lower to the right function so that it’s encapsulated as a relay program?

tkonolige · October 18, 2021, 8:01pm

Option 2 would definitely be useful. I’d like to write it eventually, but I’m pretty busy.

For option 1, I think you’ll need to define your own operator and then define an implementation for it. See Adding an Operator to Relay — tvm 0.8.dev0 documentation. I’m not sure you can pass the output of tvm.lower straight to an executor. tests/python/unittest/test_runtime_heterogeneous.py appears to do something like that, but I don’t fully understand how it works.

Wheest · October 25, 2021, 4:02pm

Many thanks tkonolige, I think this is a good excuse for me to learn more about the internals of the TVM runtime and profiling.

I’ve started with making a simple C++ deployment of the matmul_add, with the goal of using it to implement Option 1.

I am following the basic structure of apps/howto_deploy (link) for my example.

Basically, I want to get it working in C++ before I try to make a nice Python wrapper and deal with all the layers of abstraction I’d need to break through.

I’ve been reading through the PAPI and Profiler code, and have already learned a lot. I see in the definition of the Profiler the example usage:

Device cpu, gpu;
Profiler prof({cpu, gpu});
my_gpu_kernel(); // do a warmup iteration
prof.Start();
prof.StartCall("my_gpu_kernel", gpu);
my_gpu_kernel();
prof.StopCall();
prof.StartCall("my_cpu_function", cpu);
my_cpu_function();
prof.StopCall();
prof.Stop();
std::cout << prof.Report()->AsTable() << std::endl; // print profiling report

I am trying something similar, which might be the right way to go, using the PAPI collector as the metric collector:

tvm::Device dev = {kDLCPU, 0};
tvm::Map<tvm::runtime::profiling::DeviceWrapper, tvm::Array<tvm::String>> metrics({
   {kDLCPU,
    {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
     "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
   {kDLCUDA, {"cuda:::event:elapsed_cycles_sm:device=0"}}});


tvm::runtime::profiling::MetricCollector papi_collector = tvm::runtime::profiling::CreatePAPIMetricCollector(metrics);

std::cout << "papi_collector created" << std::endl;

tvm::runtime::profiling::Profiler prof = tvm::runtime::profiling::Profiler({dev}, {papi_collector});
std::cout << "Profiler created" << std::endl;
f(A, B, C, out); // warmup
std::cout << "Warmup performed" << std::endl;
prof.Start();
prof.StartCall("matmul_add_dyn", dev);
f(A, B, C, out);
prof.StopCall();

My main issue right now is the initialiser of metrics, which CreatePAPIMetricCollector requires. It’s not clear to me how to get the typing right.

I can’t find anywhere else in the codebase that uses Map<DeviceWrapper, Array<String>>.

I have my code here. It can be cloned into tvm/apps and run with ./run_example.sh. The PAPI example is compiled with make papi.

Any pointers on that line?

tkonolige · October 25, 2021, 5:09pm

You need to manually construct DeviceWrapper inside the initializer list.

tvm::Map<tvm::runtime::profiling::DeviceWrapper, tvm::Array<tvm::String>> metrics({
    {tvm::runtime::profiling::DeviceWrapper({kDLCPU, 0}), {"perf::Cycles"}}
});

Wheest · October 25, 2021, 7:12pm

This is exactly what I needed, thanks!

I’m now able to extract the PAPI counters from standalone functions by running the function exported as an .so library in C++, with the above PAPI code!

I’ll use this method to get the data I need.

Now, looking forward, I’m thinking how best to expose a Python interface to this, to try and make this more usable for others in the short-to-medium term.

Within my C++ module, I benchmark using a PackedFunc. I can get the PackedFunc from the Python side mod (i.e. output of tvm.build) by running mod.entry_func.

I guess what I would need is a C++ interface exposed to Python that takes a tvm.module, the input tensors, the target device, and the PAPI counters to collect.

Then it can just return the JSON from the tvm::runtime::profiling::Report.
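On the Python side, that JSON could be turned into plain dictionaries with something like the sketch below. It assumes the report is a JSON object with a "calls" list whose values are wrapped in single-key objects such as {"count": 1} or {"microseconds": 12.3}. That layout is my understanding of what the Report emits and may differ between TVM versions. The helper names and the sample string are hypothetical, and the sample is hand-written rather than measured.

```python
import json


def unwrap(value):
    """Turn a wrapped metric such as {"count": 1} into its bare value."""
    if isinstance(value, dict) and len(value) == 1:
        return next(iter(value.values()))
    return value  # already a plain value


def calls_from_report_json(report_json):
    """Return one plain dict per profiled call from a report JSON string."""
    report = json.loads(report_json)
    return [
        {key: unwrap(val) for key, val in call.items()}
        for call in report.get("calls", [])
    ]


# Hand-written sample in the assumed layout, not real profiling data.
sample = json.dumps(
    {
        "calls": [
            {
                "Name": {"string": "matmul_add"},
                "Count": {"count": 1},
                "perf::CYCLES": {"count": 1000},
            }
        ],
        "device_metrics": {},
    }
)
rows = calls_from_report_json(sample)
print(rows[0]["Name"], rows[0]["perf::CYCLES"])
```

With a real report object in Python, the string would come from report.json() instead of the sample.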

I’ll need to think about the best place to build this. Should it be a method of Module, or would it be better to keep it separate somehow?

EDIT

I have shown in my example that the TVM profiling system, as well as the PAPI profiler, can work without running in the Relay VM (a system I have only just learned about - fascinating idea, though I wonder what sorts of overheads we can expect).

I’m looking to see if there is a standard way of using the profiler outside of the VM, that I could hook the PAPI profiler into.

However, the only uses of the profiler I can find are in the PAPI tests themselves.

We can see a very simple function being profiled in the VM in this test, but it requires a Relay Function, and compilation in the VM.

Not really what I need, given I already have a tvm.runtime.packed_func.PackedFunc.

But perhaps I can take some design cues from VirtualMachineProfiler.

tkonolige · November 23, 2021, 6:32pm

Here is a PR doing what you want: https://github.com/apache/tvm/pull/9553 (though it takes an IRModule instead of a PackedFunc).
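A rough sketch of how the function from that PR might be called, assuming it is exposed as tvm.runtime.profiling.profile_function(mod, dev, collectors), that it returns a callable which runs the function and yields a map from counter name to value, and that lib is the module produced by tvm.build. The helper names papi_counters and counters_to_dict are hypothetical, and the counter names are only examples that PAPI must support on the machine.

```python
def counters_to_dict(counters):
    """Convert the returned metric map into a plain dict of Python values."""
    # Values may be wrapper objects exposing .value; plain values pass through.
    return {str(name): getattr(val, "value", val) for name, val in counters.items()}


def papi_counters(lib, args, metrics=("perf::CYCLES", "perf::INSTRUCTIONS")):
    """Profile the entry function of a built module with PAPI (sketch)."""
    import tvm
    from tvm.runtime import profiling

    dev = tvm.cpu()
    # Requires TVM built with PAPI support (USE_PAPI).
    collector = profiling.PAPIMetricCollector({dev: list(metrics)})
    run = profiling.profile_function(lib, dev, [collector])
    return counters_to_dict(run(*args))
```

Here args would be the tvm.nd.array inputs and output buffer, in the same order the built function expects them.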

Wheest · February 3, 2022, 6:58pm

Many thanks, in theory this should make things a lot easier. However I’m having an issue where the unit test doesn’t pass (and I can’t run it in my own workflow).

Running test_profile_function() fails with the error tvm._ffi.base.TVMError: TVMError: bad_function_call at line 239. It happens for me both in the PR version and on the current HEAD (a45aa3e).

I’ve ensured that I have PAPI enabled on my TVM build, and I can collect metrics using the more awkward C++ deployment method I described above.

I haven’t made any modifications to the test. Could there be something up with my local setup, or is there something else I’m missing?

tkonolige · February 3, 2022, 7:34pm

Not sure where your issue is coming from. It would be helpful if you could capture a backtrace (might have to do this under gdb).