Profiling Report C++

I am trying to profile the operations of a neural network in C++, so I followed the ‘Example usage’ in the docs at tvm: tvm::runtime::profiling::Profiler Class Reference:

Profiler prof;
Device cpu;
prof.Start({cpu});
prof.StartCall("my_cpu_function", cpu);
my_cpu_function();
prof.StopCall();
prof.Stop();
std::cout << prof.Report << std::endl; // print profiling report

The trouble is in the last line: prof.Report does not exist. There is prof.Report(), which returns a Report, yet I cannot extract info from the returned Report. The appropriate class seems to be ReportNode, with the functions AsTable() and AsCSV(). Can someone show me how to solve this problem?

Try this:

Profiler prof;
Device cpu;
prof.Start({cpu});
prof.StartCall("my_cpu_function", cpu);
my_cpu_function();
prof.StopCall();
prof.Stop();
std::cout << prof.Report()->AsTable() << std::endl; // print profiling report
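
If you want the report in a machine-readable form instead, the same ReportNode also provides AsCSV() (and, in recent TVM versions, AsJSON()):

std::cout << prof.Report()->AsCSV() << std::endl; // CSV instead of a table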

If there are any issues, please provide a full stack trace and/or the compilation failure.

Thanks, that made the code compile. However, running that code crashed when executing prof.Start({cpu}). The solution is to specify the Device's device_type and device_id.
My code now is:

//Set up profiler
tvm::runtime::profiling::Profiler prof;
tvm::runtime::Device cpu;
cpu.device_type = kDLCPU;
cpu.device_id = 0;
std::cout << "Start profiling\n";
prof.Start({cpu});
//prof.StartCall("my_cpu_function", cpu);
//my_cpu_function();

//Benchmark inference
tvm::runtime::PackedFunc run = executor.GetFunction("run");
const int num_runs = 1000;
for (int i = 0; i < num_runs; ++i) run();

//Print profiling info
//prof.StopCall();
prof.Stop();
std::cout << prof.Report()->AsTable() << std::endl;

This prints:

Name        Duration (us)  Percent  Device
----------
Sum
Total         11326540.22   100.00    cpu0

This is disappointing, because I was hoping for a profile report showing how long each model operation (Conv2D, DepthwiseConv, etc.) takes; that is, something similar to TensorFlow Lite’s benchmark tool (Performance measurement  |  TensorFlow Lite).
How to obtain such a profile report?

Ah, if you’d like whole-model profiling results, you want to use the .profile methods on the debug graph executor or the VM profiler. Here is an example for the debug graph executor: tvm/test_runtime_profiling.py at main · apache/tvm · GitHub.

Thanks again, my Python code required minimal changes to get the desired profiling report. I ended up with:

from tvm.contrib.debugger import debug_executor as executor
...
#Profile inference
device = tvm.cpu()
e = executor.create(graph, lib, device)
e.load_params(params)
e.set_input("input_1", tvm.nd.array(test_data.astype(np.float32)))
#e.run()  #Prints timings of each layer
report = e.profile()  #Presents sum of timings of each layer type
tvm_inferred = e.get_output(0)
print("Tvm model: ", tvm_inferred)
print(report)

Subsequently I modified my C++ code from
tvm::runtime::Module executor = (*tvm::runtime::Registry::Get("tvm.graph_executor.create"))(graph, lib, device_type, device_id);
to
tvm::runtime::Module executor = (*tvm::runtime::Registry::Get("tvm.debug_executor.create"))(graph, lib, device_type, device_id);
But this generates a segmentation fault. Backtracing the segmentation fault suggests the debug executor does not exist. Stack trace:

(gdb) bt
#0  0x0000005555572364 in std::_Function_base::_M_empty() const ()
#1  0x0000005555573700 in std::function<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)>::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const ()
#2  0x0000005555573938 in tvm::runtime::TVMRetValue tvm::runtime::PackedFunc::operator()<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, tvm::runtime::Module&, int&, int&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, tvm::runtime::Module&, int&, int&) const ()
#3  0x0000005555570e20 in main ()

How to get the profiling working in C++?

Use tvm::runtime::Module executor = (*tvm::runtime::Registry::Get("tvm.graph_executor_debug.create"))(graph, lib, device_type, device_id);
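
As an aside, the segfault came from dereferencing the null pointer that Registry::Get returns when nothing is registered under a name (the stack trace shows a call through an empty std::function). A guarded lookup makes this failure explicit; a sketch, assuming the same graph/lib/device variables (the USE_GRAPH_EXECUTOR_DEBUG build flag name may differ between TVM versions):

// Registry::Get returns nullptr for unregistered names, which is why
// "tvm.debug_executor.create" crashed inside an empty std::function.
const tvm::runtime::PackedFunc* create = tvm::runtime::Registry::Get("tvm.graph_executor_debug.create");
ICHECK(create != nullptr) << "graph_executor_debug is not available in this TVM build";
tvm::runtime::Module executor = (*create)(graph, lib, device_type, device_id);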

Thanks, that solved the segfault. Still, when I call run() it does not print how long each model operation takes. I also extracted the profile() function from graph_executor_debug, but that function does not print the desired profiling info either. How can I obtain it in C++? I googled for documentation and C++ examples but could not find anything, hence I am asking here. My current C++ code is:

tvm::runtime::Module executor = (*tvm::runtime::Registry::Get("tvm.graph_executor_debug.create"))(graph, lib, device_type, device_id);
...
//Set up profiler
tvm::runtime::PackedFunc run = executor.GetFunction("run");  //Get inference function 'run'
tvm::runtime::PackedFunc profile = executor.GetFunction("profile");
tvm::runtime::profiling::Profiler prof;
tvm::runtime::Device cpu;
cpu.device_type = kDLCPU;
cpu.device_id = device_id;
std::cout << "Start profiling\n";
prof.Start({cpu});
prof.StartCall("profile", cpu);

profile();
run();  //Run inference

//Print profiling info
prof.StopCall();
prof.Stop();
std::cout << prof.Report()->AsTable() << std::endl;

You can get the report directly from the profile function and print that:

tvm::runtime::Module executor = (*tvm::runtime::Registry::Get("tvm.graph_executor_debug.create"))(graph, lib, device_type, device_id);
...
//Set up profiler
tvm::runtime::PackedFunc profile = executor.GetFunction("profile");

tvm::runtime::profiling::Report report = profile();
std::cout << report->AsTable() << std::endl;

You beat me to it with your post 🙂 With the Python code in mind, I also thought I should try to store the return result of profile() in a Report, and subsequently print the Report. Anyway, thanks for the help. Two questions remain:

(Q1) The output of my code

tvm::runtime::PackedFunc profile = executor.GetFunction("profile");
tvm::runtime::profiling::Profiler prof;
tvm::runtime::Device cpu;
cpu.device_type = kDLCPU;
cpu.device_id = device_id;
std::cout << "Start profiling\n";
prof.Start({cpu});
prof.StartCall("profile", cpu);

tvm::runtime::profiling::Report report = profile();

//Print profiling info
prof.StopCall();
prof.Stop();
std::cout << report->AsTable() << "\n";
std::cout << prof.Report()->AsTable() << "\n";

is:

Start profiling
Name                                                Duration (us)  Percent  Device  Count
fused_nn_contrib_conv2d_NCHWc_add_add_1                    848.27     5.26    cpu0      2
fused_nn_pad_3                                             799.85     4.96    cpu0      1
fused_nn_contrib_conv2d_NCHWc_add_add_2                    701.49     4.35    cpu0      3
fused_nn_contrib_conv2d_NCHWc_add_clip_4                   700.95     4.34    cpu0      3
fused_nn_contrib_depthwise_conv2d_NCHWc_add_clip_9         664.95     4.12    cpu0      1
...
fused_nn_fast_softmax                                       12.27     0.08    cpu0      1
----------
Sum                                                      11987.52    74.30             60
Total                                                    16134.01   100.00    cpu0

Name        Duration (us)  Percent  Device  Count
profile          54044.40    99.95    cpu0      1
----------
Sum              54044.40    99.95              1
Total            54070.89   100.00    cpu0

Why do the Totals of report->AsTable() and prof.Report()->AsTable() differ so much? 16134.01 vs 54070.89

(Q2) As you can see there are several Conv2D operations. Can they be grouped, instead of being printed separately? I.e. display the total time spent on Conv2D, DepthwiseConv2D, etc.?

  1. The totals are different because you are timing different regions. report->AsTable() is timing just the model execution, while prof.Report()->AsTable() is timing the model execution as well as all the profiling initialization code. You always want to use the profiling function on the executor rather than constructing your own profiler.

  2. Calls will be aggregated together if they are the same op and have the same input shapes (see the sketch below for toggling this aggregation). Note that fused_nn_contrib_conv2d_NCHWc_add_add_2 and fused_nn_contrib_conv2d_NCHWc_add_clip_4 are not the same, as the former has two fused adds and the latter has a fused add and a clip.
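
If you want to see individual calls without this merging, AsTable takes flags that control it; a sketch, assuming the AsTable(sort, aggregate, compute_col_sums) signature from tvm/runtime/profiling.h in your TVM version:

// aggregate=false lists every call separately instead of merging calls
// to the same op with the same input shapes
std::cout << report->AsTable(/*sort=*/true, /*aggregate=*/false,
                             /*compute_col_sums=*/true) << std::endl;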

(1) You’re right. So I now use the code:

tvm::runtime::PackedFunc profile = executor.GetFunction("profile");
tvm::runtime::profiling::Report report = profile();
std::cout << report->AsTable() << "\n";

This outputs:

Name                                                Duration (us)  Percent  Device  Count
fused_nn_contrib_conv2d_NCHWc_add_add_1                    848.27     5.26    cpu0      2
fused_nn_pad_3                                             799.85     4.96    cpu0      1
fused_nn_contrib_conv2d_NCHWc_add_add_2                    701.49     4.35    cpu0      3
fused_nn_contrib_conv2d_NCHWc_add_clip_4                   700.95     4.34    cpu0      3
fused_nn_contrib_depthwise_conv2d_NCHWc_add_clip_9         664.95     4.12    cpu0      1
...
fused_nn_fast_softmax                                       12.27     0.08    cpu0      1
----------
Sum                                                      11987.52    74.30             60
Total                                                    16134.01   100.00    cpu0

What code is responsible for the difference between Sum and Total? Is it model operations taking less time than the last-mentioned fused_nn_fast_softmax? Or is it (also) something else?

“Sum” is the sum of the times of all the operators that were run. “Total” is the time from the start of running the model to the end. Ideally these would be the same, but there is some overhead in running the model. This can be time spent doing control flow in the graph executor or vm (though this overhead is usually very small for the graph executor), or time spent collecting profiling information. If you run your model without profiling, the runtime should be similar to the “Sum” time.
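
To check this, you can time an unprofiled run with std::chrono and compare it against the “Sum” row. A minimal sketch, reusing the executor Module from earlier in the thread:

#include <chrono>
...
// Time one plain inference with no profiler attached; the result should be
// close to the "Sum" row of the profiling report.
tvm::runtime::PackedFunc run = executor.GetFunction("run");
auto t0 = std::chrono::steady_clock::now();
run();
auto t1 = std::chrono::steady_clock::now();
double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
std::cout << "Unprofiled run: " << us << " us\n";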

You’re right: the difference between Total and Sum becomes insignificant when I time the inference without profiling it.
Thanks for all the info and all the help.

For those interested, I wrote a simple Python script to extract from the profiling report the total time and percentage of Conv2D and DepthwiseConv2D (the most expensive operations in my network). You can easily replace/extend them with operations you’re interested in.

sum_conv2d_time = 0.0
sum_conv2d_perc = 0.0
sum_depthwise_conv2d_time = 0.0
sum_depthwise_conv2d_perc = 0.0
with open('profiling_report.txt', 'r') as f:
    for line in f:
        fields = line.split()
        if not fields:  # Skip empty lines
            continue
        if "contrib_conv2d" in fields[0]:
            sum_conv2d_time += float(fields[1])  # Duration (us) column
            sum_conv2d_perc += float(fields[2])  # Percent column
        elif "contrib_depthwise_conv2d" in fields[0]:
            sum_depthwise_conv2d_time += float(fields[1])
            sum_depthwise_conv2d_perc += float(fields[2])

print("Sum time conv2d:", sum_conv2d_time / 1000.0, "ms")
print("Sum perc conv2d:", sum_conv2d_perc)
print("Sum time depthwise conv2d:", sum_depthwise_conv2d_time / 1000.0, "ms")
print("Sum perc depthwise conv2d:", sum_depthwise_conv2d_perc)

While you could parse the printed profiling report back in, I’d recommend just accessing the report object itself:

report = my_mo.profile()

conv2d_sum = 0
conv2d_perc = 0
for call in report.calls:
    if "conv2d" in call["Name"]:
        conv2d_sum += call["Duration (us)"].microseconds
        conv2d_perc += call["Percent"].percent

print(conv2d_sum)
print(conv2d_perc)

This example is in Python, but you can do the same thing in C++, as in the sketch below.
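
For completeness, here is a sketch of the same aggregation in C++; it assumes the calls field and the DurationNode/PercentNode types from tvm/runtime/profiling.h, so check the exact names against your TVM version:

#include <tvm/runtime/profiling.h>
...
// Walk report->calls (an Array of Map<String, ObjectRef>) and sum the
// duration and percent columns of every conv2d call.
double conv2d_us = 0.0, conv2d_pct = 0.0;
for (const auto& call : report->calls) {
  std::string name = tvm::runtime::Downcast<tvm::runtime::String>(call.at("Name"));
  if (name.find("conv2d") == std::string::npos) continue;
  if (const auto* d = call.at("Duration (us)").as<tvm::runtime::profiling::DurationNode>())
    conv2d_us += d->microseconds;
  if (const auto* p = call.at("Percent").as<tvm::runtime::profiling::PercentNode>())
    conv2d_pct += p->percent;
}
std::cout << conv2d_us << " us, " << conv2d_pct << " %" << std::endl;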