Profile on Relay Level?

For a project, I want to train a number of models that can predict the execution time of a layer (from its Relay description) on different hardware targets.

My current problem is that I cannot find a good way to do this. The Debug Runtime measures the execution time of the low-level functions, which include fused layers and cannot be directly mapped to Relay nodes.

I looked into the Auto-Scheduler, since Ansor also works on a subgraph level, but it seems that it also measures individual TIR functions.

I would like to work with the Relay representation as it enables targeting BYOC backends, which might be more relevant for highly heterogeneous targets.

Since Relay is a graph-level IR whose ops only carry input and output types rather than compute and schedule definitions, latency measurement has to happen at the TIR level. If you want to profile the latency of each op, you could turn off op fusion.
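The naive attempt would be to simply disable the FuseOps pass in the PassContext. A minimal sketch, assuming mod and params hold your imported model and using an llvm target as a placeholder:

import tvm
from tvm import relay

# Naive attempt: disable the FuseOps pass during compilation.
# As explained below, this alone fails during lowering, because
# every op is expected to live inside a primitive function.
with tvm.transform.PassContext(opt_level=3, disabled_pass=["FuseOps"]):
    lib = relay.build(mod, target="llvm", params=params)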

However, simply turning off fusion will result in errors, because TVM requires every op to be in a primitive function during lowering. The right way to turn off fusion is to write a simple Relay pass that puts every single op into its own function (a sketch of such a pass follows the example below). For example:

%1 = nn.conv2d(...)
%2 = nn.bias_add(%1, ...)
%3 = nn.relu(%2)

becomes

%1 = fn(..., Primitive=1) {
  nn.conv2d(...)
}
%2 = %1(...)
%3 = fn(..., Primitive=1) {
  nn.bias_add(...)
}
%4 = %3(%2, ...)
%5 = fn(..., Primitive=1) {
  nn.relu(...)
}
%6 = %5(%4)

Then each function will contain a single op.
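A minimal, untested sketch of such a pass (the name WrapEachOp is made up here; it assumes InferType has been run so that argument types are available):

import tvm
from tvm import relay

@relay.transform.function_pass(opt_level=0)
class WrapEachOp:
    """Wrap every operator call into its own Primitive function so that
    lowering accepts the graph without fusion."""

    def transform_function(self, func, mod, ctx):
        class Wrapper(relay.ExprMutator):
            def visit_call(self, call):
                # Only wrap calls to Relay operators (conv2d, relu, ...);
                # calls to functions are left untouched.
                if not isinstance(call.op, tvm.ir.Op):
                    return super().visit_call(call)
                new_args = [self.visit(arg) for arg in call.args]
                # Fresh parameters for the single-op function; their types
                # are taken from the type-inferred arguments.
                params = [
                    relay.var("p%d" % i, type_annotation=arg.checked_type)
                    for i, arg in enumerate(call.args)
                ]
                body = relay.Call(call.op, params, call.attrs, call.type_args)
                inner = relay.Function(params, body).with_attr("Primitive", 1)
                return relay.Call(inner, new_args)

        return Wrapper().visit(func)

# Usage sketch:
#   mod = relay.transform.InferType()(mod)
#   mod = WrapEachOp()(mod)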

On the other hand, I personally don’t recommend this profiling approach, because in the normal compilation flow op fusion would definitely happen. If you would like to know whether offloading some ops to your device could improve the end-to-end performance, you should compare the latency of a fused function against the latency of offloading that same function to your device to get a fair conclusion.
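As a rough illustration, the end-to-end comparison could be set up like this (a sketch only; partition_for_my_backend stands for whatever partitioning helper your BYOC backend provides):

import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def end_to_end_latency(mod, params, target, dev, input_name, input_shape, dtype="float32"):
    """Build a Relay module and return its mean end-to-end latency in seconds."""
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input(input_name, np.random.rand(*input_shape).astype(dtype))
    timer = m.module.time_evaluator("run", dev, number=10, repeat=3)
    return timer().mean

# Usage sketch:
#   baseline = end_to_end_latency(mod, params, "llvm", tvm.cpu(), "data", (1, 3, 224, 224))
#   offloaded_mod = partition_for_my_backend(mod, params)   # placeholder BYOC partitioner
#   offloaded = end_to_end_latency(offloaded_mod, params, "llvm", tvm.cpu(), "data", (1, 3, 224, 224))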

Thank you a lot, yes, that makes a lot of sense.

I had hoped to build an automatic mapping solution for highly heterogeneous targets.

Is there a way to get the relay nodes that correspond to the individual TIR functions?

The approach I suggested is the most straightforward one. Relay to TIR is not a one-to-one mapping: a Relay node may be lowered to different TIR functions for different targets and input shapes/dtypes.

I am a bit confused; maybe I misunderstood your suggestion.

I am using the debug executor to measure the latency of the individual (fused) TIR functions, but I cannot tell which function corresponds to which part of the original/optimized Relay graph (example of a TIR function name: fused_layout_transform_nn_batch_flatten).
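For reference, this is roughly the setup I use to collect the per-function numbers (a sketch; model, input name, and shapes are placeholders):

import numpy as np
import tvm
from tvm import relay
from tvm.contrib.debugger import debug_executor

target = "llvm"
dev = tvm.cpu()
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)  # mod/params: the model

# The debug executor runs each node individually and reports per-node times;
# the node names are the fused function symbols such as
# "fused_layout_transform_nn_batch_flatten".
m = debug_executor.create(lib.get_graph_json(), lib.get_lib(), dev)
m.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
m.set_input(**lib.get_params())
m.run()  # prints/dumps the per-node timing table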

So I am aware of the n:m mapping between Relay nodes and TIR functions; however, I would like to keep information about filter sizes and about which operations are fused into each TIR function, since the performance prediction model needs this additional information.

Hi @comaniac,

sorry to bother you again, but is there an easy way to record which Relay nodes have been fused into which patterns, and the correspondence between Relay nodes and the TIR functions they are lowered to?

I don’t think there’s an easy way to do so for now, unfortunately. As I mentioned, the clearest way is to create a model snippet and compare the Relay model and the lowered TIR functions manually, but that would be tedious if you are interested in an entire large model.
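For a small snippet, the manual comparison could look roughly like this (a sketch; relay.optimize is used to obtain the fused module that the build pipeline would see):

import tvm
from tvm import relay

def compare_fused_relay_and_lowered(mod, params, target="llvm"):
    with tvm.transform.PassContext(opt_level=3):
        # The optimized module contains the primitive (fused) functions;
        # printing it shows which Relay ops were grouped together.
        opt_mod, _ = relay.optimize(mod, target=target, params=params)
        print(opt_mod)
        lib = relay.build(mod, target=target, params=params)
    # The graph JSON lists the node names, which are the fused function
    # symbols (e.g. "fused_layout_transform_nn_batch_flatten"), so they can
    # be matched by hand against the printed primitive functions.
    print(lib.get_graph_json())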


I tried to find the code that lowers the Relay representation to TIR (or TE), but was unsuccessful.

Can you point me to the correct place? I still want to try to automate it, even if it requires more development effort.