The current approach used by auto_scheduler to extract tuning tasks leverages Relay op strategy. In short, auto_scheduler registers an implementation in Relay op strategy just as AutoTVM does, but instead of using a TOPI schedule function, auto_scheduler creates an empty schedule and extracts the lowered TE compute as a tuning task (ref: https://github.com/apache/incubator-tvm/blob/main/python/tvm/relay/op/op.py#L147).
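To make this concrete, below is a minimal sketch, not the actual op-strategy code, of the underlying idea that a TE compute alone, with no TOPI schedule attached, already defines an auto_scheduler task. The conv2d_relu workload and its shapes are made up for illustration, and the exact task-creation API (SearchTask vs. create_task) may differ between TVM versions:

import tvm
from tvm import te, topi, auto_scheduler

@auto_scheduler.register_workload  # hypothetical workload, for illustration only
def conv2d_relu(N, CI, H, W, CO, KH, KW):
    data = te.placeholder((N, CI, H, W), name="data")
    kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
    conv = topi.nn.conv2d_nchw(data, kernel, stride=1, padding=1, dilation=1)
    out = topi.nn.relu(conv)
    # Only the compute is returned; no schedule is attached on purpose.
    return [data, kernel, out]

# The TE compute itself is the tuning task; auto_scheduler searches for a schedule.
task = auto_scheduler.SearchTask(
    func=conv2d_relu, args=(1, 3, 224, 224, 32, 3, 3), target="llvm"
)
print(task.compute_dag)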
However, an obvious issue with this approach is that the scope of a tuning task is limited by the Relay compile engine and op strategy. Specifically, each primitive Relay function can have at most one complicated op (i.e., reduce ops such as conv2d). The Relay compile engine marks that op as the anchor op (ref: https://github.com/apache/incubator-tvm/blob/main/src/relay/backend/compile_engine.cc#L231) and uses the TOPI schedule of that op to schedule the entire Relay function (ref: https://github.com/apache/incubator-tvm/blob/main/src/relay/backend/compile_engine.cc#L152).
Here is a motivating example:
def @main(%data: Tensor[(1, 3, 224, 224), float32], %weight1: Tensor[(32, 3, 3, 3), float32], %weight2: Tensor[(32, 32, 3, 3), float32]) {
  %3 = fn (%data1: Tensor[(1, 3, 224, 224), float32], %weight11: Tensor[(32, 3, 3, 3), float32], %weight21: Tensor[(32, 32, 3, 3), float32], Primitive=1) {
    %0 = nn.conv2d(%data1, %weight11, padding=[1, 1, 1, 1], kernel_size=[3, 3]);
    %1 = nn.relu(%0);
    %2 = nn.conv2d(%1, %weight21, padding=[1, 1, 1, 1], kernel_size=[3, 3]);
    nn.relu(%2)
  };
  %3(%data, %weight1, %weight2)
}
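(For reference, the function above can be constructed roughly as follows; the Python variable names are mine, and the 2-tuple padding is expanded by Relay into the 4-element form shown in the printed IR.)

import tvm
from tvm import relay

# Inner function: conv2d -> relu -> conv2d -> relu, marked as Primitive so
# that FuseOps keeps it as a single function.
data = relay.var("data1", shape=(1, 3, 224, 224))
w1 = relay.var("weight11", shape=(32, 3, 3, 3))
w2 = relay.var("weight21", shape=(32, 32, 3, 3))
conv1 = relay.nn.conv2d(data, w1, padding=(1, 1), kernel_size=(3, 3))
conv2 = relay.nn.conv2d(relay.nn.relu(conv1), w2, padding=(1, 1), kernel_size=(3, 3))
inner = relay.Function([data, w1, w2], relay.nn.relu(conv2)).with_attr("Primitive", 1)

# Outer main function that simply calls the primitive function.
d = relay.var("data", shape=(1, 3, 224, 224))
p1 = relay.var("weight1", shape=(32, 3, 3, 3))
p2 = relay.var("weight2", shape=(32, 32, 3, 3))
mod = tvm.IRModule.from_expr(relay.Function([d, p1, p2], relay.Call(inner, [d, p1, p2])))
print(mod)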
As can be seen, we manually set %3 to primitive so that it won't be partitioned into two separate functions by the FuseOps pass. If we simply build this function, we get the following error message:
Check failed: !anchor_op_.defined() || anchor_op_pattern_ < kCommReduce == false: Cannot apply TOPI schedule to a primitive function with two complicated ops anchor=Op(nn.conv2d) current=Op(nn.conv2d)
As a result, the goal of this RFC is to propose a mechanism that can turn the above Relay function into an auto_scheduler tuning task, and that also lets us build it with the resulting tuning logs.
The proposed mechanism is:
- Add a mode, use_topi_schedule, to the Relay compile engine. When use_topi_schedule=true, it behaves as it does today. When use_topi_schedule=false, we do not check whether the function has more than one reduce op but simply invoke auto_schedule_topi on the entire TE compute.
- Propagate the flag use_topi_schedule all the way to GraphRuntimeCodegen and relay.Build.
  - In auto_scheduler.extract_tasks, we set use_topi_schedule=false so that it can extract tasks.
  - In relay.build, we use auto_scheduler.DispatchContext.current to judge whether we should query the auto_scheduler schedule for the entire function or the TOPI schedule of the anchor op.
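Under this proposal, the user-facing flow would look roughly like the sketch below, reusing the mod built earlier with an empty params dict. The extract_tasks, TaskScheduler, TuningOptions, and ApplyHistoryBest calls follow the existing auto_scheduler API; how use_topi_schedule is toggled internally is hidden from the user, and the details may change in the final PR:

import tvm
from tvm import auto_scheduler, relay

target = tvm.target.Target("llvm")
params = {}

# Task extraction: use_topi_schedule=false inside the compile engine, so each
# primitive function becomes one task regardless of how many reduce ops it has.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

# Tune all extracted tasks and record the results to a log file.
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tuner.tune(auto_scheduler.TuningOptions(
    num_measure_trials=200,
    measure_callbacks=[auto_scheduler.RecordToFile("tuning.json")],
))

# Build: relay.build consults auto_scheduler.DispatchContext.current and queries
# the tuned schedule for the whole function instead of a TOPI anchor schedule.
with auto_scheduler.ApplyHistoryBest("tuning.json"):
    lib = relay.build(mod, target=target, params=params)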
The draft PR is available here. Note that since we now extract auto_scheduler tasks directly via the compile engine, we completely removed the auto_scheduler-related logic from Relay op strategy.
I also provide a running script here if you would like to play with more Relay functions.
One issue with this mechanism that hasn't been fully addressed is that we now collect too many tasks, including tasks with only placeholders and tasks with only a single layout_transform op (on CPU). Here are two possible solutions:
S1: Let the task scheduler judge: Since auto_scheduler has a task scheduler, it should be fine to send all tasks to it; after analyzing their compute DAGs, the task scheduler should not spend any tuning time on these trivial tasks.
S2: Embed heuristic or customized rules: Depending on the use case, we may find useful rules that easily prune out tasks during extraction. For example, we can ignore tasks without any call node, or tasks with fewer than N injective ops.
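As a strawman for S2, a small filter along these lines could run right after task extraction. The predicate and the threshold are illustrative only and not part of the proposal; it simply drops tasks whose compute DAG has no reduction and only a couple of compute ops:

from tvm import te

def is_trivial_task(task, min_ops=2):
    # A task containing only placeholders, or a single injective op such as
    # layout_transform, has no reduction and very few compute ops.
    compute_ops = [op for op in task.compute_dag.ops if isinstance(op, te.ComputeOp)]
    has_reduce = any(len(op.reduce_axis) > 0 for op in compute_ops)
    return not has_reduce and len(compute_ops) < min_ops

# Prune trivial tasks (and their weights) before handing them to the tuner.
filtered = [(t, w) for t, w in zip(tasks, task_weights) if not is_trivial_task(t)]
tasks = [t for t, _ in filtered]
task_weights = [w for _, w in filtered]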
Comments and suggestions are welcome.