The current approach used by auto_scheduler to extract tuning tasks leverages Relay op strategy. In short, auto_scheduler registers an implementation in Relay op strategy just as AutoTVM does, but instead of using a TOPI schedule function, auto_scheduler creates an empty schedule and extracts the lowered TE compute as a tuning task (ref: https://github.com/apache/incubator-tvm/blob/main/python/tvm/relay/op/op.py#L147).
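To make this concrete, below is a minimal sketch, not the actual op-strategy code, of the underlying idea that a TE compute alone, with no TOPI schedule attached, already defines an auto_scheduler task. The conv2d_relu workload and its shapes are made up for illustration, and the exact task-creation API (SearchTask vs. create_task) may differ between TVM versions:

import tvm
from tvm import te, topi, auto_scheduler

@auto_scheduler.register_workload  # hypothetical workload, for illustration only
def conv2d_relu(N, CI, H, W, CO, KH, KW):
    data = te.placeholder((N, CI, H, W), name="data")
    kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
    conv = topi.nn.conv2d_nchw(data, kernel, stride=1, padding=1, dilation=1)
    out = topi.nn.relu(conv)
    # Only the compute is returned; no schedule is attached on purpose.
    return [data, kernel, out]

# The TE compute itself is the tuning task; auto_scheduler searches for a schedule.
task = auto_scheduler.SearchTask(
    func=conv2d_relu, args=(1, 3, 224, 224, 32, 3, 3), target="llvm"
)
print(task.compute_dag)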
However, an obvious issue with this approach is that the scope of a tuning task is limited by the Relay compile engine and op strategy. Specifically, each primitive Relay function can have at most one complicated op (i.e., reduce ops such as conv2d). The Relay compile engine marks that op as the anchor op (ref: https://github.com/apache/incubator-tvm/blob/main/src/relay/backend/compile_engine.cc#L231) and uses the TOPI schedule of that op to schedule the entire Relay function (ref: https://github.com/apache/incubator-tvm/blob/main/src/relay/backend/compile_engine.cc#L152).
Here is a motivating example:
def @main(%data: Tensor[(1, 3, 224, 224), float32], %weight1: Tensor[(32, 3, 3, 3), float32], %weight2: Tensor[(32, 32, 3, 3), float32]) {
  %3 = fn (%data1: Tensor[(1, 3, 224, 224), float32], %weight11: Tensor[(32, 3, 3, 3), float32], %weight21: Tensor[(32, 32, 3, 3), float32], Primitive=1) {
    %0 = nn.conv2d(%data1, %weight11, padding=[1, 1, 1, 1], kernel_size=[3, 3]);
    %1 = nn.relu(%0);
    %2 = nn.conv2d(%1, %weight21, padding=[1, 1, 1, 1], kernel_size=[3, 3]);
    nn.relu(%2)
  };
  %3(%data, %weight1, %weight2)
}
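(For reference, the function above can be constructed roughly as follows; the Python variable names are mine, and the 2-tuple padding is expanded by Relay into the 4-element form shown in the printed IR.)

import tvm
from tvm import relay

# Inner function: conv2d -> relu -> conv2d -> relu, marked as Primitive so
# that FuseOps keeps it as a single function.
data = relay.var("data1", shape=(1, 3, 224, 224))
w1 = relay.var("weight11", shape=(32, 3, 3, 3))
w2 = relay.var("weight21", shape=(32, 32, 3, 3))
conv1 = relay.nn.conv2d(data, w1, padding=(1, 1), kernel_size=(3, 3))
conv2 = relay.nn.conv2d(relay.nn.relu(conv1), w2, padding=(1, 1), kernel_size=(3, 3))
inner = relay.Function([data, w1, w2], relay.nn.relu(conv2)).with_attr("Primitive", 1)

# Outer main function that simply calls the primitive function.
d = relay.var("data", shape=(1, 3, 224, 224))
p1 = relay.var("weight1", shape=(32, 3, 3, 3))
p2 = relay.var("weight2", shape=(32, 32, 3, 3))
mod = tvm.IRModule.from_expr(relay.Function([d, p1, p2], relay.Call(inner, [d, p1, p2])))
print(mod)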
As can be seen, we manually set %3 to primitive so that it won't be partitioned into two separate functions by the FuseOps pass. If we simply build this function, we get the following error message:
Check failed: !anchor_op_.defined() || anchor_op_pattern_ < kCommReduce == false: Cannot apply TOPI schedule to a primitive function with two complicated ops anchor=Op(nn.conv2d) current=Op(nn.conv2d)
As a result, the goal of this RFC is to propose a mechanism that can turn the above Relay function into an auto_scheduler tuning task, and that also lets us build it with the resulting tuning logs.
The proposed mechanism is:
- Add a mode, use_topi_schedule, to the Relay compile engine. When use_topi_schedule=true, it behaves as it does today. When use_topi_schedule=false, we do not check whether the function has more than one reduce op but simply invoke auto_schedule_topi on the entire TE compute.
- Propagate the flag use_topi_schedule all the way to GraphRuntimeCodegen and relay.Build.
  - In auto_scheduler.extract_tasks, we set use_topi_schedule=false so that it can extract tasks.
  - In relay.build, we use auto_scheduler.DispatchContext.current to judge whether we should query the auto_scheduler schedule for the entire function or the TOPI schedule of the anchor op.
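Under this proposal, the user-facing flow would look roughly like the sketch below, reusing the mod built earlier with an empty params dict. The extract_tasks, TaskScheduler, TuningOptions, and ApplyHistoryBest calls follow the existing auto_scheduler API; how use_topi_schedule is toggled internally is hidden from the user, and the details may change in the final PR:

import tvm
from tvm import auto_scheduler, relay

target = tvm.target.Target("llvm")
params = {}

# Task extraction: use_topi_schedule=false inside the compile engine, so each
# primitive function becomes one task regardless of how many reduce ops it has.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

# Tune all extracted tasks and record the results to a log file.
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tuner.tune(auto_scheduler.TuningOptions(
    num_measure_trials=200,
    measure_callbacks=[auto_scheduler.RecordToFile("tuning.json")],
))

# Build: relay.build consults auto_scheduler.DispatchContext.current and queries
# the tuned schedule for the whole function instead of a TOPI anchor schedule.
with auto_scheduler.ApplyHistoryBest("tuning.json"):
    lib = relay.build(mod, target=target, params=params)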
The draft PR is available here. Note that since we now extract auto_scheduler tasks directly via the compile engine, we completely removed the auto_scheduler-related logic from Relay op strategy.
I also provide a running script here if you would like to play with more Relay functions.
One issue with this mechanism that hasn't been fully addressed is that we now collect too many tasks, including tasks with only placeholders and tasks with only a single layout_transform op (on CPU). Here are two possible solutions:
S1: Let the task scheduler judge: Since auto_scheduler has a task scheduler, it should be fine to send all tasks to it; after analyzing their compute DAGs, the task scheduler should not spend any tuning time on these trivial tasks.
S2: Embed heuristic or customized rules: Depending on the use case, we may find useful rules that easily prune out tasks during extraction. For example, we can ignore tasks without any call node, or tasks with fewer than N injective ops.
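As a strawman for S2, a small filter along these lines could run right after task extraction. The predicate and the threshold are illustrative only and not part of the proposal; it simply drops tasks whose compute DAG has no reduction and only a couple of compute ops:

from tvm import te

def is_trivial_task(task, min_ops=2):
    # A task containing only placeholders, or a single injective op such as
    # layout_transform, has no reduction and very few compute ops.
    compute_ops = [op for op in task.compute_dag.ops if isinstance(op, te.ComputeOp)]
    has_reduce = any(len(op.reduce_axis) > 0 for op in compute_ops)
    return not has_reduce and len(compute_ops) < min_ops

# Prune trivial tasks (and their weights) before handing them to the tuner.
filtered = [(t, w) for t, w in zip(tasks, task_weights) if not is_trivial_task(t)]
tasks = [t for t, _ in filtered]
task_weights = [w for _, w in filtered]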
Comments and suggestions are welcome.