[RFC] Refactor the compile_engine to expose a Relay -> TE translator

The current design of the compile_engine uses ScheduleGetter to translate a primitive function into a scheduled tensor expression. However, because this is a single all-in-one pass, it is directly coupled to the schedules defined in TOPI. It would instead be useful to break it into two stages: one which converts the Relay function into an unscheduled TE graph, and another which applies the TOPI-derived scheduling. We can then expose the Relay → TE translation step so that it can be reused by alternative scheduling approaches, for instance the cascading scheduling I outlined here.

In particular, I propose creating a TETranslator pass (deriving from MemoizedExprTranslator) and reducing the scope of ScheduleGetter so that it is just an ExprVisitor which picks out the anchor implementation and the function name; a rough skeleton of the translator is sketched below. The TETranslator would then be exposed as an API which could be reused by other components.
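To make the proposed split concrete, here is a minimal sketch of what the translator might look like. This is hedged: the class name, the Array<te::Tensor> result type and the Translate entry point are my assumptions for illustration; only the memoized visitor pattern mirrors the existing ScheduleGetter.

class TETranslator : public backend::MemoizedExprTranslator<Array<te::Tensor>> {
 public:
  explicit TETranslator(Target target) : target_(target) {}

  // Translate a fused primitive function into unscheduled TE tensors.
  Array<te::Tensor> Translate(const Function& prim_func) {
    // Bind each Relay parameter to a TE placeholder so the body can refer
    // to it (tuple-typed parameters elided for brevity). GetShape is the
    // existing helper in compile_engine.cc.
    for (Var param : prim_func->params) {
      const auto* ttype = param->type_as<TensorTypeNode>();
      te::Tensor placeholder = te::placeholder(GetShape(ttype->shape), ttype->dtype);
      memo_[param] = {placeholder};
    }
    return VisitExpr(prim_func->body);
  }

  // CallNode handling (lowering each op to its TE compute) is sketched
  // further down this thread; other visitors elided.
  Array<te::Tensor> VisitExpr_(const CallNode* call_node) final;

 private:
  Target target_;
};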

If we agree that this change would be valuable, then there is a question over how to name the Relay → TE translator component and where it should live. Here’s my current strawman:

  • TETranslator as a new pass in backend/compile_engine.cc
  • Expose the translator as a global which other components can look up (see the reuse sketch just after this list):
TVM_REGISTER_GLOBAL("relay.backend._TranslateToTE")
    .set_body_typed([](Function prim_func, Target target) {
      // Translate the fused primitive function into an unscheduled TE graph.
      auto translator = TETranslator(target);
      return translator.Translate(prim_func);
    });
  • Create a Python API under compile_engine.py called ‘translate_to_te’
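A minimal sketch of reusing the registered global from another C++ component; the Array<te::Tensor> return type is my assumption, based on the translator producing an unscheduled TE compute graph:

const runtime::PackedFunc* translate =
    runtime::Registry::Get("relay.backend._TranslateToTE");
CHECK(translate != nullptr) << "relay.backend._TranslateToTE is not registered";
// prim_func: a fused Relay primitive function; target: the compilation target.
Array<te::Tensor> tensors = (*translate)(prim_func, target);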

I’ve pushed a WIP PR with this strawman which you can find here.

Thanks

Thanks for the RFC! I do agree that it would be great to create an additional path to improve flexibility, especially now that we have auto_scheduler to schedule a TE graph from scratch (cc @merrymercy).

Meanwhile, I think tightly coupling the selection of schedule and compute is intentional, because an advanced schedule needs a specialized compute (e.g., NCHWc, Winograd), and that’s why Relay op strategy was designed to select both compute and schedule in one place (cc @haichen). Could you elaborate a bit more on this part? IIUC from your WIP PR, it seems you need to call lower_call twice (once in TETranslator and again in ScheduleGetter). In that case, it seems you still select the schedule in TETranslator (https://github.com/apache/incubator-tvm/blob/ed4cedce02a6ff608626bc61dfff6fc6f98004c9/src/relay/backend/compile_engine.cc#L259), and you perform the same process when visiting the primary function (https://github.com/apache/incubator-tvm/blob/ed4cedce02a6ff608626bc61dfff6fc6f98004c9/src/relay/backend/compile_engine.cc#L266)?

A follow-up question: since you still call lower_call in TETranslator, Relay op strategy is still required to register the mapping from Relay ops to TE computes. Accordingly, the logic for selecting a compute is still based on the schedule quality (or plevel by default), which doesn’t really improve flexibility but just lets you apply another schedule to the selected compute. Although this seems to be the main purpose of this RFC, we should have another mechanism to determine computes in TETranslator; otherwise it sounds weird to select a compute by referring to the quality of its corresponding TOPI schedule, which you won’t apply.

Also cc @zhiics

it seems you need to call lower_call twice (once in TETranslator and again in ScheduleGetter). In that case, it seems you still select the schedule in TETranslator

So yes, I do call it twice, and really this is a consequence of ‘lower_call’ probably needing a similar refactor as well. What I’m actually doing is ignoring the schedule information in TETranslator, even though lower_call does provide it. That way the output of the TETranslator would just be a TE compute DAG rather than a TE schedule. I really like that Ansor acts directly on TE rather than Relay, and I think that’s a pattern worth working towards for scheduling optimisations going forward.
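To make ‘ignoring the schedule information’ concrete, the CallNode handling in the translator would look roughly like this. It is a sketch based on the existing ScheduleGetter in compile_engine.cc: lower_call returns both the compute outputs and the chosen implementation, and only the former is kept.

Array<te::Tensor> VisitExpr_(const CallNode* call_node) final {
  // Collect the TE tensors already produced for the call's arguments.
  Array<te::Tensor> inputs;
  for (Expr arg : call_node->args) {
    for (te::Tensor tensor : VisitExpr(arg)) {
      inputs.push_back(tensor);
    }
  }
  // relay.backend.lower_call selects an implementation via the op strategy
  // and returns its compute outputs alongside that implementation.
  static auto flower_call = tvm::runtime::Registry::Get("relay.backend.lower_call");
  CHECK(flower_call) << "relay.backend.lower_call is not registered.";
  LoweredOutput lowered_out = (*flower_call)(GetRef<Call>(call_node), inputs, target_);
  // Keep only the unscheduled compute; the schedule half of the selected
  // implementation is deliberately dropped here.
  return lowered_out->outputs;
}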

it sounds weird to select a compute by referring to the quality of its corresponding TOPI schedule, which you won’t apply.

I agree with this. Perhaps we would need to provide TETranslator with a ‘StrategySelector’ that could be customized (a rough sketch of what such an interface might look like is below)? For my envisioned use case this happens not to be an issue, but I’d be interested in hearing opinions.
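For reference, a purely hypothetical sketch of what such a hook might look like; nothing like StrategySelector exists yet, while OpImplementation is the existing op strategy type:

// Hypothetical: a pluggable policy for picking a compute without
// consulting TOPI schedule plevels.
class StrategySelector {
 public:
  virtual ~StrategySelector() = default;
  // Choose one implementation for `call` from the candidates offered by the
  // op strategy; subclasses decide what "best" means for their use case.
  virtual OpImplementation Select(const Call& call,
                                  const Array<OpImplementation>& candidates,
                                  const Target& target) = 0;
};

TETranslator could then accept one of these at construction and defer compute selection to it, rather than reusing the plevel-based choice.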

In summary, I agree that there’s probably some distance to go in completing this refactor to expose something truly flexible. Once I get some more views/opinions on the best direction to take, I can make a start on improving the WIP PR.

Thanks for the clarification; I now feel we are on the same page. For the idea of a StrategySelector, I have no concrete proposal for now and would like to hear opinions from others as well.

@Hzfengsy @spectrometerHBH I’d be interested to hear your thoughts on this as I imagine it could have some overlap with the work you’re doing on TensorIR.

Another requirement I have for the general TE translator is to support an arbitrary Relay function, including Relay functions with more than one reduce op (e.g., conv2d). The current compile engine doesn’t allow this pattern because it selects one schedule implementation per Relay function, but this should no longer be a limitation if you are going to decouple the selection of compute and schedule. However, we probably don’t have to cover it in this RFC if that’s out of scope for you.