[DISCUSS] All-in-one Build API and Pass API Composability

This is a high-level meta-RFC on the general design principles for APIs involving passes and optimizations in the TVM stack. As we add new passes and optimizations to the infrastructure, there is increasing tension between two styles of API design.

To take a concrete example, imagine that we want to add a preprocessing option that quantizes an f32 model into int8. There are two possible options:

Option 1: Add Options to the All-in-one Build API

mod = relay.frontend.from_keras(model_name)

with relay.BuildConfig(quantize_model=True, quantize_start_layer=1):
    result = relay.build(mod)

In this case, relay.build serves as an all-in-one API that does everything, and the additional switches are like the -Wall switches in a compiler that turn things on and off.

Option 2: API Composition

mod = relay.frontend.from_keras(model_name)
mod = relay.quantize.Quantize(from_layer=1)(mod)
result = relay.build(mod)

We call a separate quantize pass before we call build. The advantage of this approach is that it is more composable. Imagine that I want to run an additional pass after quantization, for example to change my layout to a customized layout that fits the accelerator. We could insert a pass to do so:

mod = relay.frontend.from_keras(model_name)
mod = relay.transform.Sequential([
          relay.transform.ConvertLayout(from="NHWC", to="NHWC4c")])(mod)
result = relay.build(mod)

Summary of proposal in this RFC

This RFC advocates for option 2. Note that once we have option 2, we could also build a customized pipeline that exposes an API like option 1. Many of our proposed APIs started out looking like option 1, because option 1 is the API that traditional compilers expose through their CLI.
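To make the layering concrete, here is a minimal sketch in plain Python, without TVM. Everything in it is a hypothetical stand-in rather than an actual Relay API: a "module" is just a list of op names, `quantize` and `fold_constants` are toy passes, and `compose` and `build` are illustrative helpers. It only shows that an option 1 style entry point can be written as a thin wrapper over option 2 style passes.

```python
from functools import reduce


def compose(*passes):
    """Chain passes left to right into a single pass (option 2 style)."""
    return lambda mod: reduce(lambda m, p: p(m), passes, mod)


def quantize(from_layer=0):
    """Hypothetical pass factory: tag ops at index >= from_layer as int8."""
    def _pass(mod):
        return [op + ":int8" if i >= from_layer else op
                for i, op in enumerate(mod)]
    return _pass


def fold_constants(mod):
    """Hypothetical pass: drop ops whose name marks them as constants."""
    return [op for op in mod if not op.startswith("const")]


def build(mod, quantize_model=False, quantize_start_layer=0):
    """Option 1 style wrapper, implemented purely by composing passes."""
    passes = [fold_constants]
    if quantize_model:
        passes.append(quantize(from_layer=quantize_start_layer))
    return compose(*passes)(mod)


mod = ["conv2d", "const_pad", "relu", "dense"]
default_result = build(mod)
quantized_result = build(mod, quantize_model=True, quantize_start_layer=1)
# The same pipeline written out by hand with the composable API.
manual_result = compose(fold_constants, quantize(from_layer=1))(mod)
```

In this sketch the flag-based call and the hand-composed pipeline produce the same result, so users who want switches and users who want composition can share one set of passes.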

However, the range of possible optimization choices creates a need to explore different optimization pipeline patterns, much like the need to explore neural network architectures. Today, we are used to composable APIs that construct a ResNet layer by layer and then invoke it through a fit function. We can do the same for the pass API, with the analogy (pass <-> layer, build <-> fit).

Please share your thoughts on this, so that we can collectively arrive at a meta-guideline that helps us in future API designs.


cc @jroesch @zhiics @ajtulloch @FrozenGene @janimesh @yzhliu

+1 for option 2.

One of the concerns I have is that users may not be aware of some optimizations that they need to invoke. For example, in the QNN legalize case, they probably don’t know that there are QNN ops present in the graph, so they would not know whether they need to invoke the sequential pass. There might also be cases where they want to invoke the dialect passes in between some standard Relay passes, which would further complicate the design choice. I may be overthinking the problem.

+1 for option 2 with concerns.

I really like a highly flexible approach to pass development. It’s definitely a good way for developers to develop, experiment with, and integrate new passes. We could also set up multiple pipeline patterns, as we do now, so that users can simply use them without knowing the details.

The risk here is that compatibility and dependencies between passes are hard to maintain or resolve. Of course every pass owner should add comprehensive checkers to make sure the pass works as expected, but this is hard to get perfect, especially since a checker might need to be updated in the future due to changes in other passes. Ideally, we should have a unified format or language to describe such rules, but this part is still vague to me.

To specifically address the legalization API problem, a potential solution is to always call the Legalize pass, regardless of which dialects are present. Assuming all dialects share the same legalization mechanism, we might just be fine.

This is an interesting tradeoff. If a user knows to pass a dialect name to the build function (as in option 1), then the user can likely do it programmatically as well. If the user does not know what to pass, then we will need a Legalization API that the user can call and that is agnostic to the dialect itself.

For now, QNN is different. We call Legalize with a string that restricts legalization to QNN ops only: Legalize("FTVMQnnLegalize"). The idea is that qnn.conv2d goes through legalize to become nn.conv2d, which can then go through the normal Legalize later. So in this case we need to call Legalize more than once: first with FTVMQnnLegalize and then later with FTVMLegalize.

If the user does not know what to pass, then we will need a Legalization API that the user can call and that is agnostic to the dialect itself

Yeah, we need something like this. But, I am having difficulty figuring out how to do this.

We can discuss the design possibilities further.

If a string needs to be passed as a config option (rather than having a generic Legalize), then the user already needs to be aware of the QNN dialect, so option 2 would likely be fine here.

I prefer option 2. But I think we could support option 1 if there is a need in the future, and I do think we will need it in the near future, because option 1 is better for normal users: they don’t need to consider too much detail. It is like calling a compiler with O3, where LLVM runs many passes internally. Especially as our passes become more complex, pass ordering will also become a problem for normal users.

So, that is to say, when we have

with relay.BuildConfig(quantize_model=True, quantize_start_layer=1):

we could internally call

mod = relay.frontend.from_keras(model_name)
mod = relay.quantize.Quantize(from_layer=1)(mod)
result = relay.build(mod)

to help users. Our framework should cater to developers but also to normal users.

Maybe we can try to think from a user perspective to see if it makes sense.

Suppose a user wants to compile a pre-quantized TFLite model. Should that person know that there is QNN in there? Should that person have separate pipelines for FP32 models and pre-quantized models?

Maybe QNN is too specific to fall into one bucket or the other. For example, as you suggested, automatic quantization in Relay is a case where the user is aware of what they are trying to do. That is not so much the case with QNN.

Another option may be to allow passes to be inserted into the sequence of passes that is run as part of relay.build(…). If passes are named, then you might have functions like passmgr.run_before("ConstFolding", mypass) or passmgr.run_after(…), and passmgr.remove(…). If an earlier pass, or a frontend, realises that a specific pass is necessary, it could add it without the user being aware of it. Another advantage is that this could be controlled from the command line: just pass in a list of the pass names you want to run.
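A rough sketch of what such a named pass manager could look like, in plain Python. The `PassManager` class and its methods are hypothetical and are not part of TVM; unlike the calls written above, this version takes the new pass's name as an explicit argument, and a "module" is just a list that each toy pass appends to.

```python
class PassManager:
    """Hypothetical ordered pipeline of named passes (not a TVM class)."""

    def __init__(self, passes=()):
        self._passes = list(passes)  # list of (name, callable) pairs

    def _index(self, name):
        for i, (pass_name, _) in enumerate(self._passes):
            if pass_name == name:
                return i
        raise KeyError(f"no pass named {name!r}")

    def run_before(self, anchor, name, fn):
        """Insert pass `fn` immediately before the pass named `anchor`."""
        self._passes.insert(self._index(anchor), (name, fn))

    def run_after(self, anchor, name, fn):
        """Insert pass `fn` immediately after the pass named `anchor`."""
        self._passes.insert(self._index(anchor) + 1, (name, fn))

    def remove(self, name):
        del self._passes[self._index(name)]

    def names(self):
        return [pass_name for pass_name, _ in self._passes]

    def run(self, mod):
        for _, fn in self._passes:
            mod = fn(mod)
        return mod


# Toy default pipeline; each pass just records that it ran.
passmgr = PassManager([
    ("ConstFolding", lambda mod: mod + ["folded"]),
    ("FuseOps", lambda mod: mod + ["fused"]),
])
passmgr.run_before("ConstFolding", "QnnLegalize", lambda mod: mod + ["legalized"])
passmgr.run_after("ConstFolding", "MyLayout", lambda mod: mod + ["layout"])
passmgr.remove("FuseOps")
order = passmgr.names()
result = passmgr.run([])
```

Because passes are addressed by name, the same edits could in principle be driven by a list of names supplied on a command line.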

What you said (the user’s perspective of needing a good default pipeline) totally makes sense. I will elaborate in another post.

What I meant was that we should design the API in two categories and layer them (build a good default API on top of composable pipeline constructions). In the case of QNN, I ideally see two types of APIs:
1) For cases where a user is aware of the QNN dialect, option 2 makes sense, as passing in the dialect as an argument to build has a similar mental complexity to constructing a QNN-specific legalization pass before build.
2) For cases where a user just wants to make things work without knowing which dialects are present, we should build a generic Legalize transform (that might call into QNN’s legalize pass) that will be called by the TFLite frontend, by build, or by the user.
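As a toy illustration of the second case, the sketch below shows a dialect-agnostic legalize step in plain Python. The registry, the `register_dialect` decorator, and `generic_legalize` are all hypothetical names, a "module" is a list of op names, and the QNN lowering is reduced to a rename. The point is only that the transform is a no-op when no dialect ops are present, so it is safe to call unconditionally.

```python
# Hypothetical registry: dialect prefix -> function that lowers one op name.
DIALECT_LEGALIZERS = {}


def register_dialect(prefix):
    """Register a lowering function for all ops starting with `prefix`."""
    def _register(fn):
        DIALECT_LEGALIZERS[prefix] = fn
        return fn
    return _register


@register_dialect("qnn.")
def legalize_qnn(op):
    # Toy lowering: qnn.conv2d -> nn.conv2d (real QNN lowering does much more).
    return "nn." + op[len("qnn."):]


def generic_legalize(mod):
    """Lower ops of any registered dialect; leave all other ops untouched."""
    out = []
    for op in mod:
        for prefix, lower in DIALECT_LEGALIZERS.items():
            if op.startswith(prefix):
                op = lower(op)
                break
        out.append(op)
    return out


fp32_mod = ["nn.conv2d", "nn.relu"]
qnn_mod = ["qnn.conv2d", "nn.relu", "qnn.dense"]
fp32_out = generic_legalize(fp32_mod)  # unchanged: no dialect ops present
qnn_out = generic_legalize(qnn_mod)    # dialect ops lowered to core ops
```

A frontend or the build function could call something like `generic_legalize` without the user knowing whether a dialect is involved.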

Trying to summarize the discussion so far. Thanks to everyone who has shared their thoughts.

Many agree that option 2 is better for developers who want to experiment with alternatives. But @FrozenGene and @janimesh raise a good point: end users who do not necessarily know about pipeline construction want a good default pipeline that just works.

@leo-arm suggested following the traditional compiler approach of a sequential pipeline with configurable flags.

@FrozenGene also made a very good observation that we can build a user-facing API for the default pipeline using the APIs in option 2.

I will try to add a few clarifications about the proposal

The need for good default pipeline

I think everyone agrees that we need a good default pipeline; this is what the build function provides so far. The main point we brought up in this proposal is only about how to deal with variability (e.g. the choice of quantization).

The first post suggested that when such variability arises, it might be better to handle it programmatically in a composable manner. The user would otherwise have to add a parameter to the all-in-one function, which takes about as much effort as adding another line before calling the build function.

If our goal is to refine the default pipeline, we should strive to make it simple, perhaps being composed with good defaults, so users do not have to pass in any additional parameters.

The need to go beyond a single good pipeline

ML optimization is quite a complicated landscape and is still a wide open area. We can hardly say there is one unique set of rules that will give us the optimal code. The default pipeline itself also needs to evolve, perhaps by allowing developers to try out different pipeline compositions for cases such as quantization choices, different layouts, and fusion options. Of course the eventual goal would be automation, e.g. exploring a collection of choices and then picking the best one.

The open nature of ML optimizations brings the need to allow developers to programmatically construct pipelines and inspect intermediate optimization results. These APIs also become compositional tools that allow us to collect statistics, run profiles, and visualize results in an interactive environment.
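As a small illustration of inspecting intermediate results, here is a sketch in plain Python. The `run_with_trace` helper and the two toy passes are hypothetical and do not correspond to TVM APIs; a "module" is a list of op names, and the only statistic collected is the op count after each pass.

```python
def run_with_trace(passes, mod):
    """Apply named passes in order, recording the op count after each one."""
    trace = [("input", len(mod))]
    for name, fn in passes:
        mod = fn(mod)
        trace.append((name, len(mod)))
    return mod, trace


def remove_identity(mod):
    """Toy pass: drop identity ops."""
    return [op for op in mod if op != "identity"]


def fuse_conv_relu(mod):
    """Toy pass: merge a relu into the conv2d that directly precedes it."""
    out = []
    for op in mod:
        if op == "relu" and out and out[-1] == "conv2d":
            out[-1] = "conv2d_relu"
        else:
            out.append(op)
    return out


mod = ["conv2d", "identity", "relu", "dense"]
final, trace = run_with_trace(
    [("RemoveIdentity", remove_identity), ("FuseConvRelu", fuse_conv_relu)],
    mod,
)
```

With composable passes, this kind of per-pass bookkeeping lives outside the passes themselves, which is harder to arrange when everything sits behind a single build call.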

Users of the API

This discussion also touches on the user categories. There are two types of users:

  • a) Developers who use TVM as an infrastructure library to transform their models, and who want to try out quantization, layout, and other optimization choices.
  • b) Users who use TVM as a normal compiler or an inference engine, and just want to call build and ship the model.

APIs in the option 2 style definitely make users in category (a) more productive. Our hope is that by building high-level APIs (good default pipelines) on top of the composable API, we can quickly explore the possible optimizations and bring better optimizations to users in category (b).


Yes, if the user is aware, option 2 is a better design.

  • We might not be able to call it in the TFLite frontend because QNN has 2 passes, one is target-dependent.
  • So, let’s say we create a generic Legalize transformation and call it inside build. The question then is where we call it and in what sequence. If we have multiple Legalize calls, do we need to call other non-Legalize passes between them? QNN passes, as of now, leave the graph untouched if there are no QNN ops. So, functionally, it might be OK to collect all such dialect passes that have no side effects and call them before relay.build.

PS - Thanks for managing the discussion. I am asking more questions here than proposing solutions :)

My understanding with respect to legalization is that we should always call legalize (be it target-dependent or not) before invoking other optimizations. This makes sure that the optimizations stay in the core and are always composable. It would be interesting to ask whether it is OK to provide a single legalization pass that knows the target (by folding the target-invariant pass into the target-dependent one) and be done with it.

Aah, actually this might work. We can call Legalize before anything else. Let me see if this can be coded easily without leaking dialect code in BuildModule.

@tqchen Tried to make Legalize somewhat generic. Let me know if that seems like a good idea.