Base on discussions with @mbrookhart @jroesch @jwfromm
The relay IR support two kinds of normal forms — Graph code that inherently represents a DAG in a computational graph, and functional style. This document explains https://docs.tvm.ai/dev/relay_intro.html their differences and pros and cons. We adopt the text notation to distinguish the graph node(temporary expression) and let bindings as in the above link.
This RFC discusses the pros and cons of each normal form and proposes a new normal form that can help to balance the need for pattern matching and explicit scoping.
Ease of Pattern Matching
The main advantage of the graph form is the ease of pattern matching. To show the difference, let us discuss how can we match the pattern add(?, mul(?, ?))
in the code blocks below
// graph form
def @GraphF(%x, %y) {
%1 = mul(%x, %x)
%2 = add(%1, %y)
%2
}
// A-normal form
def @ANormalF(%x, %y) {
let %v1 = mul(%x, %x)
let %v2 = add(%v1, %y)
%v2
}
-
In the GraphF case, because the data structure is explicitly nested. We just need to detect a call node with add op, then checks whether one of its child is a mul.
-
In the case of the ANormalF, we will need to construct a map from variables(v1, v2) to the binded values, and lookup the binded values in order to run pattern matching.
As we can see that the direct nesting data structure in the graph form is a bit more friendly for running deep-nested pattern matching, which is one of the common needs for high-level graph optimizations.
Ambiguity of Graph Form under Scope (and Effects)
While the graph form is friendly for pattern matching and rewriting, it introduces quite a bit of ambiguity when we start to introduce scopes(function, if). In the code example below. It is unclear how should we evaluate %1. From the text representation it seems to hint that we need to evaluate %1 outside the the if scope. However, if we think about the actual data structure, %1 actually resides inside the then and else value of the if expression. See https://docs.tvm.ai/dev/relay_intro.html for related discussions
// graph form
def @GraphF(%x) {
%1 = add(%x, 1)
%2 = equal(%x, 2)
if (%2) {
mul(%1, 2)
} else {
%1
}
}
One could argue such ambiguity is good in certain cases at the high level, where the developer does not need to worry about the values. However, the real trouble comes when we start to introduce certain ops that may have side effect. The example below shows a potential ill-formed example when we start to mix graph form with effect
// ill-formed graph
%1 = load(A[0])
%2 = add(%1, 2)
%_ = store(A[0], %2)
%3 = add(%1, %1)
In the case of %3, it is unclear whether we should compute %3 before the store or after the store. Such ambiguity causes quite a lot of problems for the Tensorflow control-flow where developers need to explicit inject the control flow dependencies(which is very hard for most people).
If we use ANF with explicit let-binding to express the above program, then the semantics is clear. It also implies we cannot reorder let-scopes that involves side effect.
Discussions
To summarize the pros and cons discussed above
-
Graph form brings ease of pattern matching, but usually that means the execution can be reordered, which is OK for blocks that does not have effect, but can be dangerous when we reorder across boundaries with effect.
-
Explicit let bindings provides clear scope of computation, and order for values that need side effect.
The high level take away is that the graph form is nice, but up to a point until we want to have the graph node across scopes(especially when some of the scope brings side effect).
Ideally, we want to be able to identify dataflow-only blocks, run graph rewriting on these, but still be able to deal with scopes and bindings as in the A normal form.
Basic Block Normal Form
We introduce a new normal form “basic block normal form”(tentative name), which serves as a middle ground between the graph form and A-normal form. In particular, the requirements are:
- D0, Define a block as a group of expressions implied by the scope structure.
- D1: Each graph node can only belong to a single block
- D2: For values that are being used in multiple blocks, they have to be referred by a Var which is defined in a block
The code block below gives an example about the proposed normal form, the comment after each line of the code highlights which block each expression belongs to.
// block normal form
def @BBlockNF(%x) {
let %v1 = add(%x, 1) // block0
%2 = equal(%x, 2) // block0
if (%2) { // block0
%3 = mul(%v1, 2) // block1
%4 = add(%3, %3) // block1
%4 // block1
} else {
%v1 // block2
}
}
One important highlight is the relation between a block and scope. Note that block0 is does not include block1 and block2(that is why v1 need to be bound by the let binding), while the scope that block0 resides in encompasses the scope of block1 and block2. %2 does not need to be bounded because the cond of the if expression belongs to the same block. One way to think about this is to imagine we implement the code using CFG, then the comparison condition will be at the same block as block0, before we jump to the other ones.
Comparing this form to the CFG and SSA, the Var binding marks values that can be used for cross block value passing and makes sure each of the expression(other than Vars) belongs to a single block.
A “block” can be directly view as a basic block as in CFG. Because each of the expr node only belongs to a single block(except for Var). Most of the dataflow rewriting for the graph form can be reused safely(because we still get DAGs within a block) without worrying about polluting across scopes — because the rewriting will stop at Vars(which indicates the block boundary). Unlike CFG, where a single basic block can contain several ops with side effect and control dependencies, we will need to divide effects into different blocks and make control flow dep only happen between the blocks. So that it is always OK to do dataflow optimization within a block(which can results in reordering within a block). This decision is made base on the fact that side effect are rare compared to the normal dataflow ops in deep learning.
We can introduce “block-pass” that only rewrites the graph within each block, just like the basic block pass in CFG-based IRs. Choosing this normal form does imply that optimizations across scopes(such as common expression lifting outside the if scope, propagate layout across scopes) “becomes harder” and can not be covered by block-level optimizations.
However, such hardness is a result due to the nature of multiple scope programs. Having the distinction serves an explicit reminder for pass-writers who want to deal with cross scope optimizations, instead of making false assumptions that causes problems when dealing with multi-scope programs.
Basic Block representation
While we have the concept of a block, we do not necessarily have to introduce a specific Block data structure. Instead, a block can simply be implied by the boundary of scopes and sub-scopes. So we can directly represent this normal form under the current IR.
Summary
The block normal form is useful way to bring together advantage of graph form with the need for explicit scoping and ordering. We can envision the relay pipeline to work as follows:
model → bblock form → optimizations → bblock form → A-normal form → low-level opts → lowering to TIR.