Both mlc-ai and tlc-pack have a Relax project, and they seem to be developed independently. Which one should we follow if we want to use Relax and TensorIR to integrate our new hardware backend?
Glad to see Relax is getting more interest from the community. tlc-pack is our primary codebase for Relax development, while mlc-ai is a fork of tlc-pack for demonstration and teaching.
BTW, I’m happy to help if you’d like to integrate a new hardware backend.
Thank you for your reply. Our NPU backend is based on a CGRA architecture, and we have developed its compiler on top of LLVM. Currently we are restricted by the host chip, which uses a commercial compiler, so we want to use Relax to generate the device code for neural network inference. We have some questions before digging into the Relax code:
- How do we attach a TensorIR script and its manual schedule to a Relax op?
- How can we reuse the Relay optimization passes and the QNN dialect?
- Can Relax/TensorIR handle variable input shapes, such as a changing input image size in a CNN?
Relax allows users to write TensorIR functions and Relax functions in one IRModule. We can also use an IRModule pass to modify, update, or apply schedules.
That’s a good question. Codebase reuse is important; however, the Relax AST is incompatible with Relay passes, so we cannot simply reuse them. One workaround is to import the model into Relay, apply the Relay passes, and then translate the result to Relax.
Dynamic shapes are also still a problem. Relax and TensorIR can both “express” symbolic shapes, but we do not yet have an e2e tuning solution for workloads with symbolic shapes.
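To make “expressing” a symbolic shape a bit more concrete, here is a minimal plain-Python sketch of the idea: symbolic dimensions such as (m, n) and (n, k) are bound to the actual extents at run time and checked for consistency. It does not use the Relax or TensorIR API; `bind_shape` and `matmul_out_shape` are hypothetical helpers made up for illustration only.

```python
from typing import Dict, Sequence, Tuple, Union

# An int is a static extent; a str names a symbolic extent such as "m".
Dim = Union[int, str]


def bind_shape(pattern: Sequence[Dim], actual: Sequence[int],
               bindings: Dict[str, int]) -> Dict[str, int]:
    """Match a runtime shape against a pattern and bind its symbolic dims.

    Hypothetical helper: it mimics the idea of shape matching, not the
    way Relax actually implements it.
    """
    if len(pattern) != len(actual):
        raise ValueError(f"rank mismatch: {tuple(pattern)} vs {tuple(actual)}")
    for dim, extent in zip(pattern, actual):
        if isinstance(dim, int):
            # Static extent: must match exactly.
            if dim != extent:
                raise ValueError(f"expected extent {dim}, got {extent}")
        elif bindings.setdefault(dim, extent) != extent:
            # Symbolic extent already bound to a different value.
            raise ValueError(
                f"{dim} is bound to {bindings[dim]}, got {extent}")
    return bindings


def matmul_out_shape(x_shape: Sequence[int],
                     w_shape: Sequence[int]) -> Tuple[int, int]:
    """Output shape (m, k) for x: (m, n) and w: (n, k)."""
    bindings: Dict[str, int] = {}
    bind_shape(("m", "n"), x_shape, bindings)
    bind_shape(("n", "k"), w_shape, bindings)  # "n" must agree with x
    return bindings["m"], bindings["k"]
```

Calling `matmul_out_shape((2, 3), (3, 5))` binds m, n and k from the inputs, while a mismatched inner dimension raises a `ValueError`. What this sketch cannot show is the tuning part: a schedule tuned for one set of extents is not guaranteed to be good for another.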
If we use handwritten schedules in our backend, can we avoid this problem?
About Relay pass reuse, the RFC below proposes a solution that uses a custom relay-to-tir pass to lower Relay ops to TensorIR. What’s your opinion of this solution, and what are its pros and cons?
Introducing TY-NNP backend with end2end TensorIR integration
Sorry for the late response.
I’m not sure, since we have not tried it yet; however, there is no known issue with symbolic shapes.
relay-tir is a proven e2e solution. The problem is that it may run into the limitations of Relay (e.g. symbolic shape, cross-kernel optimization, etc.).
Thank you. I tried the dynamic case which is annotated in tests/python/relax/test_autoir_integration.py
@T.prim_func
def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None:
    T.func_attr({"global_symbol": "tir_matmul"})
    m = T.var("int32")
    n = T.var("int32")
    k = T.var("int32")
    A = T.match_buffer(x, (m,n))
    B = T.match_buffer(y, (n,k))
    C = T.match_buffer(z, (m,k))
    for (i0, j0, k0) in T.grid(m,n,k):
        with T.block():
            i,j,k = T.axis.remap("SSR", [i0,j0,k0])
            with T.init():
                C[i,j] = 0.0
            C[i,j] += A[i,k] * B[j,k]

@T.prim_func
def tir_relu(x:T.handle, y:T.handle):
    T.func_attr({"global_symbol": "tir_relu"})
    m = T.var("int32")
    n = T.var("int32")
    A = T.match_buffer(x, (m,n))
    B = T.match_buffer(y, (m,n))
    for (i,j) in T.grid(m,n):
        with T.block():
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = T.max(A[vi, vj], 0.0)

@R.function
def main(x:Tensor((m,n), "float32"), w:Tensor((n,k), "float32")) -> Tensor:
    with R.dataflow():
        sh = relax.call_packed("vm.builtin.shape_of", x)
        x0 = relax.match_shape(sh, (m, n))
        sh1 = relax.call_packed("vm.builtin.shape_of", w)
        x1 = relax.match_shape(sh1, (n, k))
        lv0 = R.call_tir(tir_matmul, (x, w), (m, k), dtype="float32")
        lv1 = R.call_tir(tir_relu, (lv0), (m, k), dtype="float32")
        relax.output(lv1)
    return lv1
But I got the following error. Any help will be appreciated
I have found a dynamic shape relax example in this tutorial.
Currently, Relax has a VM executor. Is there any plan to support a graph executor and memory planning for Relax? @Hzfengsy
Hi @feiyulv,
Thanks for your interest in Relax! There were some minor API updates; you can find the latest version of the tutorial, which works on the current Relax codebase, in this gist: relax_demo.ipynb · GitHub.
There is no plan to add a graph executor to Relax. The main reason is that dynamic shape cases require dynamic memory allocation, while the graph executor relies on static memory allocation. We have also tested that the current Relax VM can match the performance of the Relay graph executor on several models.
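As a rough illustration of the static versus dynamic allocation point, the plain-Python sketch below contrasts an ahead-of-time plan, which needs every extent to be a known integer, with run-time allocation, which resolves symbolic extents from the actual input. This is an assumption-laden toy: `static_plan_bytes` and `runtime_alloc_bytes` are hypothetical names, the "plan" is just a sum of buffer sizes with no reuse, and neither function reflects how the graph executor or the Relax VM is implemented.

```python
from math import prod
from typing import Dict, Sequence, Union

# An int is a static extent; a str names a symbolic extent such as "m".
Dim = Union[int, str]


def static_plan_bytes(shapes: Sequence[Sequence[Dim]],
                      itemsize: int = 4) -> int:
    """Ahead-of-time planning: total bytes for all buffers.

    Hypothetical helper. It can only succeed when every extent is a
    concrete integer, because the sizes are fixed before execution.
    """
    total = 0
    for shape in shapes:
        if not all(isinstance(d, int) for d in shape):
            raise ValueError(
                f"cannot plan {tuple(shape)} statically: symbolic extent")
        total += prod(shape) * itemsize
    return total


def runtime_alloc_bytes(shape: Sequence[Dim], bindings: Dict[str, int],
                        itemsize: int = 4) -> int:
    """Run-time allocation: resolve symbolic extents, then size the buffer.

    Hypothetical helper. `bindings` maps symbolic names to the extents
    observed on the actual input.
    """
    concrete = [d if isinstance(d, int) else bindings[d] for d in shape]
    return prod(concrete) * itemsize
```

For example, `static_plan_bytes([(2, 3), (3, 5)])` works because every extent is known, whereas a shape such as `("m", 3)` can only be sized through `runtime_alloc_bytes` once `m` has been observed at run time.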
Memory planning is an important pass, and several folks are working on it (related PRs: https://github.com/tlc-pack/relax/pull/255, https://github.com/tlc-pack/relax/pull/250). The memory planning pass will be executor-agnostic, i.e. it will work across VM and AOT executors.
Thank you for your help