Introducing TY-NNP backend with end2end TensorIR integration

Hi, all~

This RFC proposes upstreaming support for our TY-NNP accelerator backend. We are from the AI accelerator toolchain team at Intellifusion, which focuses on developing vision processors that accelerate deep neural networks for visual recognition and search on endpoints, such as IP cameras and robots, as well as in the cloud.

Nowadays, TVM has become the most important component in our AI software stack and we would like to upstream our work back. We believe participating in the open-source ecosystem will benefit both the internal software infrastructures and our customers!

Overall architecture

TY-NNP refers to the neural network accelerator architecture serving a wide range of our edge AI scenarios. TY-NNP follows a typical NPU design that offloads neural network computation workloads to various kinds of domain-specific computing units. Generally, there are three kinds of computing units:

  • NU (neural units)

    NU is designed for high-throughput computation of typical neural-network workloads such as Conv/Matmul. Compared to Tensor Cores on NVIDIA GPUs, NU works in a coarse-grained fashion from a software perspective. Instead of programming fine-grained M * N * K mma intrinsics in software, developers are given CISC-style instructions and a bundle of hardware configurations. The NU components automatically load input/weight data from input buffers, execute fine-grained mma operations with hardware tiling control, and store results to output buffers.

    In TVM, we program NU with customized TIR intrinsics. Developers should use schedules to lower the specified computation patterns to NU intrinsics, arrange the on-chip input/output buffers, and perform tuning to determine the best hardware configurations.

  • VU (vector units)

    VU accelerates general computation workloads that do not fit NU. TY-NNP provides a set of on-chip VU cores, each with its own on-chip buffer (called VM), a set of vectorized/scalar function units, and physical registers. VU programming is just like general vectorized programming on CPUs.

    In TVM, to offload the computation to VU, developers should schedule the computations into vectorizable form, arrange the on-chip input/output buffers, and mark the proper computation axis with vectorize or replace it with VU intrinsics.

  • CU (control units)

    CU can be seen as a small on-chip core and does not provide high computation capability. It controls the on-chip execution flow, and the whole on-chip kernel execution starts from CU.

TY-NNP uses an explicitly managed memory hierarchy: each computing unit has its own buffer, and a global on-chip buffer (called DM) transfers data between units. Data transfer is done explicitly by asynchronous DMA operations, and explicit/implicit synchronizations are used to avoid hazards. In TVM, DMA and synchronization are also represented by TIR intrinsics.

Off-chip storage (called DDR) is used to transfer data between host and device. It is much larger than the on-chip buffers and supports dynamic memory allocation. In TVM, the DDR storage simply corresponds to the storage scope kGlobal and is managed by the runtime.

Implementation design

The current TVM compilation stack for TY-NNP is as follows:

Relay level

  • We use a fusion pass based on a dedicated hardware cost model. Beyond traditional heuristic-based fusion for conv-bn-relu-like patterns, it applies a much more aggressive strategy that merges multiple anchor ops, such as conv, into a single device kernel. This brings opportunities to schedule multiple anchor ops simultaneously, which we think is essential to saturate our NPU hardware.
  • A schedule-aware layout rewrite mechanism is added. Our TIR schedule phase rewrites tensor layouts to fit hardware features, so we modify the compile engine to allow compatible updates at the Relay level.
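
To make the idea of cost-model-driven fusion more concrete, here is a minimal sketch in plain Python. It is not our actual pass: the `Op` record, the cost constants, and the `kernel_cost` formula are all hypothetical and chosen only for illustration, and input reads from DDR are ignored. It greedily extends a kernel with the next op whenever the toy cost model says the fused kernel is cheaper than two separate launches.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Op:
    name: str
    compute_cycles: int   # hypothetical compute cost
    output_bytes: int     # size of the op's result tensor


# Hypothetical hardware constants, for illustration only.
LAUNCH_OVERHEAD = 1000     # cycles per kernel launch
DDR_CYCLES_PER_BYTE = 2    # cycles to write one byte back to DDR
ON_CHIP_BYTES = 4096       # on-chip capacity for intermediate results


def kernel_cost(ops: List[Op]) -> float:
    """Toy cost of running `ops` as one kernel; inf if intermediates do not fit."""
    intermediates = sum(op.output_bytes for op in ops[:-1])
    if intermediates > ON_CHIP_BYTES:
        return float("inf")
    # Only the final result is written to DDR; intermediates stay on chip.
    compute = sum(op.compute_cycles for op in ops)
    return LAUNCH_OVERHEAD + compute + DDR_CYCLES_PER_BYTE * ops[-1].output_bytes


def greedy_fuse(chain: List[Op]) -> List[List[Op]]:
    """Extend the current kernel while fusing is cheaper than a separate launch."""
    groups: List[List[Op]] = []
    for op in chain:
        if groups and kernel_cost(groups[-1] + [op]) < kernel_cost(groups[-1]) + kernel_cost([op]):
            groups[-1].append(op)
        else:
            groups.append([op])
    return groups


chain = [
    Op("conv1", 5000, 2048),
    Op("relu", 100, 2048),
    Op("conv2", 5000, 8192),
    Op("conv3", 4000, 1024),
]
groups = greedy_fuse(chain)
```

With these made-up numbers, two anchor ops (conv1 and conv2) end up in the same kernel, while conv3 starts a new kernel because the accumulated intermediates would no longer fit on chip.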

TIR level

A key difference from the current CPU/GPU design is that we try to schedule and tune blocks of multiple ops. On a GPU device it is fine to compute a single heavy op in a single kernel, but we think an NPU may prefer to launch a block of consecutive ops to avoid frequent kernel launches. The fusion pass described above is a way to achieve this.

Also, since the main efforts of the TVM community are on CPU/GPU backends, there are pain points when developing TIR support for NPU-style backends. It took us some effort to make it work through the standard schedule → lower flow.

  • We use TensorIR schedule ([RFC] TensorIR: A schedulable IR for TVM) to schedule the computations. This is the first trial of TensorIR schedule on NPU infrastructures as far as we know.
  • A set of new schedule primitives are added to utilize hardware features.
  • A set of new tir passes are added to utilize hardware features.
  • We use the device_scope attr to mark the kernel part of the code. The community host-dev split mechanism works well for us.

Target level

  • For codegen, we developed class CodeGenTYNNPLLVM: public CodeGenLLVM
  • For runtime, we developed class TYNNPDeviceAPI: public DeviceAPI

How to run

Dependencies

The TY-NNP backend depends on the following prebuilt binaries:

  1. LLVM libraries with TY-NNP target support
  2. TY-NNP assembler
  3. TY-NNP driver libraries with integrated simulator

They will be available after upstreaming. We are also more than glad to provide Docker environments for anyone interested in our hardware.

Playing

All dependencies are integrated into codegen and runtime, so users can just use general interfaces in a normal way with only two extra cmake options.

# enable TY-NNP support in config.cmake
set(USE_TYNNP ${path to TY-NNP toolchains})
set(USE_LLVM ${path to llvm-config of TY-NNP target support})

import tvm
from tvm import relay
from tvm.contrib import graph_executor, ty_nnp  # ty_nnp lives in python/tvm/contrib/ty_nnp

# test from tir
with ty_nnp.build_config():  # customized pass context
    dev = tvm.ty_nnp(0)
    a = tvm.nd.array(a_np, dev)
    b = tvm.nd.array(b_np, dev)
    f = tvm.build(primfunc, target="ty-nnp")
    f(a, b)

# test from relay
with ty_nnp.build_config():  # customized pass context
    dev = tvm.ty_nnp(0)
    a = tvm.nd.array(a_np, dev)
    lib = relay.build(relay_module, target="ty-nnp")  # Relay modules are built with relay.build
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input(0, a)
    m.run()
    b = m.get_output(0)

CI Integration

Although we maintain full-scenario tests in our internal repositories, it would be great if some key features (e.g., the conv op) could be covered by community CI. We could provide Docker images that set up the backend testing environment. Any detailed suggestions for CI integration are very welcome!

What we want to contribute

Currently, our backend code lives in the contrib subdirectories of the corresponding code directories:

  • c++: src/contrib/ty_nnp (except codegen/runtime)
  • python: python/tvm/contrib/ty_nnp
  • unittests: tests/python/contrib/ty_nnp

They can be summarized in the following aspects:

TY-NNP codegen and runtime

Runtime is in src/runtime/contrib/ty_nnp and LLVM codegen is in src/target/ty_nnp

  • This will introduce a new device type kDLTYNNP and a new target name TY-NNP. The corresponding codegen/runtime code is incremental and does not affect upstream source code.
  • A set of new StorageRank enums has to be added to specify different on-chip buffer types. We would be glad to know the best way to define such target-related information.

TIR optimizations on TY-NNP target

TIR codes are mainly in src/contrib/ty_nnp/tir

  • This will introduce a set of backend TIR passes for TY-NNP hardware features, such as DMA intrinsics, synchronization, static address allocation, etc. They are designed for our hardware only. Users call ty_nnp.build_config() to get the specific pass context.
  • In the tvm.build process, we introduce more flexible configurations, such as disabling standard passes that are incompatible with ours.

TensorIR schedule proposal

  • We would like to introduce a set of new schedule primitives

    • Imperative loop partition

      Users can either partition the loops and blocks immediately at the schedule phase or lazily perform it in the loop_partition pass. It helps a lot in imperfect tiling cases or where boundary conditions are not directly supported by the hardware.

      _, _, h_axis, w_axis, _, = s.get_loops(block)
      
      # imperative
      partitioned = s.loop_partition([h_axis, w_axis], lazy=False)
      # partitioned is a tree structured data structure tracing partitioned blocks
      my_visit(partitioned)
      
      # lazy, only hint tag added
      s.loop_partition([h_axis, w_axis], lazy=True)
      
    • Buffer/loop primitives duality

      TVM already provides very convenient primitives for loops. However, it would be great to explicitly manage memory orders as well as computation orders. We believe that for many NPU scenarios it is essential to control the data layouts of on-chip memory buffers. TensorIR can control buffer dim alignment, but that is not enough. On-chip buffers local to NPU-specific function units (imagine TensorCore) can take totally different memory layouts. This also benefits any infrastructure with a manageable memory hierarchy.

      Just as we get nested loops with get_loops(block), we provide dual designs to get buffer axes, such as get_write_buffer_axes(block, write_idx), and perform buffer layout scheduling on these axes. Below is a table of the loop/buffer primitive duality, which includes the proposed new primitives:

      Loop schedule        Buffer schedule
      -----------------    --------------------------------------------
      get_loops            get_write_buffer_axes, get_read_buffer_axes
      split                buffer_split
      fuse                 buffer_fuse
      reorder              buffer_reorder
      loop_extent_align    buffer_dim_align
  • Accommodated scheduling and tuning. Mainly in python/tvm/contrib/ty_nnp/topi

    Currently the schedule/tuning logic is designed for our hardware features only. However, we are very interested in whether there are common methodologies for such NPU schedule designs. We would like to refine our code into more general schedule/tuning support in the TensorIR modules if such opportunities exist!
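
To illustrate what the imperative loop partition primitive above is meant to achieve, here is a minimal sketch in plain Python. It does not use the proposed loop_partition API; the helper names `partition_axis` and `partition_2d` are hypothetical. It splits each axis into a region covered by full tiles (no boundary check needed) and a tail region, and combines the per-axis regions into the interior, edge, and corner cases that a partitioned loop nest would have to handle separately.

```python
from itertools import product
from typing import List, Tuple

Range = Tuple[int, int]  # half-open interval [start, stop)


def partition_axis(extent: int, tile: int) -> List[Range]:
    """Split [0, extent) into a full-tile region and, if needed, a tail region."""
    full = (extent // tile) * tile
    regions: List[Range] = []
    if full > 0:
        regions.append((0, full))        # every tile here is complete
    if full < extent:
        regions.append((full, extent))   # remainder that needs special handling
    return regions


def partition_2d(h: int, w: int, tile_h: int, tile_w: int) -> List[Tuple[Range, Range]]:
    """Combine per-axis partitions: interior, two edges, and one corner at most."""
    return list(product(partition_axis(h, tile_h), partition_axis(w, tile_w)))


# A 10 x 9 iteration space tiled by 4 x 4 yields four regions.
regions = partition_2d(10, 9, tile_h=4, tile_w=4)
```

Here the first region, ((0, 8), (0, 8)), is the interior where tiling is perfect; the other three regions are the boundary cases.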

Relay accommodation

Mainly in python/tvm/contrib/ty_nnp/relay and src/contrib/ty_nnp/relay

As described in the implementation design:

  • Currently our fusion pass depends on hardware-specific cost models. We’d like to refine our code into an auto-fusion framework with third-party cost models if possible.
  • Schedule-aware layout rewrite transformation. We add a relay pass to perform a “pre-schedule” that determines the best data/weight layout, and then the pass can rewrite the relay-level layouts according to the signature of the PrimFunc. Currently, we have to hack the compile engine to find the pre-scheduled PrimFunc from a standalone cache; we would be glad to know the best way to achieve this goal.
  • To utilize the scheduling described above, we propose to insert a customization point in the compile engine, which could differ from the fallback schedule, auto-schedule, and meta-schedule.
  • We add some customized relay ops such as sum_pool2d. We would be glad to add them as standard relay ops if they are generally useful.

Summary

  • We implemented TY-NNP runtime and codegen. They are introduced as standalone modules with USE_TYNNP compile option.
  • We integrate TensorIR (and the corresponding Relay adaptations) to perform scheduling and optimization for our target. This will introduce some adaptations and new features to upstream code. Perhaps we should split them into standalone PRs/RFCs?

Thanks for all your attention, and any suggestions or comments would be appreciated. We are proud to contribute consistently as part of the community.

Thank you @wrongtest. It would be great to start with a base RFC that establishes the basic infra, then follow up with further RFCs.

If there is a change that touches some key data structures (e.g. changes to TensorIR nodes) and can affect other backends, separate RFCs would be appreciated since these would enjoy a broader discussion – the TYNNP use case can be used as a motivating factor, and there could be other related applications, or tradeoffs affecting existing backends, that need to be considered.

For code that is relatively isolated, follows the current architecture, and is specific to TYNNP (e.g. TensorIR code-gen for TYNNP, or a TYNNP-specific pass/primitive that follows the same architecture and makes no changes to the common data structures), it is good to bundle it in one RFC, and we certainly want to encourage quick adoption of new backends and primitives under the unified architecture.

thanks for the writeup @wrongtest! a couple points I am more curious about:

could you say more here? is this a Relay-level thing or a TIR thing? presuming you’ve implemented this as a pass, how do you plan to ensure that the Relay-level pass makes the same scheduling decision as the TIR pass?

it seems like this could either be integrated into ci-cpu or as a separate ci- image, so long as the binaries are publicly available. do you have an estimate of the size of the docker image? also, just for my curiosity, would you be able to share a rough timeline of when you’d like to land this?

Thanks for your comments:)

Perhaps I could use a made-up Conv2d example to describe it:

fn (%arg0: Tensor[(1, 3, 224, 224), int8], %nn.conv2d_arg: Tensor[(32, 3, 7, 7), int8]) {
  %conv_fn = fn (%data: Tensor[(1, 3, 224, 224), int8], %weight: Tensor[(32, 3, 7, 7), int8], Primitive=1) {
    nn.conv2d(%data, %weight, padding=[1, 1, 1, 1],  kernel_size=[7, 7], out_dtype="int32")
  };
  %conv_fn(%arg0, %nn.conv2d_arg)
}

and the corresponding PrimFunc for the primitive call %conv_fn would look like

@T.prim_func
def main(x: T.Buffer[...], weight: T.Buffer[(32, 3, 7, 7), "int8"], y: T.Buffer[...]) -> None:
     # body

Assume that, to utilize the specific hardware, we want to arrange the I/O channels into 4*4 tiles. There are two extra notes:

  • We do not know the “best” weight layout until TIR scheduling/tuning is done.
  • The required layout is out of scope of common representations like “OIHW”, “OHWI”, etc.

The TIR schedule part would apply the following transformations to the weight:

o, i, h, w = s.get_read_buffer_axes(conv_block)
o_outer, o_inner = s.buffer_split(o, factor=4)  # [32, 3, 7, 7] -> [8, 4, 3, 7, 7]
i_outer, i_inner = s.buffer_split(i, factor=4)  # [8, 4, 3, 7, 7] -> [8, 4, 1, 4, 7, 7]
s.buffer_reorder(o_outer, i_outer, o_inner, i_inner, h, w)  #  [8, 4, 1, 4, 7, 7] -> [8, 1, 4, 4, 7, 7]

Above we use a set of extended TensorIR primitives, but they can simply be seen as sugar over the ongoing schedule primitive transform_layout.

The point is that they are not arbitrary index remappings (compared to a general transform_layout). We ensure that every such schedule step has an exactly equivalent relay transformation.

In the TIR schedule phase, we trace every buffer layout change on the function param buffers (we can do that since we implement them), generate the transform (and reverse transform) in relay at each step, and finally compose them into single layout transform (and reverse transform) functions in relay.

For the used example, it would be:

  • s.buffer_split(o, factor=4)

    • x → relay.reshape(x, [-1, 4, 3, 7, 7])
    • (reverse) x → relay.reshape(x, [32, 3, 7, 7])
  • s.buffer_split(i, factor=4)

    • x → relay.reshape(relay.nn.pad(x, […, (0, 1), …]), [8, 4, -1, 4, 7, 7])
    • (reverse) x → relay.strided_slice(relay.reshape(x, [8, 4, 4, 7, 7]), begin=…, end=…)
  • s.buffer_reorder(...)

    • x → relay.transpose(x, […])
    • (reverse) x → relay.transpose(x, […])
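
The tracing idea above can be sketched in plain Python. This is only an illustration under assumptions: the `LayoutTrace` class and its methods are hypothetical, axes are addressed by position rather than by the axis handles used in the schedule, and the recorded strings are descriptions of Relay ops, not real Relay calls. Each schedule step appends its forward ops and prepends its reverse ops, so the composed reverse function undoes the latest step first.

```python
from typing import List


class LayoutTrace:
    """Hypothetical tracer: records Relay-style op descriptions per schedule step."""

    def __init__(self, shape: List[int]) -> None:
        self.shape = list(shape)
        self.forward: List[str] = []   # applied to the original tensor, in order
        self.reverse: List[str] = []   # applied to the transformed tensor, in order

    def buffer_split(self, axis: int, factor: int) -> None:
        old = list(self.shape)
        extent = old[axis]
        outer = -(-extent // factor)               # ceiling division
        padded = outer * factor
        new = old[:axis] + [outer, factor] + old[axis + 1:]
        merged = old[:axis] + [padded] + old[axis + 1:]
        step_fwd: List[str] = []
        step_rev = [f"reshape({merged})"]
        if padded != extent:                       # imperfect split: pad, slice on the way back
            step_fwd.append(f"pad(axis={axis}, after={padded - extent})")
            step_rev.append(f"strided_slice(axis={axis}, end={extent})")
        step_fwd.append(f"reshape({new})")
        self._record(step_fwd, step_rev, new)

    def buffer_reorder(self, order: List[int]) -> None:
        inverse = [order.index(i) for i in range(len(order))]
        new = [self.shape[i] for i in order]
        self._record([f"transpose({order})"], [f"transpose({inverse})"], new)

    def _record(self, step_fwd: List[str], step_rev: List[str], new_shape: List[int]) -> None:
        self.forward += step_fwd
        self.reverse = step_rev + self.reverse     # undo the latest step first
        self.shape = new_shape


trace = LayoutTrace([32, 3, 7, 7])
trace.buffer_split(axis=0, factor=4)       # -> [8, 4, 3, 7, 7]
trace.buffer_split(axis=2, factor=4)       # -> [8, 4, 1, 4, 7, 7], pads 3 up to 4
trace.buffer_reorder([0, 2, 1, 3, 4, 5])   # -> [8, 1, 4, 4, 7, 7]
```

After these three steps, `trace.forward` lists reshape, pad, reshape, transpose, and `trace.reverse` lists transpose, reshape, strided_slice, reshape.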

Finally, all transforms (and reverse transforms) are composed into two relay.Function objects that rewrite the relay-level layouts. Each accepts the original relay params and returns the updated params tuple:

fn (%p0: Tensor[..., int8], %p1: Tensor[(32, 3, 7, 7), int8]) {
  %0 = reshape(%p1, newshape=[...]);
  %1 = nn.pad(%0, pad_width=[...]);
  %2 = reshape(%1, newshape=[...]);
  %3 = transpose(%2, axes=[...]);
  (%p0, %3)
}

and the reverse direction is:

fn (%p0: Tensor[..., int8], %p1: Tensor[(8, 4, 1, 4, 7, 7), int8]) {
  %0 = transpose(%p1, axes=[...]);
  %1 = reshape(%0, newshape=[...]);
  %2 = strided_slice(%1, begin=[...], end=[...], strides=[...]);
  %3 = reshape(%2, newshape=[32, 3, 7, 7]);
  (%p0, %3)
}
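
As a sanity check of the two functions above, the same sequence of operations can be written with NumPy. This is only a sketch: the concrete shapes, pad widths, and transpose axes are assumptions filled in from the running [32, 3, 7, 7] example (the Relay snippets elide them), and the function names `forward` and `reverse` are hypothetical.

```python
import numpy as np


def forward(weight: np.ndarray) -> np.ndarray:
    """[32, 3, 7, 7] -> [8, 1, 4, 4, 7, 7] via reshape, pad, reshape, transpose."""
    x = weight.reshape(-1, 4, 3, 7, 7)                        # split O by 4
    x = np.pad(x, [(0, 0), (0, 0), (0, 1), (0, 0), (0, 0)])   # zero-pad I from 3 to 4
    x = x.reshape(8, 4, -1, 4, 7, 7)                          # split I by 4
    return x.transpose(0, 2, 1, 3, 4, 5)                      # o_outer, i_outer, o_inner, i_inner, h, w


def reverse(packed: np.ndarray) -> np.ndarray:
    """[8, 1, 4, 4, 7, 7] -> [32, 3, 7, 7], undoing each forward step in reverse order."""
    x = packed.transpose(0, 2, 1, 3, 4, 5)   # back to o_outer, o_inner, i_outer, i_inner, h, w
    x = x.reshape(8, 4, 4, 7, 7)             # merge the split I axes
    x = x[:, :, :3]                          # drop the padded input channel
    return x.reshape(32, 3, 7, 7)            # merge the split O axes


rng = np.random.default_rng(0)
weight = rng.integers(-128, 128, size=(32, 3, 7, 7), dtype=np.int8)
packed = forward(weight)
restored = reverse(packed)
```

The round trip recovers the original weight exactly, which is the property that lets the reverse transform keep the primitive function body valid.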

A relay pass can now perform a “pre”-schedule for each primitive function, fetch the layout transform functions from the schedule result, and perform the relay-level layout update. Finally, an extra FoldConstant pass can typically eliminate all the extra transformations outside the primitive calls.

fn (%arg0: Tensor[(1, 3, 224, 224), int8], %nn.conv2d_arg: Tensor[(32, 3, 7, 7), int8]) {
  %0 = reshape(%nn.conv2d_arg, newshape=[...]);
  %1 = nn.pad(%0, pad_width=[...]);
  %2 = reshape(%1, newshape=[...]);
  %3 = transpose(%2, axes=[...]);
  %conv_fn = fn (%data: Tensor[(1, 3, 224, 224), int8], %weight: Tensor[(8, 4, 1, 4, 7, 7), int8], Primitive=1, DevicePrimFuncKey=873487) {
   %4 = transpose(%weight, axes=[...]);
   %5 = reshape(%4, newshape=[...]);
   %6 = strided_slice(%5, begin=[...], end=[...], strides=[...]);
   %7 = reshape(%6, newshape=[32, 3, 7, 7]); 
   nn.conv2d(%data, %7, padding=[1, 1, 1, 1], kernel_size=[7, 7], out_dtype="int32");
  };
  %conv_fn(%arg0, %3)
}

The actual params are transformed before the call into %conv_fn, and the formal params are reversed within %conv_fn's body. We need the reverse transforms because we currently cannot represent a “lowered” function call in relay (correct me if I am wrong). It is a workaround that keeps the primitive function body valid, that is, the relay module after the pass can still be safely evaluated on a CPU.

Everything described here targets only weights (free tensors) for now. We check that a tensor produced/consumed by other relay calls does not get transformed. For input and output layouts, we find that relay ConvertLayout covers the current demands. However, I think there is no essential difference between “applicable functions to transform layout” and a simple tag like “NCHW” on an input/output, so it is possible to rewrite the input/output with the same mechanism.

One remaining issue is that we have to hack the CompileEngine (now TECompiler) to cache and reuse the previously scheduled PrimFuncs. We would be very glad to know if existing mechanisms (like relay_to_tir?) can help us :) cc @areusch

If a separate image is possible (it can be based on ci-cpu), we would prefer it, since future upgrades would then not disturb ci-cpu usage. The incremental file sizes would be as follows:

  • LLVM: the x86 + device target build is about 140M; some libLLVM* libraries may be unused
  • Other toolchains: the full toolchain binaries will occupy 500M; simulator-only binaries can be kept under 100M
  • Test models (like torch resnet50): if they are available in the general CI environments, they can be reused

I am sorry that I cannot give an exact landing timeline now :( . We are working to make the open-source branch ready (along with dependencies) in the first quarter, but it depends on the progress of the legal review of open-source issues.

Hi @wrongtest, thanks for the nice write up.

Currently, we have to hack the compile engine to find the pre-scheduled PrimFunc from a standalone cache, we are glad to know what is the best way to achieve this goal.

Here are some thoughts; please correct any misunderstandings I might have.

Yes, the relay_to_tir pass that @Mousius added a few months back could work, since it seems you want to completely take over scheduling. You can use annotations to convey layout hints from your analysis pass to lowering (or just do it on-the-fly in your hook, you’re probably doing a global analysis though?) You’d have to implement caching yourself, but when Chris and I were trying to decide if caching was something worth building into the relay_to_tir machinery the consensus was it was straightforward to just implement it directly in each hook function. So that gives you both full control over the conversion to TIR and full control over the rewritten call_lowered you leave behind. I think everyone would be happy to extend that if you find it lacking.

We’ve also been mulling over another approach to incremental layout optimization, though it’s by no means ready to use out-of-the-box (but maybe it sparks your interest?). We can now invoke lowering multiple times, and with a bit more work we could even restrict lowering to trigger on only particular ‘focus’ primitive functions. I.e. we don’t have to lower all at once. We’ve also done some legwork to allow virtual device annotations to flow both into and out of already lowered PrimFuncs, albeit currently only for memory scope and not layout. But putting those together, we could imagine:

  • allow layout constraints to appear in VirtualDevices, just as we now do for memory/storage scope.
  • choose a subset of ‘critical’ primitives (maybe just one) to lower, and give lowering free choices to choose the best layout. Capture that choice in the PrimFunc using VirtualDevices on the arguments.
  • re-run device planning to flow the new layout constraints to yet-to-be-lowered primitives. Where layouts have a hard disagreement insert the necessary layout x-forms as per the bijections you describe.
  • re-run lowering on the next set of ‘critical’ primitives, this time respecting any layout constraints already imposed on the arguments, but as before any still unconstrained arguments can have their layout chosen during lowering.
  • repeat until all primitives lowered.
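
The loop described in the list above can be sketched as a toy in plain Python. Nothing here is TVM API: the function `plan_layouts`, its arguments, and the layout strings are hypothetical, and a real implementation would flow constraints through VirtualDevices and device planning rather than a dictionary. The sketch lowers primitives in priority order, lets each one pick layouts for still-unconstrained tensors, and records a layout transform wherever an already-imposed layout is a hard disagreement.

```python
from typing import Dict, List, Tuple

Transform = Tuple[str, str, str, str]  # (primitive, tensor, from_layout, to_layout)


def plan_layouts(
    order: List[str],                       # primitives, most critical first
    args: Dict[str, List[str]],             # primitive -> tensors it reads/writes
    preferred: Dict[str, Dict[str, str]],   # primitive -> tensor -> preferred layout
    flexible: Dict[str, bool],              # can the primitive adapt to an imposed layout?
) -> Tuple[Dict[str, str], List[Transform]]:
    chosen: Dict[str, str] = {}             # layout constraints captured so far
    transforms: List[Transform] = []
    for prim in order:                      # one lowering round per primitive
        for tensor in args[prim]:
            want = preferred[prim][tensor]
            if tensor not in chosen:
                chosen[tensor] = want       # free choice, captured as a constraint
            elif chosen[tensor] != want and not flexible[prim]:
                # Hard disagreement: a layout transform is needed before this primitive.
                transforms.append((prim, tensor, chosen[tensor], want))
    return chosen, transforms


chosen, transforms = plan_layouts(
    order=["conv1", "conv2", "add"],
    args={"conv1": ["t0", "t1"], "conv2": ["t1", "t2"], "add": ["t2", "t3"]},
    preferred={
        "conv1": {"t0": "NCHW4c", "t1": "NCHW4c"},
        "conv2": {"t1": "NCHW8c", "t2": "NCHW8c"},
        "add": {"t2": "NCHW", "t3": "NCHW"},
    },
    flexible={"conv1": False, "conv2": False, "add": True},
)
```

In this made-up example, conv2 disagrees with the layout that conv1 already imposed on t1, so one transform is recorded, while the flexible add simply accepts the layout imposed on t2.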

Would be happy to talk more about that if you see a connection.

@mbs-octoml Hi~ Many thanks for your reply! Here are several questions from me:

  1. What does call_lowered mean? Does it mean we can put PrimFuncs and relay functions into the same IRModule and make calls to each other now?

  2. As for VirtualDevice, would it be the interface that keeps all the information we require across the relay-tir boundary? Is my understanding right? This would be a closed set (including device, mem scope, etc) or allow thirdparty extensions?

  3. Just out of my curiosity, what is the difference between ongoing Relax and current mechanism described?

Hi again,

What does call_lowered mean?

Yes, it’s just the new convention for representing a call to a PrimFunc or extern inside a Relay Expr. Technically you can insert that into the original Relay program (eg see the unit test https://github.com/apache/tvm/blob/f3661f552be5772efdb451a66df23f150ab93d8a/tests/python/relay/test_pass_plan_devices.py#L1656) but more commonly you could perform your own lowering and leave behind @call_lowered to signal to the ‘default’ lowering machinery that there’s nothing more to be done for those calls.

This would be a closed set (including device, mem scope, etc) or allow thirdparty extensions?

Right now it’s a closed set, and though it’s not too hard to extend VirtualDevice to support new fields it’s also not just a matter of adding a field. I’d honestly never thought about that use case, is it worth thinking of some motivating examples?

Just out of my curiosity, what is the difference between ongoing Relax and current mechanism described

@call_lowered is a back-port of Relax’s @call_tir to main so as to clean up te_compiler.cc and downstream passes, which can now operate both before and after lowering. Relax puts more emphasis on the ergonomics of writing Relay programs which mix-and-match Relay Functions and TIR PrimFuncs.

Hope that helps.

@wrongtest @mbs-octoml I think relay pre-sch is a better way to go. The lowering gap between relay and tir can be reduced this way. We can also pre-sch for memory promotion (cache read/write) and plan the memory scope and size for relay layers globally, so we can do memory stitching for the whole model.

Passing hints to tir works, but it puts more burden on operators. We may have similar logical lowering or schedule code, which can be abstracted away in relay pre-sch passes. Writing operators should be as simple as possible.

Yes, I agree that it is definitely more comfortable to make local lowering decisions and coordinate them at a global (e.g., relay) level.

However, when there are conflicts between these local decisions, things become non-trivial. I think that is the context of what @mbs-octoml talked about, where multi-round, message-passing-style updates are suggested to reach a valid global decision.

So, at least for our workloads, as long as we have separate levels of decisions (relay and tir), I think passing hints/constraints to tir is unavoidable. For example, we cannot change the input/output layout arbitrarily unless it is compatible with the producer/consumer ops.