[µTVM] Static Runtime Code Generator

Hi,

With this post, I want to share our tool and gather some feedback. It is about the deployment of TinyML models on tiny microcontrollers.

Link: tum-ei-eda/utvm_staticrt_codegen on GitHub. The project contains a code generator that produces a replacement for other µTVM runtimes: it statically executes a compiled model, which reduces the overhead of having such a runtime in terms of both code size and execution time.

Motivation

For tiny microcontrollers, we did not find a backend for TVM that provides a suitable deployment method. The closest candidate we found is the standalone C runtime, but it still spends a lot of computation time and code size on JSON parsing and requires dynamic memory allocation.

As far as I understand, the long-term goal for solving this issue is to develop an ahead-of-time (AOT) compiler that, at its core, does essentially the same as our tool but is more tightly integrated into the whole TVM flow. Until it is ready, we believe our solution can serve as a substitute.

Description

The tool requires the outputs of the relay.build command (graph.json, params.bin, kernels.c) and generates a C source file that can statically execute the model without any additional TVM runtime. The final deployable code therefore consists of just the optimized kernels.c from TVM, the generated calling code that executes these kernels in the correct order, and some top-level code to use the model. This makes the deployment very efficient in terms of computation time and memory usage.
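
To illustrate the idea, here is a minimal sketch of what such generated calling code could look like. All kernel names, signatures, and buffer sizes below are illustrative placeholders, not the tool's actual output:

/* Minimal sketch of statically generated calling code. Kernel names,
 * simplified signatures, and buffer sizes are illustrative placeholders,
 * not the actual output of the tool. */
#define INPUT_SIZE  64
#define HIDDEN_SIZE 128
#define OUTPUT_SIZE 10

/* All buffers are allocated statically: no heap, no dynamic allocation. */
static float g_input[INPUT_SIZE];   /* filled by the application */
static float g_hidden[HIDDEN_SIZE]; /* intermediate activation buffer */
static float g_output[OUTPUT_SIZE]; /* read back by the application */

/* Kernels compiled by TVM into kernels.c (signatures simplified here). */
extern void fused_dense_relu(const float *in, float *out);
extern void fused_dense(const float *in, float *out);

/* Execute the kernels in the order given by graph.json: no interpreter,
 * no JSON parsing, just direct calls. */
void model_run(void)
{
    fused_dense_relu(g_input, g_hidden);
    fused_dense(g_hidden, g_output);
}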

Results

The complete flow was deployed and simulated with a few models on RISC-V “rv32gc” with our simulator ETISS.

Variants:

  • TFLMCompiler: Our “TensorFlow Lite for Microcontrollers” flow, which takes a similar approach of generating static inference code that avoids the TFLM interpreter (link limit reached; it is on GitHub: cpetig/tflite_micro_compiler).
  • µTVM: The default µTVM flow, with some added automation for the deployment, see examples/codegen.py.
  • TVMCodeGen: This tool.

We can see that TVM produces much better kernels than TFLite Micro, but for very small models, the runtime overhead hurts quite a bit. With our code generator, this overhead is eliminated, and we get the best numbers we have produced so far.


hi @r.stahl, thanks for posting this up! this is a great tool and we are indeed working to merge similar functionality through our AOT efforts into TVM. we’ve decided to take an approach there that allows us to better leverage whole-graph optimization in the medium to long term, but the runtime-side result will be similar to what you’ve posted here.

do you guys intend to do any additional evaluation leveraging the RV32 P extension?

-andrew

I have also noticed some room for improvement in buffer management that could reduce memory usage further. I’m looking forward to your efforts there and will be watching closely.

We have been looking into the P (and V) extensions for the TFLM flow and are definitely interested in them for the TVM flow as well. However, no efforts have been started yet. I initially had the impression that this would be a fairly simple task in TVM, but it seems there are some issues with µTVM, since the workaround to disable vectorization is still active. The LLVM backend might be a good option here, but I am not sure about the status of P-extension codegen there. If you have some insights or starting points, we’d be happy to know more!

Hi @r.stahl ,

Interesting findings! May I ask what the difference/relationship is here between RAM and stack?

Hi @manupa-arm

for the simulation, a large enough stack (4 kB) was reserved. Through memory tracing, the maximum stack usage was recorded. This number is reported separately and is included in the total RAM figure alongside the other RAM consumers (data sections and the unused heap).

So for example for the sine_model the detailed report looks like this:

=== Results ===
ROM usage:        5.9 kB (0x171c)
  read-only data: 1.5 kB (0x608)
  code:           4.2 kB (0x1084)
  other required: 144 Bytes (0x90)
RAM usage:        2.3 kB (0x924)
  data:           1.1 kB (0x444)
  zero-init data: 132 Bytes (0x84)
  stack:          1.1 kB (0x45c)
  heap:           0 Bytes (0x0)

The full logic can be found here: etiss/get_metrics.py at master · tum-ei-eda/etiss · GitHub
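
For illustration, the core of that stack measurement could be sketched as follows. The actual implementation is the linked Python script; the addresses and function names here are assumptions:

/* Sketch of the watermark idea behind the stack measurement: the simulator
 * traces every data access, and the lowest address touched inside the
 * reserved stack region yields the maximum stack usage. STACK_TOP and all
 * names here are illustrative assumptions, not ETISS internals. */
#include <stdint.h>

#define STACK_TOP  0x20002000u /* stack grows downward from here */
#define STACK_SIZE 0x1000u     /* the reserved 4 kB */

static uint32_t lowest_access = STACK_TOP;

/* Called for every traced memory access during simulation. */
void on_data_access(uint32_t addr)
{
    if (addr >= STACK_TOP - STACK_SIZE && addr < STACK_TOP &&
        addr < lowest_access)
        lowest_access = addr;
}

/* Maximum stack usage: distance from the stack top down to the lowest
 * address that was ever accessed. */
uint32_t max_stack_usage(void)
{
    return STACK_TOP - lowest_access;
}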


@r.stahl You’re right, we do need to address disable_vectorization with the C codegen. Another challenge with using intrinsics in the C codegen is handling C inline assembly in a portable way. The LLVM codegen is a good starting point to avoid these issues, but I believe you need to build TVM against a custom LLVM right now. I think the NTHU post may have some more links to information about that, as well as to P-aware operator schedules, which may be useful.

adding @yrchen here in case he has ideas related to RISC-V P extension

Hi @r.stahl, great work! Just a small clarification question: Is the “standalone C runtime” you mentioned the same as in the tvm/apps/bundle_deploy example (apache/tvm on GitHub)? If not, then how does your tool compare to the bundle_deploy flow in terms of performance and code size?

Thanks, @areusch ! I was not aware of that post.

Hi @vaibhav. Yes, the runtime is the same as in the bundle_deploy example. The bundle_static.c is linked in and wrapped by the generated code from examples/codegen.py.

I had to do some trial-and-error to figure out the required CRT_MEMORY_NUM_PAGES. I picked the first working power of two.
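
For reference, the configuration involved looks roughly like the sketch below. CRT_MEMORY_NUM_PAGES is the macro discussed here; the page-size macro and the pool layout are assumptions based on bundle_static.c and may differ between TVM versions:

/* Sketch of the CRT memory pool sizing. Everything except
 * CRT_MEMORY_NUM_PAGES is an assumption about bundle_static.c and may
 * differ between TVM versions. */
#include <stdint.h>

#define CRT_MEMORY_PAGE_SIZE_LOG2 10 /* 1 kB pages (assumed) */
#define CRT_MEMORY_NUM_PAGES      64 /* first working power of two */

/* The pool the standalone C runtime dynamically allocates from. */
static uint8_t g_crt_memory[CRT_MEMORY_NUM_PAGES *
                            (1 << CRT_MEMORY_PAGE_SIZE_LOG2)];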

Can you please share how you collected the cycle and memory data for microTVM?

Thanks

Hi @aakah18151

currently, ETISS has a simple processor model that counts one cycle per instruction.

I shared details about the memory profiling in my previous reply: [µTVM] Static Runtime Code Generator - #5 by r.stahl

Hi @areusch, this sounds intriguing. Where can I find more information on the current status of the AOT? Is there already an RFC?

haven’t posted one yet; we are doing some prerequisite work right now but will try to post it in the next week or two.