[μTVM] Deployment on GAP8 RISC-V platform

Hello everyone,

For my master's thesis I'm doing research on deep learning compiler toolchains, so that future deep learning accelerators can be utilized to the fullest and easily integrated into e.g. IoT and ultra-low-power applications.

I'm currently looking into how easy it is to link big compiler toolchains (like TVM) to an accelerator platform that is being developed at my university. Right now, only some very optimized yet very unportable solutions exist in this space, so coupling TVM to this platform would be highly beneficial.

Part of the platform being developed will be similar to Greenwaves Technologies' GAP8 RISC-V platform. That's why I'm currently trying to deploy uTVM on a GAPuino development board, to see what's already possible with uTVM. Like the platform being developed, the GAP8 is quite constrained in memory and does not allow for a big OS to be loaded. So, as it is a true bare-metal device, uTVM seems like a great fit.

However, right now I'm struggling with where to start for this deployment on the GAP8. I've read the blog post; I have OpenOCD and an adapted GCC compiler in place for the platform. The problem is that I now need to do parts 3 and 4 described in the blog post:

  • a specification containing the device’s memory layout and general architectural characteristics
  • a code snippet that prepares the device for function execution

I've checked out the tutorial video and the code from the blog post, but it seems that here and there a few things have shifted in the code since then, as I was unable to load the tvm.micro.device module. This is also reported in this forum post. EDIT: I also just found this forum post. Also, the tutorials in the docs concerning uTVM don't exactly mention how to do steps 3 and 4 of the blog post.

So I was wondering if I could get some help on deploying uTVM on the GAPuino board. Is there a guide for deploying on new platforms somewhere that I have missed? Or has this deployment on a GAP8 perhaps already been done? If anyone has pointers, examples, or experience with this, that would be great!

Thank you very much!


Hi @JosseVanDelm! Thanks for your post. This is indeed a very interesting direction to take TVM/microTVM.

Since we made the blog post, we've been working to improve µTVM portability and have made significant changes to the way µTVM launches code; see the µTVM roadmap. We are just about finished with that.

I'm actually currently working on syncing the microtvm-blogpost-eval repo up to work with main. I wish I could give you simple instructions, but the changes are fairly complex. I should have that finished this week, though some of the PRs needed are not yet merged into TVM.

To get the GAPuino working, you'd need to be able to compile the µTVM RPC server and run it using a UART. We have code that uses Zephyr to do this on a variety of targets (including some RISC-V, though not yet tested), but I don't know whether Zephyr supports the GAPuino. You could take a look at how we did the Zephyr integration.

We don’t yet have good documentation for porting µTVM to new platforms. This is another thing I’m working on and hope to address soon. I’d point you at two things that show the new flow to see if you can get started:

  1. zephyr_test, which shows a minimal example of testing a single µTVM function on a Zephyr board
  2. test_crt, which exercises just the RPC server, using stdio as a UART replacement.

Apologies if we are a bit light on documentation. Feel free to ask more questions and I'll try to answer as best I can, and I'll let you know when the documentation is improved (in the next few weeks, I think).

Andrew


I'd also point you at the work from NTHU to enable RISC-V P-extension support. It's not merged yet, but it may be a useful reference as well.


Hi @areusch,

Thanks for your elaborate reply. I'm a bit overwhelmed by all the changes, actually. I had bumped into the roadmap you mentioned, but I found it fairly difficult to comprehend with my limited background, and I did not expect uTVM to have changed so radically in so few months' time already :sweat_smile:.

Having gone through the roadmap I have the following questions:

  • Do I understand correctly that you are trying to replace the simple OpenOCD interface that needs the read/write/execute functionality with a C runtime and a minimal RPC server that connects through UART? Could you maybe elaborate on the changes there? Why is this necessary? I suppose to benefit even more from what is already realized in the rest of the TVM stack?
  • Zephyr does not support GAP8. To be honest, I'm not sure what such a low-level OS actually provides. Do I need an OS with the current changes? Would this facilitate deployment? I've seen mBED OS being mentioned on both the uTVM and GAP8 sides. Could this be an interesting approach?
  • With the current proposed changes, isn't the overhead of running TVM on the device much higher than previously? How do I know an RPC server, a runtime, and an OS leave enough headroom for deploying useful neural networks on the device?
  • Yesterday I tried to go through the Zephyr demo with a debugger, but the dependencies of the test were quite big and difficult to install on my machine. Do you have a proposed debugging strategy? Maybe it's easiest if I run it inside the CI Docker container? Or would that be difficult? Sadly I have no experience with this myself.

I'm sorry if I sound a bit sceptical. It's just that, having read the earlier blog post, I thought that I could integrate uTVM in a couple of days and then automatically benefit from all the rest of the compiler stack. I have an intermediate presentation due in a couple of weeks and, frankly, I'm not so sure anymore whether it's worthwhile spending a lot of time on getting this to work for my thesis. Maybe you have an idea of how much work/time it would take me :sweat_smile:?

Thank you very much for the great help you are providing me here! I really do appreciate the work that everybody (and especially you) is doing, so keep up the good work! I'm very curious to see where TVM, and especially microTVM, is going!

Hi @JosseVanDelm,

I agree there have been quite a few changes since the last blog post. We’ll give an updated overview at TVMconf in a couple of weeks’ time.

Do you need to run autotuning to start with, or just run inference? If the latter, you definitely don't need to bother with an OS at all: I would just try to build with the c target and link the generated code and graph runtime into a binary for your platform. You could follow the build steps in test_crt.py and then export the generated code with mod.export_library() to produce a C file you can compile for your target.
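To make that concrete, here is a rough sketch of that flow in Python; `relay_mod` and `params` are placeholders for your imported model, and the exact calls may differ slightly between TVM versions (test_crt.py is the authoritative reference):

```python
import tvm
from tvm import relay

# Build for the C runtime: the "c" target makes TVM emit C source instead of
# machine code, and -runtime=c/-system-lib match the bare-metal graph runtime.
TARGET = tvm.target.Target("c -runtime=c -system-lib=1")

with tvm.transform.PassContext(opt_level=3, config={"tir.disable_vectorize": True}):
    factory = relay.build(relay_mod, target=TARGET, params=params)

# Exporting to a .tar packages the generated C sources, which you can then
# compile for GAP8 together with the C runtime sources under src/runtime/crt.
factory.export_library("generated_model.tar")
```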

From a time perspective: how practical is it to set up a UART or semihosting connection on your development board? The µTVM code is a bit new right now, so while we don't want efforts like this to take long, we don't have the documentation sorted out just yet. Happy to answer questions if you want to pursue this path.

I've included some more detailed answers to your questions below.

Andrew

  • Do I understand correctly that you are trying to replace the simple OpenOCD interface that needs the read/write/execute functionality with a C runtime and a minimal RPC server that connects through UART? Could you maybe elaborate on the changes there? Why is this necessary? I suppose to benefit even more from what is already realized in the rest of the TVM stack?

The main driver behind these changes is actually portability for autotuning. None of these changes affect the deployment requirements: µTVM does not assume the presence of an operating system, and the runtime it requires is more a set of support functions for e.g. memory allocation and error reporting (the TVMPlatform functions are the chip-dependent ones).

However, autotuning assumes that the target environment performs the same between runs, and on a bare-metal platform, the only reasonable way to do this is to fully control the set of instructions that execute between SoC reset and model execution. A major limitation of the previous approach was that you’d get different absolute timing numbers depending on which program was loaded in flash.

So, to allow reproducible autotuning in a way that's friendly to first-time users, we needed to choose a portable approach. This is why we've introduced the RPC server and Zephyr support. It should be noted that we aren't requiring you to use Zephyr: we want to make it possible to easily build the RPC server into whichever runtime environment you choose; in that case, you just need to provide an implementation of the Compiler and Flasher classes.
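For a sense of what that involves, here is a hypothetical skeleton only; the method names are illustrative rather than the exact interface, so check the base classes under python/tvm/micro and the Zephyr contrib code for the real signatures:

```python
import tvm.micro

class Gap8Compiler(tvm.micro.Compiler):
    """Hypothetical sketch: drive GreenWaves' GCC to build µTVM code for GAP8."""

    def library(self, output, sources, options=None):
        # Compile the generated C sources into a library with the GAP8
        # cross-compiler (riscv32 GCC plus the GAP8-specific flags).
        ...

    def binary(self, output, objects, options=None):
        # Link the libraries plus a small entry point that starts the RPC
        # server over UART (autotuning) or the graph runtime (inference).
        ...

class Gap8Flasher(tvm.micro.Flasher):
    """Hypothetical sketch: flash via OpenOCD and hand back a UART transport."""

    def flash(self, micro_binary):
        ...
```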

  • Zephyr does not support GAP8. To be honest, I'm not sure what such a low-level OS actually provides. Do I need an OS with the current changes? Would this facilitate deployment? I've seen mBED OS being mentioned on both the uTVM and GAP8 sides. Could this be an interesting approach?

You don't need an OS, strictly speaking; you just need a small main() that can configure the SoC and launch the RPC server (for autotuning) or the graph runtime (for inference). You'll link different µTVM libraries into each binary (i.e. you'll also link the RPC server library when autotuning). I have an mBED implementation of Compiler and Flasher here you could try, though it needs to be synced to main. This could be a good route for you if mBED is well supported on that board. Or, if it's easier for you to write UART send/receive functions, you could just do without an OS.

  • With the current proposed changes, isn't the overhead of running TVM on the device much higher than previously? How do I know an RPC server, a runtime, and an OS leave enough headroom for deploying useful neural networks on the device?

There is an increase in the code overhead and a small increase in memory consumption for autotuning specifically. For deployment, the RPC server isn’t needed, and the OS would be whatever your project needs (if any), so we don’t see a large overhead there. For autotuning, you are typically loading just one operator at a time, so we think the impact should be limited.

  • Yesterday I tried to go through the Zephyr demo with a debugger, but the dependencies of the test were quite big and difficult to install on my machine. Do you have a proposed debugging strategy? Maybe it's easiest if I run it inside the CI Docker container? Or would that be difficult? Sadly I have no experience with this myself.

We have a “Reference VM” that we just need to build and upload, and a tutorial that should be published but is just missing a Sphinx directive to stick it into the correct place in the doc tree. The VM contains all of the Zephyr deps you need, and is a bit better suited for this than Docker, since USB forwarding with Docker only works with libusb devices. You can try to build these boxes yourself using apps/microtvm/reference-vm/base-box-tool.py if you don't want to wait on me to upload them.

Since this topic is from about a year ago, I'm still interested in how everything is going. Did you successfully deploy on the GAP8 platform?

Hi @Tonylyc,

Welcome to the forums and thanks for your question. I should clarify the efforts in my thesis a bit, because they have moved quite a bit away from deploying on this development board. We are using a single core from the GAP8, paired with a custom accelerator, to deploy NNs on, as stated in this post: Feedback on TVM port to custom accelerator

I have no experience with enabling multicore on the GAP8, but I'd think the best thing you can do there is to reuse some of the arm_cpu code from the regular TVM flow (no BYOC). I also don't know how the multiple cores are addressed by the GAP8 SDK, or how well that integrates with TVM.

IIRC, some people at the University of Bologna were working on a library for efficiently executing (Q)NN layers on PULP-like architectures (the main authors of https://arxiv.org/pdf/2008.07127.pdf). Maybe the easiest way to get started with GAP8 is to use BYOC and integrate that library?
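To illustrate what I mean, here is a minimal sketch of just the BYOC annotation side, assuming a hypothetical external codegen named "pulp_nn" that wraps such a library; the actual codegen and runtime glue would still have to be written:

```python
import tvm
from tvm import relay

# Mark which operators the external library can handle.
@tvm.ir.register_op_attr("nn.conv2d", "target.pulp_nn")
def _pulp_nn_conv2d(expr):
    # A real integration would check dtypes, layouts, quantization, etc. here.
    return True

def partition_for_pulp_nn(mod):
    """Carve out supported regions so they are handed to the external codegen."""
    seq = tvm.transform.Sequential([
        relay.transform.AnnotateTarget(["pulp_nn"]),
        relay.transform.MergeCompilerRegions(),
        relay.transform.PartitionGraph(),
    ])
    return seq(mod)
```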

You should know that a lot has changed in microTVM over the past year. A lot of those changes will be discussed at the upcoming TVMcon (https://www.tvmcon.org/); I'd suggest you look there if you'd like an update.

Anyway, I’m currently not looking into deployment specifically on GAP8 anymore, but do let me know if you have any further questions!

@JosseVanDelm Thank you so much for replying to me!

I'm a senior student and I got this thesis topic from my mentor. I'm new to TVM and GAP8, so please don't mind if the questions I ask are too naive. Also, as you can see, my English is not so fluent, so please pardon me if something I write confuses you. :pray:

I did some research and I learned that the DORY tool mentioned in this paper (https://arxiv.org/pdf/2008.07127.pdf) needs to be fed a kind of IR generated by NEMO (another tool mentioned in the DORY project). I also found GAP-Flow, produced by the GreenWaves company.

It seems like if I want to use the GAP-Flow or NEMO-DORY toolchain to compile and deploy my model, I need to build a pass which converts Relay or TIR into GAP-Flow's or NEMO-DORY's input format. I read the BYOC documentation but I can't fully understand it :cry:, and I'm scared that I can't finish my thesis.

I just wonder whether it is possible for me to build a pass that converts TIR into GAP-Flow's input format, or maybe just into native C code, so that I can use the AutoTiler tools in GAP-Flow, or just the GCC extended by GreenWaves, to compile and deploy my model on GAP8 hardware like the AI-deck. I know it may be hard for you to give me a definite answer; I'm still very grateful for your reply above. I think the most feasible way for me to deploy my model on GAP8 is to write a pass which converts TIR into native C code, then use GCC to deploy it. My goal is modest: if I can deploy 1~2 common models, the thesis will be finished, I think.

I'm so glad that someone has had, or has, the same difficulties that I'm facing. Any advice you can give me would be super nice, but it's still fine if not. Thank you again for listening and for your help!

Hi @Tonylyc,

I should state that I have barely used the GAP8, and I've never used DORY or GAP-Flow. It's not clear to me what your goal is. Do you just want to run the models on GAP8? Do you want them to use the full multicore and ISA extensions? Do you want to use autotuning? What is the input format of your models?

If you're not planning on using autotuning, you can use the C runtime to convert TIR to C code. It will create a library file with the generated functions, and a main file which will call those functions. Look at this tutorial on using the C runtime, e.g. specify the target as c:

TARGET = tvm.target.target.Target('c -keys=arm_cpu -mcpu=cortex-m7 -link-params -model=stm32f746xx -runtime=c -system-lib=1')

You can hack the generated non-tuned kernels in your C code if you’d like.
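Roughly, these are the artifacts that flow gives you to work with. A sketch using the TARGET defined above, with `relay_mod` and `params` as placeholders; since that target uses -link-params, saving the parameter blob separately is optional:

```python
import tvm
from tvm import relay

with tvm.transform.PassContext(opt_level=3, config={"tir.disable_vectorize": True}):
    factory = relay.build(relay_mod, target=TARGET, params=params)

factory.get_lib().export_library("kernels.tar")    # generated .c kernel sources
with open("graph.json", "w") as f:
    f.write(factory.get_graph_json())               # graph for the C graph runtime
with open("params.bin", "wb") as f:                 # only needed without -link-params
    f.write(tvm.runtime.save_param_dict(factory.get_params()))
```

You would then compile the kernels, the C runtime sources, and your own main() with GreenWaves' GCC.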

I guess your problem is two-fold, but my information might be outdated (@areusch, is this information still correct?):

  • On the one hand, the GAP8 has no support for Zephyr, which does a lot of the setup for you if you need to set up communication with the board etc. (only required for autotuning). I'm not sure what effort this entails if your board does not support Zephyr or mBED, as I have no experience with this, but as Andrew states above, it's not impossible. If you don't use autotuning, the C runtime should fit your needs just fine, and you can just use GCC as you described.
  • On the other hand, there seems to be a bit of a problem in the research community in general with quantization frameworks not having a common format to describe models. I know that TVM supports PyTorch (like NEMO) and TFLite (like GAP-Flow). I've seen some quantization efforts from @electriclilies on this forum before, but I'm not sure what the status of quantization in TVM is right now, or what input formats it is able to consume.

Happy to help. Let me know if something's not clear!

Best regards and good luck with your thesis!

Hi @Tonylyc ,

It would be great to get some more context about what you want to do, and we may be able to point you in the right direction. microTVM doesn't currently support parallel execution, but this is an interesting research direction and one we want to take the project in next year.

Regarding the information @JosseVanDelm posted:

You might look at the Project API, which I've just now documented on our site. This allows you to tell TVM how to build/flash/time code on your platform.
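If your platform ends up with a Project API template (a project directory containing a microtvm_api_server.py), driving it from Python looks roughly like this; the template path and options below are placeholders, and `factory` is the module returned by relay.build():

```python
import tvm.micro

project = tvm.micro.generate_project(
    "/path/to/gap8_project_template",  # hypothetical GAP8 template directory
    factory,                           # module returned by relay.build()
    "/tmp/gap8_project",               # where the generated project is written
    {},                                # options are defined by the template itself
)
project.build()   # runs the template's build step (e.g. your GAP8 Makefile)
project.flash()   # runs the template's flash step (e.g. OpenOCD)
```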

On this point, I'm also not sure if there's been much progress on quantization internal to TVM. However, I do know that for Ethos-U there was a need to attach quantization parameters to their operators and pass those down through the compilation flow. They implemented a BYOC-style pattern matcher for this. You might take a look at that if you're looking for an approach there.