End-to-end (TVM+VTA) flow tutorial with Yolo v3

kevinyuan · July 30, 2019, 2:14pm

Dear tvm community members,

I want to learn the end-to-end flow with Yolo v3, which means not only porting the darknet yolov3 model with tvm/relay, but also compiling the model into VTA micro-op instructions, running the model on VTA RTL simulation with a given image, and finally getting an output image with labeled bounding boxes.

I am aware of this tutorial:
https://docs.tvm.ai/tutorials/frontend/from_darknet.html

But it looks like it stops at TVM and doesn’t cover the flow for VTA.

Would it be possible to complement this tutorial with an end-to-end example flow?

Thanks very much!

Kevin

thierry · July 30, 2019, 5:51pm

Hi Kevin,

Having a YOLOv3 out of the box demo would be nice. It would require a bit of footwork in terms of applying quantization correctly and inserting the right stop_fusion nodes to pattern match the operators that VTA supports (until we have a good pattern matcher). Once we have a set of operators offloadable to VTA, we can run autoTVM scheduling to have optimized inference on a variant of VTA.

If you want to give it a shot to have it run ASAP, I’d be happy to guide you through it.

cbalint13 · July 31, 2019, 1:36am

@kevinyuan, @thierry,

I would like to join you too.

Adding the small variant yolov3-tiny to the scope too.
It’s small enough (e.g. ~100ms on Mali GPU with float16), so it could be more VTA friendly.

Starting from the current tutorial, I can help extend it to use video frames (opencv) + quantize (tvm/int8) + tune (autotvm). I’ll be back with a gist proposal.

But then, I challenge @thierry and @vegaluis to target it on ICE40 (5k) (with only yosys + nextpnr).

Let me prepare a gist with the flow (demo on CPU). I’ll stop at the VTA part, then I’ll be back.

thierry · July 31, 2019, 1:55am

Nice! I think that tiny YOLO will give us a good starting point for real-time object detection.

Let me know how the compilation goes. I expect the quantization pass might break, so just let me know which roadblocks you’re running into! Also list the convolution operators that need tuning; I can spin up some tuning jobs and update TOPHUB accordingly.

kevinyuan · July 31, 2019, 3:44am

@thierry, @cbalint13,

I very much appreciate your quick response and action on this request.

I would like to contribute to this effort as well, but my experience is mostly in FPGA/ASIC front-end design, and I am quite new to Python/TVM/VTA/Chisel.

I think I can try to add some missing functions (e.g. operator offloading) to VTA in Verilog or Chisel, though I will certainly need your guidance on how to define the missing functions and on the detailed design specification that best fits the TVM/VTA architecture.

Please don’t hesitate to let me know what and how I can contribute.

Best regards.

cbalint13 · July 31, 2019, 4:31am

@thierry , @kevinyuan,

Allow 1-2 days to test/elaborate; I have already done it partially and just need to wrap things together. I will ask the quantization folks for a quick look; there are also pending PRs for quantization.

I exaggerated with the ice40 targets, but one day we could conquer it; see e.g. MARLANN. It would also be interesting in the future to have low-bit bitpacked operators for VTA; then it really could fit on a small FPGA.

thierry · July 31, 2019, 7:16am

Working on bitpacked operators would be super interesting on VTA, this is a direction I’m looking into enabling in hardware/software, but a lot of work will need to be done on training/quantization to enable it.
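
For readers unfamiliar with the term: a bitpacked operator stores many 1-bit values in a single machine word and replaces multiply-accumulate with bitwise logic plus a population count. The following is a minimal pure-Python sketch of a binary (+1/-1) dot product. It is not VTA code, and the names `pack_bits` and `binary_dot` are made up for illustration.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into one int (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("values must be +1 or -1")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Equal bits contribute +1 and differing bits contribute -1,
    so dot = n - 2 * popcount(a XOR b).
    """
    mask = (1 << n) - 1
    differing = bin((a_word ^ b_word) & mask).count("1")
    return n - 2 * differing


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
packed = binary_dot(pack_bits(a), pack_bits(b), len(a))
reference = sum(x * y for x, y in zip(a, b))  # plain multiply-accumulate
```

Here `packed` and `reference` agree, which is the property a hardware implementation would rely on: one XOR and one popcount replace `n` multiplications.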

In terms of FPGA coverage, are there low-power FPGAs other than the ice40 that @cbalint13 and @kevinyuan would be interested in providing preliminary support for? I think it might be interesting to see if we can instantiate a VTA design on an FPGA ~10x smaller than it was originally designed for. We could come up with interesting optimizations or re-organizations.

cbalint13 · July 31, 2019, 2:24pm

@thierry ,

Some random thoughts on FPGA targets:

Small FPGAs would be interesting for ultra-low-power applications like TinBiNN, showcased by Lattice, or MARLANN, which does it on ICE40. I am confident that at least state-of-the-art can be achieved in terms of smallness and low power consumption. These smaller devices also have the advantage of being synthesizable end-to-end with open-source tools, so they can become very popular, if they are not already, like the upduino board; even Lattice supports and showcases it as a third-party board. The low-power target field is still poorly covered by the industry; there is a lot of open room.
Also, the Lattice ECP5 (mid-to-low size) series is now supported by the open-source community on boards like TinyFPGA-EX, and if I am not mistaken it is showcased here by companies like XNOR.ai as the industry’s first low-power target for AI applications.
On the high end, standalone FPGAs (with some PCIe) from the Xilinx 7 family are also interesting, especially the affordable ones, e.g. on CrowdSupply. They are large enough to experiment with and will soon be synthesizable with open-source tools too. Such boards can even be built as DIY with ease, with no very special requirements or price tag.
True high-end ones like Ultra+ have become inaccessible for many people; however, those can deliver real state-of-the-art performance (though I am not so sure how they compare to ASIC competitors).

thierry · August 1, 2019, 6:58am

Thanks @cbalint13 for the suggestions. It would be great to have a contributor work on Lattice toolchain support. Recently, TVM reviewer @liangfu added support for Intel (formerly Altera) FPGA SoCs. We could perhaps pick a Lattice FPGA that has microcontroller support. Thoughts?

thierry · August 1, 2019, 6:59am

I also realize that we’ve drifted from the topic of the original thread, so feel free to start a new one.

cbalint13 · August 2, 2019, 8:45pm

@thierry, @kevinyuan,

I prepared an end-to-end demo script (on CPU) here that does the following:

takes yolov3-tiny (can be ‘yolov3’ or ‘yolov2’, but not tested)
imports it into a relay graph
quantizes the net using KL statistics (latest PR #3854)
tunes the resulting network (optional, uncomment L348), with resume support
evaluates the final inference time per single frame
runs the demo on this video in real time on the screen.
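
To make the quantization step in the list above more concrete: calibration picks a clipping threshold per tensor, and values are then mapped onto int8 with a single scale. The sketch below shows only that second part, in pure Python, with the threshold passed in by hand. It does not implement the KL-based threshold search and does not use TVM's API; `quantize_int8` and `dequantize` are illustrative names.

```python
def quantize_int8(values, threshold):
    """Symmetric int8 quantization: map [-threshold, threshold] onto [-127, 127]."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    scale = threshold / 127.0
    quantized = []
    for v in values:
        q = int(round(v / scale))
        quantized.append(max(-127, min(127, q)))  # clip outliers beyond the threshold
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


activations = [0.1, -0.5, 1.0, 3.0, -2.54]
codes, scale = quantize_int8(activations, threshold=2.54)
restored = dequantize(codes, scale)
```

With this threshold the value 3.0 is clipped to the largest code. That is the trade-off calibration tunes: a smaller threshold gives finer resolution for common values at the cost of saturating rare outliers.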

For now it is CPU only, but it can be adapted to VTA (help needed).

Note that frame resizing and the box and other graphic overlays at display time are orders of magnitude more time-consuming than inference itself, but this is meant to be a demo/tutorial after all.
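
One way to check where per-frame time goes is to time each stage of the loop separately. A minimal sketch follows; the stage functions `resize`, `infer`, and `overlay` are hypothetical placeholders, not functions from the demo script, and would be swapped for the real opencv and TVM calls.

```python
import time
from collections import defaultdict


def time_stages(stages, frame, repeats=10):
    """Run each (name, fn) stage in order `repeats` times; return mean seconds per stage."""
    totals = defaultdict(float)
    for _ in range(repeats):
        data = frame
        for name, fn in stages:
            start = time.perf_counter()
            data = fn(data)  # output of one stage feeds the next
            totals[name] += time.perf_counter() - start
    return {name: total / repeats for name, total in totals.items()}


# Placeholder stages standing in for the real resize / inference / overlay code.
def resize(frame):
    return [row[::2] for row in frame[::2]]


def infer(frame):
    return sum(sum(row) for row in frame)


def overlay(result):
    return "boxes drawn for score {}".format(result)


frame = [[1] * 64 for _ in range(64)]
report = time_stages([("resize", resize), ("infer", infer), ("overlay", overlay)], frame)
for name, seconds in report.items():
    print("{:8s} {:.6f} s/frame".format(name, seconds))
```

Averaging over several frames smooths out one-off costs such as the first-call warm-up.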

thierry · August 2, 2019, 9:39pm

Very neat; this will be a great starting point to target VTA. I’ll start to take a look at the operators so we can make sure that we have proper coverage on VTA.

What are you running the demo on?

cbalint13 · August 3, 2019, 12:08pm

@thierry, @kevinyuan,

Updated the script to revision 4 (works better, also tested with ‘yolov3-tiny’, ‘yolov3’ and ‘yolov2’).
Also, for local CPU, there is now a generic tuning file exposed for each layer (no AVX2, which would be much faster).
Except for the video file, all downloads happen automatically in the script, which is useful if we want an end-to-end tutorial.
It is possible to use a camera instead of video; I’ll add a cfg switch for this in the next revision 5 (I’ll be back).

It hits ~100ms inference time on CPU (old IvyBridge); I’m curious how it would do on VTA on various targets (de10, pynq, ultra96).
ATM I don’t have any of the mentioned boards, but I am looking forward to adding support for artix/kintex7 or the smaller ecp5 (cpu-less) with a softcore (e.g. it could be risc-v if it is the only way).

thierry · August 8, 2019, 5:46pm

Thank you @cbalint13; I’d like to try on the pynq and VTA. Will update you when I get something running.

hzhang · October 17, 2019, 3:44pm

I’m running this demo on Mac and I found that libdarknet_mac2.0.so is missing from https://github.com/dmlc/web-data/tree/master/darknet/lib

What I’m trying to do is clone the darknet repo from https://github.com/pjreddie/darknet, build the darknet project on Mac to get libdarknet.so, and rename it to libdarknet_mac2.0.so to see if it works in this tiny yolo v3 demo. If any of you have done this before, or have any suggestions on how to run this demo on Mac, please advise.
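
The file name matters because the library to load is chosen per platform. A minimal sketch of that kind of selection follows. The Mac file name is the one quoted above; the Linux file name `libdarknet2.0.so` and the helper name `darknet_lib_name` are assumptions for illustration.

```python
import sys


def darknet_lib_name(platform=None):
    """Return the prebuilt darknet library file name for a sys.platform string."""
    platform = sys.platform if platform is None else platform
    if platform.startswith("linux"):
        return "libdarknet2.0.so"      # assumed Linux name, not stated in the thread
    if platform == "darwin":
        return "libdarknet_mac2.0.so"  # name quoted in the thread
    raise NotImplementedError("no prebuilt darknet library for " + platform)
```

A locally built `libdarknet.so` renamed to the value returned for `"darwin"` would then be picked up without changing the calling code.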

Really appreciate your help!

hzhang · October 17, 2019, 4:47pm

I ran into this issue (link) and found that downloading libdarknet_mac2.0.so from libdarknet_mac2.0.so solves it.

thierry · October 17, 2019, 7:05pm

Ah yes, that is an issue with the tutorial. Can you fix the download path and submit a fix in a PR?

hzhang · October 18, 2019, 12:28am

Sure. I also found some other Mac compatibility issues in the darknet tutorial code. Let me fix all of them in one PR.

hzhang · October 22, 2019, 3:59pm

Hi thierry, I just created a PR regarding the path fix, but I didn’t find a place to assign the reviewers. Could you help me with this PR? Thanks!

PS. Please ignore the other compatibility issues I mentioned in the above post. These issues are in the tiny yolov3 quantization demo, not in from_darknet.py

hjiang · July 10, 2020, 7:30pm

Hi there, just an update: VTA can now support Yolov3-tiny. Here (https://tvm.apache.org/docs/vta/tutorials/frontend/deploy_detection.html) is the YoloV3-Tiny tutorial for VTA.