[Deploy] Seeking advice on deploying ConvNets to ARM Cortex-M CPU + VTA

Hi everybody,
I want to deploy ConvNets to an ARM Cortex-M CPU + VTA on bare metal, that is, with no operating system running on the device. How should I go about it?
I can build the bitstream for “Cortex-M4 + VTA” and flash it to an FPGA, so I have a real device to work with. What I'm unsure about is how to offload the computation workload to VTA. I would also like to compile the model statically, ahead of time, as far as possible.
Is there an existing example I can refer to? Any advice is welcome.