Hello Andrew @areusch
In my mind, some setup function is needed to accomplish:
- initializing memory set aside for tensors and parameters
- configuring accelerators, including starting (possibly backgrounded) transfers of any programming/parameters.
I think that the TVM function for this is the factory function (right now, typically mod["default"]), and the X-Cube equivalent is ai_[<model_name>_]create. Does that match your understanding?
That is exactly right.
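To illustrate the correspondence, here is a minimal sketch of the setup sequence we have in mind (the ai_* signatures are simplified stand-ins in the X-Cube style, not the exact X-CUBE-AI prototypes, and `ACTIVATIONS_SIZE`, `network_params`, `model_setup` are placeholders):

```c
/* Sketch of the setup sequence under discussion. The create call plays
 * the role of TVM's factory function (mod["default"]): it binds the
 * parameters and prepares per-model state before any inference runs. */
#include <stddef.h>
#include <stdint.h>

#define ACTIVATIONS_SIZE 1024 /* placeholder size */

typedef void *ai_handle;

/* Simplified stand-ins for the generated, X-Cube-style entry points. */
extern ai_handle ai_network_create(const uint8_t *params, size_t params_size);
extern int ai_network_init(ai_handle h, uint8_t *activations);

extern const uint8_t network_params[]; /* generated weights blob */
extern const size_t network_params_size;

static uint8_t activations[ACTIVATIONS_SIZE]; /* application-owned memory */

int model_setup(void) {
  ai_handle net = ai_network_create(network_params, network_params_size);
  if (net == NULL) return -1;
  /* Accelerator configuration / backgrounded parameter transfers would
   * be kicked off here. */
  return ai_network_init(net, activations);
}
```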
Apologies, I think I was a bit confused before. IIUC, I think this port aims to implement an API aligned with the X-Cube API, at least for now only aiming to enable deployments to STM32–does that also seem right to you? I’m curious whether this API aims to replace the C runtime and Model-based Module Runtime Interface for all targets or if this would just be confined to STM32 for now.
If I am ambitious, I would say replace it for a whole family of embedded targets. Sorry, perhaps I was not clear earlier. We have observed that several embedded tools have converged on such an API:
- X-CUBE-AI, of course
- TensorFlow Lite for Microcontrollers
- NXP eIQ-GLOW AOT NN Compiler
That seems a good argument for aligning the TVM C API in this direction as well.
We probably need to change the naming, perhaps using a tvm_ai_ prefix instead of just ai_, but this is a detail. The important point is that there are a dozen methods common to the above APIs, and that memory management is left to the main application to handle.
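To make the shape of that common surface concrete, here is a rough sketch of what a tvm_ai_-prefixed API could look like (the names and signatures are purely illustrative assumptions, not a finalized proposal). The defining property is that every buffer is supplied and owned by the application:

```c
/* Hypothetical tvm_ai_ API surface, modeled on the common denominator
 * of X-CUBE-AI / TFLM / eIQ-style interfaces. Illustrative only. */
#include <stddef.h>
#include <stdint.h>

typedef void *tvm_ai_handle;

/* Lifecycle: the application provides the activation memory. */
tvm_ai_handle tvm_ai_create(const void *params, size_t params_size);
int tvm_ai_init(tvm_ai_handle h, uint8_t *activations, size_t activations_size);
int tvm_ai_destroy(tvm_ai_handle h);

/* Introspection and I/O: tensors point into application-owned buffers. */
int tvm_ai_get_input(tvm_ai_handle h, int index, void **data, size_t *nbytes);
int tvm_ai_get_output(tvm_ai_handle h, int index, void **data, size_t *nbytes);

/* Execution. */
int tvm_ai_run(tvm_ai_handle h);
```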
I propose to start with the STM32 code emitter now and work together with the TIR-based AoT effort on converging to a common understanding. This will pave the way for us to move to the TIR-based code generator. We can perhaps also contribute to its development.
Then the next questions I have would be around how you’d like to proceed with this going forward. At present, the STM32 generator PR you’ve proposed has several features that are missing from the microTVM compiler (e.g. memory pinning, AOT, etc.). As we implement these features, will it be possible to incorporate them into this generator as well (i.e., to take advantage of compiler-level improvements we might be able to make, such as graph-level optimization)?
This would be the plan. I can imagine a couple of things we can do with the TIR-based AoT that we cannot with our current code emitter.
If so, it would be great to keep the STM32 API semantically similar to the TVM C runtime API, so that we can later invoke TVM C runtime APIs from the STM32 functions. I suspect these are pretty similar, but just want to understand the goals for code-reviewing your PR. One possible scenario is: when we have a TVM AOT runtime and memory pinning available, we could rework ai_create to instantiate the TVM C AOT runtime. It would also be great to use the STM32 API as inspiration to expand the TVM APIs to provide equivalent functionality. Please let me know your thoughts here!
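For instance (purely illustrative; tvm_aot_create below is an assumed future entry point, not an existing TVM API), ai_create could one day reduce to a thin wrapper over the TVM C AOT runtime:

```c
/* Hypothetical future rework: ai_create instantiates the TVM C AOT
 * runtime. tvm_aot_create() does not exist today -- it stands in for
 * whatever initializer the AOT runtime ends up exposing. */
extern void *tvm_aot_create(const void *params, unsigned long params_size);

typedef void *ai_handle;

ai_handle ai_network_create(const void *params, unsigned long params_size) {
  return (ai_handle)tvm_aot_create(params, params_size);
}
```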
This corresponds entirely to our vision. Great!
So my question here is: in the future, would you be open to using a TVM-side implementation of a memory-pool, statically-allocated memory planner? I think it sounds like that’d be okay, but just confirming.
Yes. We will move away from the JSON graph and base the code emission on the TIR-based TVM structures, including the memory planner.
When we do tensor pinning, I think it’s likely I’ll propose to add some tensor_id (note: different from storage_id, as storage_id could contain multiple tensor_id) to TVMBackendAllocWorkspace, and a lookup table could just return a pointer into the pre-allocated memory pool. TVMBackendFreeWorkspace would become a no-op. Will that work for you guys?
That is good. Just keep in mind that these memory pools should remain open to static allocation as a section via a linker script, to static allocation as a table in the main application (.data), and to dynamic allocation via whatever allocator the application may choose.
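Here is a minimal sketch of the kind of generated code this implies (the names carry a Pinned suffix to make clear these are not the current TVMBackendAllocWorkspace/TVMBackendFreeWorkspace signatures; pool size and offsets are placeholders for memory-planner output):

```c
/* Sketch of pinned-tensor allocation. The pool shown here is a plain
 * .data array; it could equally be placed in a dedicated section via
 * the linker script, or handed in at runtime by the application. */
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 4096 /* placeholder: computed by the memory planner */
#define NUM_TENSORS 8  /* placeholder: number of pinned tensors */

static uint8_t g_tensor_pool[POOL_SIZE] __attribute__((aligned(16)));

/* tensor_id -> byte offset into the pool, emitted by the planner. */
static const size_t g_tensor_offset[NUM_TENSORS] = { 0 /* generated */ };

void *TVMBackendAllocWorkspacePinned(int device_type, int device_id,
                                     uint32_t tensor_id) {
  (void)device_type;
  (void)device_id;
  return &g_tensor_pool[g_tensor_offset[tensor_id]];
}

int TVMBackendFreeWorkspacePinned(int device_type, int device_id, void *ptr) {
  (void)device_type;
  (void)device_id;
  (void)ptr;
  return 0; /* no-op: lifetimes are resolved statically */
}
```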
- consider removing the need to use PackedFunc looked-up by string name, and instead provide more natural C wrappers around those functions
This is already the case.
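For reference, this is the kind of wrapper we mean (schematic; lookup_packed is a placeholder for the runtime's actual resolution mechanism, not a real TVM call):

```c
/* Schematic C wrapper hiding the string-based PackedFunc lookup. The
 * string lookup happens at most once; subsequent calls go through the
 * cached function pointer. */
#include <stddef.h>

typedef void *tvm_ai_handle;
typedef int (*run_fn)(tvm_ai_handle h);

extern void *lookup_packed(const char *name); /* placeholder resolver */

static run_fn g_cached_run;

int tvm_ai_run(tvm_ai_handle h) {
  if (g_cached_run == NULL) {
    g_cached_run = (run_fn)lookup_packed("run"); /* one-time lookup */
  }
  return g_cached_run(h);
}
```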
- consider creating a mapping from PackedFunc string name to a global symbol name to shortcut this lookup, as they won’t likely be dynamically overridden in embedded applications.
We will add an API method for such a lookup, implementing the mapping.
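Something along these lines (illustrative; network_run/network_setup stand in for whatever symbols the code emitter actually generates):

```c
/* Illustrative name-to-symbol map: the string lookup degenerates to a
 * small compile-time table, and embedded callers that know the symbol
 * can bypass the lookup entirely. */
#include <stddef.h>
#include <string.h>

typedef int (*packed_c_fn)(void *args, int num_args);

extern int network_run(void *args, int num_args);   /* generated symbol */
extern int network_setup(void *args, int num_args); /* generated symbol */

static const struct {
  const char *name;
  packed_c_fn fn;
} g_func_map[] = {
    {"run", network_run},
    {"setup", network_setup},
};

packed_c_fn tvm_ai_lookup(const char *name) {
  for (size_t i = 0; i < sizeof(g_func_map) / sizeof(g_func_map[0]); ++i) {
    if (strcmp(g_func_map[i].name, name) == 0) return g_func_map[i].fn;
  }
  return NULL;
}
```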
Would it be possible to check in a docker container, e.g. tlcpack/ci-stm32, which could run this in our CI? Then we can just make it a first-class example and place it in apps/microtvm/stm32 or a similar sub-directory of microtvm of your choosing.
Yes. Noted.
The Module Library Format seems not fully finalized yet

That’s fine. I will generate the structure as per your RFC proposal (no crt), and we can refine it from there. This is a minor detail.
Actions for us:
Re-submit the PR with this:
- Move to generating Module Library Format (as it is for now).
- Provide the docker and a test application for the sanity CI.
- Move to the Project API on the demo side (structure + microtvm_api_server.py), implementing the Standalone Demo Project Generator based on your PoC.
We will continue the discussion on the C runtime API. How should we involve the AoT people? We can contribute to the development if necessary.
Does this work for you?
Cheers
Arthur