Hello Andrew, @areusch.
Thanks for the very good feedback. Below I answer your questions and raise a few questions of my own. At the end I have tried to summarize a possible way of moving forward together.
Code Emitter
> This approach is similar to some others posted to the forum before:
> - µTVM Static Code Generator by @r.stahl
> - my hack to do this
I did not see this post before. The ideas are the same, but the API does not seem sufficiently elaborated. We would prefer to see such a tool bundled with microTVM rather than being a separate project.
> In general, I think the direct-to-C++ route (as compared with the TIR route) is simple and easy to hack on, but the TIR route lends us more avenues for graph-level optimization.
I totally support this.
> Does your approach handle workspace memory, allocated inside kernels (e.g. TVMBackendAllocWorkspace)?
In the current implementation these allocations are placed on the heap (via a malloc). Work is underway to redirect them to a special section, similar to the "workarea" in the older version of microTVM, which the application can then place anywhere it wants via a linker script.
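To illustrate the direction, here is a minimal sketch (not our actual implementation) of what that redirection could look like: a statically placed pool in an assumed `.workarea` section, with `TVMBackendAllocWorkspace` turned into a bump allocator over it. The pool size and section name are placeholders that the application and its linker script would control.

```c
#include <stddef.h>
#include <stdint.h>
#include <tvm/runtime/c_backend_api.h>

/* Hypothetical workspace pool placed in a dedicated ".workarea" section, so the
 * application's linker script decides where it lives (internal SRAM, external RAM, ...). */
#define WORKAREA_BYTES (64 * 1024)
static uint8_t g_workarea[WORKAREA_BYTES]
    __attribute__((section(".workarea"), aligned(8)));
static size_t g_workarea_used = 0;

/* Simple bump allocator behind the TVM C backend API, replacing the malloc path. */
void* TVMBackendAllocWorkspace(int device_type, int device_id, uint64_t nbytes,
                               int dtype_code_hint, int dtype_bits_hint) {
  (void)device_type; (void)device_id; (void)dtype_code_hint; (void)dtype_bits_hint;
  uint64_t aligned = (nbytes + 7u) & ~(uint64_t)7u;
  if (aligned > WORKAREA_BYTES - g_workarea_used) return NULL;  /* out of workspace */
  void* ptr = &g_workarea[g_workarea_used];
  g_workarea_used += (size_t)aligned;
  return ptr;
}

int TVMBackendFreeWorkspace(int device_type, int device_id, void* ptr) {
  /* Kernels release workspaces in LIFO order, so a real implementation can pop the
   * bump pointer; kept as a no-op here for brevity. */
  (void)device_type; (void)device_id; (void)ptr;
  return 0;
}
```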
> Could you say more about "it may be necessary that two models share their 'activation' pools"? Are these separate instances of the same model or two different models?
Two different models may be deployed simultaneously on a target but do not necessarily run in parallel. In this case, one "activation" pool can be allocated instead of two (big enough, of course, to accommodate the larger of the two models).
On the other hand, two separate instances of the same model can share a single "activation" pool (built-in, for example), or the application can allocate two different "activation" pools, one per instance, if the two instances need to run in parallel.
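A minimal sketch of the first case, with hypothetical model entry points and sizes (none of these names are part of our actual generated API): two models that never run concurrently share one application-allocated pool sized for the larger of the two.

```c
#include <stdint.h>

/* Hypothetical generated entry points; they stand in for whatever the code
 * emitter produces for each model and are not our actual API. */
extern int model_a_run(uint8_t* activations);
extern int model_b_run(uint8_t* activations);

/* Pool sizes are placeholders; the shared pool only needs to fit the larger model. */
#define MODEL_A_ACTIVATIONS_BYTES (96 * 1024)
#define MODEL_B_ACTIVATIONS_BYTES (64 * 1024)
#define MAX_BYTES(a, b) ((a) > (b) ? (a) : (b))

static uint8_t activations[MAX_BYTES(MODEL_A_ACTIVATIONS_BYTES, MODEL_B_ACTIVATIONS_BYTES)];

/* The two models never run concurrently, so both are handed the same pool. */
void run_models(void) {
  model_a_run(activations);
  model_b_run(activations);
}
```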
Firmware-facing API
This is an important point that needs a clear understanding and convergence.
> TVM does have a standard object-oriented Module-based Model Runtime Interface RFC
The Module based Model Runtime Interface discussion opens these questions:
> D1: do you like the factory pattern, shall we always require a model name field (and allow "default"), or shall we take the alternative API specialization approach.
The main discussion point here is the application interface for deploying and using the packaged model. The packaging itself is well addressed by the Model Library Format RFC (see below). The factory pattern aims at minimizing the API divergence between different deployment scenarios. The arguments for enforcing the generic factory pattern seem to be these:
- To have the same mechanism for packaging and loading.
- To let the users learn as little as possible.
Of the two alternatives, we would prefer API specialization for microTVM. In embedded ML there already are established APIs, such as X-CUBE-AI or TensorFlow Lite for Microcontrollers (the NXP tools expose a similar API as well); aligning the microTVM API with the GraphRuntime is therefore less relevant, since users are already familiar with these embedded APIs. Specializing the microTVM API also works well with the Project API concept.
That said, our C runtime can also go with the factory pattern. In particular, we have "model descriptors" that can be "loaded" at runtime and that carry all the necessary meta-information for each model. Based on this, the factory pattern could be implemented. However, given that we are in C, not C++, this would be special in terms of API and syntax, and therefore does not seem to make sense.
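For concreteness, the kind of specialized interface we have in mind looks roughly like the sketch below; every identifier in it is an illustrative placeholder (inspired by the existing embedded APIs), not our actual generated API.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: a specialized, X-CUBE-AI / TFLite-Micro style interface.
 * All names and fields below are placeholders, not our actual runtime API. */

/* Generated "model descriptor" carrying the meta-information for one model. */
typedef struct {
  const char* name;
  size_t activations_bytes;   /* size of the activation pool to provide */
  size_t num_inputs;
  size_t num_outputs;
  /* ... tensor shapes, quantization info, pointer to weights, ... */
} ai_model_info;

/* Runtime handle binding one descriptor to application-provided memory. */
typedef struct {
  const ai_model_info* info;
  uint8_t* activations;       /* allocated by the application */
} ai_model;

int ai_model_create(ai_model* m, const ai_model_info* info, uint8_t* activations);
int ai_model_run(ai_model* m, const void* const* inputs, void* const* outputs);
```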
> D2: set/run/get interface and predict
> - set interface is useful to allow users to set parameters during runtime.
> - run is useful to do fine grained benchmarking.
> - predict is a more high level user friendly API; note that we still want to allow destination passing style (pass out) to allow more flexibility. predict forces us to enable runtime tuple support in the case of multiple output, while get_output keeps things simple and minimum.
We prefer to align with the current industry "standard" APIs.
@areusch, concerning the two points that you raised:
> - PackedFunc are looked-up by string name. This is inefficient in terms of both memory and runtime. I think we still need to maintain that string lookup to keep compatibility with the RPC server implementation which drives autotuning. However, I wonder if we might consider making it a convention to implement PackedFunc with particular symbol names so that they could be called directly in production without string lookup.
If I understand correctly, the main application must be able to look up operator functions via their string names. This can be implemented by providing an additional API method in the C runtime. Since it will be used with autotuning, we probably do not care as much about the performance of the string lookup and can allow a string compare, for example. Perhaps I did not get the point?
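As a sketch of what I mean by an additional lookup method (the registry table and the assumption that the code emitter fills it in are mine, not part of the current runtime):

```c
#include <stddef.h>
#include <string.h>
#include <tvm/runtime/c_backend_api.h>

/* Sketch of the lookup method: a table mapping operator symbol names to their
 * functions. The autotuning/RPC path resolves by string; production code keeps
 * calling the symbols directly. Names here are illustrative. */
typedef struct {
  const char* name;
  TVMBackendPackedCFunc func;
} func_registry_entry;

/* Assumed to be emitted alongside the generated operators. */
extern const func_registry_entry g_func_registry[];
extern const size_t g_func_registry_count;

TVMBackendPackedCFunc lookup_packed_func(const char* name) {
  for (size_t i = 0; i < g_func_registry_count; ++i) {
    if (strcmp(g_func_registry[i].name, name) == 0) return g_func_registry[i].func;
  }
  return NULL;  /* not found */
}
```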
> - Arguments and return values need to be wrapped in TVMValue. I don't think we can get around this one, but we could implement wrappers to the firmware-facing executor functions to simplify this.
I am not sure I understand the issue. Could you elaborate?
> I wonder if there are other differences or critiques you could find of the C runtime that would improve it? It would be great to at least standardize the runtime between these two implementations. This would be in a follow-on RFC, though.
Summarizing my comments above, we would go for a specialized API for microTVM deployment. We would prefer alignment with the APIs currently used by the embedded industry over alignment with the GraphRuntime API.
Code Emitter vs TIR-based approach
From our perspective, the TIR-based implementation is preferable, and when it becomes possible we would like to move our code emitter there.
> - Rework the PoC to consume Model Library Format and implement the Project API. Regarding the question of whether this should be applicable to autotuning or also to deployment: my thought was that this would be decided by the project API implementation (either create an option or a separate implementation for each scenario).
Agree, we are looking into this. See a few questions below in Testing and Code Location.
> - When available, use the TIR-based comprehensive memory planner (it seems nearly identical to the one you've implemented, and would generate JSON describing the memory pools).
We thought that the "storage_id" carried the results of the memory planner. Is there another mechanism? Agree on this point as well.
> - Ensure at least the TVMBackend* functions are used from the C runtime, which provides a pathway to migrate to the TIR-based memory planner and avoids diverging too far in terms of generated code.
Tell me if this is what you meant. One important point of our implementation is that the memory is managed by the application, via whatever method the application may choose. The C runtime does not perform any memory allocations (no `TVMBackendAlloc` or `TVMBackendFree`).

As it is, our runtime does not provide memory allocation methods, but if there is a reason to do so (some sort of TVM storage), it can be hooked to the `TVMBackend*` functions. The C runtime does use `TVMbackendLastError`.
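In other words, something along these lines is what we mean by "hooked" (a sketch only; the `app_*` hooks are hypothetical names, and the runtime itself would still perform no allocation, only forward the calls):

```c
#include <stdint.h>
#include <tvm/runtime/c_backend_api.h>

/* Hypothetical application-provided hooks; the application decides where and how
 * the workspace memory is managed. */
extern void* app_workspace_alloc(uint64_t nbytes);
extern void app_workspace_free(void* ptr);

void* TVMBackendAllocWorkspace(int device_type, int device_id, uint64_t nbytes,
                               int dtype_code_hint, int dtype_bits_hint) {
  (void)device_type; (void)device_id; (void)dtype_code_hint; (void)dtype_bits_hint;
  return app_workspace_alloc(nbytes);
}

int TVMBackendFreeWorkspace(int device_type, int device_id, void* ptr) {
  (void)device_type; (void)device_id;
  app_workspace_free(ptr);
  return 0;
}
```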
> Finally, I'd also propose we consider simplifying the C runtime API as discussed in Firmware-facing API section.
Are there particular simplification points that you have in mind?
Testing and Code Location
> Could you speak a bit more to how this code could be tested in the TVM CI?
Good question. The demo application cannot be tested in hardware without an available board. However, we can provide a sanity check for the generated C code and the runtime layer that can be built on the host (x86). This way, the code emitter and runtime will be tested, but not the on-the-board application.
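As an illustration of the kind of check we have in mind (the entry-point names and tensor shapes below are placeholders for whatever the emitter generates, not the actual API):

```c
/* Host (x86) sanity check sketch: compile the generated network.c / operators.c
 * and the C runtime natively, run one inference on zeroed inputs and check the
 * return codes. The ai_* entry points and the tensor sizes are placeholders. */
#include <stdio.h>
#include <string.h>

extern int ai_network_setup(void);                        /* placeholder */
extern int ai_network_run(const float* input, float* output);  /* placeholder */

int main(void) {
  static float input[1 * 28 * 28];   /* placeholder input shape */
  static float output[10];           /* placeholder output shape */
  memset(input, 0, sizeof(input));
  if (ai_network_setup() != 0 || ai_network_run(input, output) != 0) {
    fprintf(stderr, "sanity check: FAILED\n");
    return 1;
  }
  printf("sanity check: PASSED\n");
  return 0;
}
```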
As for the code location, the demo application is intended for STM32 users starting out with TVM (as a company we distribute the CubeMX solution, eventually with TVM integrated inside). Separate CubeMX projects will most probably also exist, but I think it is important to have a clear demo project in the spirit of TVM (not hidden inside the CubeMX tool). We would go with `apps/microtvm/no-ci`, or `apps/microtvm` with an x86 sanity-check CI. We need to settle on this. What is your preference?
> D1. Between this approach and a TIR-based AOT, do you guys have a preference which you would prefer to work with, assuming both were implemented?
Normally, the TIR-based AoT is preferable. But, as I mentioned in the post, we may end up with several AoTs for different targets. Would this be in line with what is intended for microTVM? How quickly can we move onto this framework? @giuseros
> D2. While the Python APIs are perfectly fine, one goal of Model Library Format is to enable downstream tools such as this to work with TVM with less API drift. Do you guys prefer the Python API, or would this also be an interface you'd be open to consuming?
From what I understand, the Model Library Format is intended as a deployment format in TVM. So I see how it makes sense for the code emitter to generate the Model Library Format and transmit it to the external project. Of course, the code emitter could also consume the Model Library Format, but this seems less appropriate to us.
If we admit that the code emitter generates the Model Library Format, there are a couple of things that need to be clarified:
- The Model Library Format looks like a draft proposal (correct me if I am wrong here). Do we have a more formal document describing the format? For example, what are the contents of `runtime-config/aot`?
- The `host` vs `target-key`: I imagine that in the STM32 case, the generated sources, the `network.c` and the `operators.c`, go to the `host/src` directory, right? We also generate the `network_data.c` with params. I'd propose to place this with the `host/src` sources as well (see the layout sketch after this list).
- The generated C code targets a standalone runtime API, which is different compared to the TVM-built GraphRuntime API from the `crt`. Should we populate the `crt` with the standalone C runtime code instead? Minor point: the Makefile is not generated by the standalone code emitter since it is included from the external project.
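To make the `host/src` point concrete, this is roughly the layout we would expect to produce and consume; apart from the directories and files discussed above, the remaining names (e.g. `metadata.json`) are assumptions on our side:

```
model_library/
  metadata.json          (assumed top-level description of the archive)
  crt/                   (C runtime sources, or the standalone runtime per the question above)
  runtime-config/
    aot/                 (contents to be clarified)
  host/
    src/
      network.c          (generated model structure)
      operators.c        (generated operator kernels)
      network_data.c     (parameters; our proposal)
```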
> D3. In general, the challenge with checking code such as this into the TVM repo is testing. Particularly with bare-metal code, it's hard to test without hardware in the loop, and the TVM CI doesn't really have a provision for that now. Do you guys have a proposal how we might test this code?
As I explained earlier, we will put in place minimal sanity testing of the generated C model and its runtime on the CI host.
In addition, we work with the Linaro foundation, which has a farm of HW boards used for their CI. Linaro is also looking into microTVM, and it seems reasonable to try to find common ground where TVM could use the Linaro infrastructure for microTVM development. I am adding @vinceab, our Linaro contact, to this thread.
Summary
In order to move forward, let's settle on these points:
- Do we agree on our position for the C runtime API? Are there any particular points regarding C runtime API simplifications/additions/improvements?
- We need to understand this point: "Ensure at least the TVMBackend* functions are used from the C runtime …"
- We need to understand the memory planner point: "When available, use the TIR-based comprehensive memory planner (it seems nearly identical to the one you've implemented, and would generate JSON describing the memory pools)."