@giuseros @tqchen @kparzysz
Lots to catch up on here, thanks for the great discussions!
@giuseros:
I agree with @areusch about having a crt_backend_api.c
minimal and a rcp_backend_api.c
that adds more functionality.
Just to clarify: I think that the TVMBackend functions are strictly meant to be called from generated code, and I think that all generated code (be it RPC or otherwise) should use the same backend API (however, it’s possible some functions may be left unimplemented depending on the compilation settings). I think it’s possible we might have multiple implementations, but I’d expect that we would need just one for both Graph and AOT executor on micro.
I do think that crt_runtime_api.c
is mostly specific to RPC-driven execution, and will likely be moved almost wholesale out of the backend library.
From my POV this is the real runtime of our compiler.
This is actually kinda true–crt_backend_api.h
should define the “runtime.” Right now due to the memory allocation though, it spills over into platform.h
and other libraries of CRT. Perhaps we can find a way to consolidate these so the organization makes more sense.
Our approach would be to reduce, at least for AoT, this API to a minimum set of functions
- Memory allocation (for now)
- Parallel execution
- What about errors? For now the error API (
setLastError
, getLastError
) is part of the >c_runtime_api
, but the setter should be part of the backend API and the getter of the runtime_api
.
So I agree with this, and I think that crt_backend_api.h
should define anything needed here. The implementation of that, for the C runtime, should eventually be possible to do on terms favorable to embedded development (e.g. memory management should be something we can handle without dynamic memory).
Errors: currently convention is to return an int32_t from all PackedFunc if an error occurs. On microTVM, I have slightly abused this to allow us to return tvm_crt_error_t from TVMBackend functions (e.g. TVMBackendAllocWorkspace). In my experience, the more detail an error can provide, the better off you are being able to debug it. So kind of, I think we should adopt somewhere explicitly the convention that 0 is success and non-zero is a runtime-specific error code; this would then enshrine our ability to continue doing this.
Errors in embedded systems can take weeks to reproduce and may be caught only through logging systems built to store records of them in production. My strong opinion is that we cannot presume that the debugger will be attached at the time an error occurs, and must export enough information to allow developers to act accordingly. In particular, a stacktrace is not guaranteed to be available.
-
TVMValue
is a int64
union, and most embedded devices will struggle to deal with int64
.
I am not so sure about this, @giuseros. I agree that int64_t are difficult to deal with on an 32-bit embedded system, but for the most part we are not passing int64_t to PackedFunc. We are passing void* v_handle. Would that not be dealt with by writing the upper word to 0 and just issuing a store-word instruction to the lower word?
I am happy to be proven wrong here–I stand by my earlier assertion of not depending on the compiler :). Just my feeling is that if we are doing lots of int64_t math, that is one thing, but stuffing the occasional parameter into int64_t at function call boundaries doesn’t seem particularly slow considering the remainder of the function bodies.
One area I think we could obviously improve is TVMBackendAllocWorkspace (e.g. replace the size with size_t), but also it’s possible we will just move away from that function for micro/AOT, so it may just be a non-issue for us? Given it would be a cross-runtime change, I sort of prefer to see how this shakes out first.
Function call signatures
@kparzysz
When you talk about firmware, are you thinking about firmware calling the graph runner? If such calls are infrequent, CPackedFunc
should not be that much of a burden. I’d like to stick with CPackedFunc
, because we already use it.
Yes, though in this case it could be either graph executor (neé runner) or AOT executor. The current plan of record for TVM is that all executors are to implement the Module-based Model Runtime Interface. We can consider changes to that interface, but that’s a separate RFC. I don’t believe there’s any reason why we can’t define a single interface which both AOT and Graph executors can implement. Implementing to a single interface gives significant benefits in the RPC and non-micro use cases of AOT.
I’ll address CPackedFunc vs another below:
@areusch @tqchen I guess that we can add a TVMBackendPackedCFunc
wrapper function if the RPC side of the things need it. But is there any reason for not having the low level function written in plain C without typeid
s?
@kparzysz Here I don’t think there is a huge runtime burden on the CPU to use TVMBackendPackedCFunc
as the firmware-facing API. The burden is more on the developer–for instance, we have currently packed_func.h
checked-in as utility functions to help with PackedFunc calls. But all of this is a lot to swallow when you could just be calling standard C functions.
It’s also difficult to document. One thing that really annoys me about the interface exported from libtvm_runtime.so
and libtvm.so
is that we actually just don’t document it. Why? Because there is no doxygen for PackedFunc. Yet, that is supposed to be the core TVM interface. In this situation, we actually wrote Python wrappers which also add convention (example) on top of the PackedFunc interface.
So while I understand it is a common calling convention in TVM, I don’t view what we have today for PackedFunc as sufficient to be a developer-facing interface. Adding a C shim layer is the same thing we do in Python today. I think we need significant improvements to our documentation generator tooling if we want to stick with a pure PackedFunc developer-facing API.
Actually, I think the best way to move forward would be to sketch something and progressively agree on how it looks. Have you got any suggestions on how to do this sort of “sketching”? Maybe a draft PR not meant to be merged but only to spark discussion?
Let’s spin the firmware-facing API into a separate RFC/PoC PR. @giuseros, do you want to propose something?
-
TVMValue
s need to be packed/unpacked every time for every operator call
On the question of how should TVM internally call operator functions: I don’t have an opinion either way. I think that TQ has done some experimentation to show that with recent gcc and clang, the compiler does a lot to optimize out PackedFunc calls. However, I want to point out that on most embedded projects, developers have no say in the compiler they use, either because switching compilers would leave them without support from the SoC vendor for the hardware abstraction libraries, or because project timelines do not allow for the work needed to switch compilers. So I stick with my position that we should not expect anything from the compiler.
However, I think any such implementation would need to be done in the TIR lowering/codegen phase, rather than treating the calling convention as a codegen property. Keeping this implementation in TIR allows us to remove the PackedFunc overhead when it is meaningful to do so; while maintaining the ability to link with arbitrary PackedFunc should that be necessary, and maintaining the ability to wrap an arbitrary operator implementation in PackedFunc.
On this topic, @tqchen says:
X1 won’t provide any mechanism to detect such mismatch during runtime(if debug is enabled) while X0 allows us to provide type checking to do so.
In the C runtime case, I would posit that we are almost always building a static binary, so we can rely on the compiler’s type-checker to solve this problem and the linker to resolve references to functions.
→ PackedFunc lookup
Finally, a somewhat related topic is PackedFunc lookup. As we discussed before, we’ll likely replace string lookup with a function mangling scheme. This is yet another departure from the current implementation of tir.call_packed_lowered
. In this modification, we avoid calling TVMBackendGetFuncFromEnv
and instead just refer to the PackedFunc by mangled name. It seems that implementing this, we could stick with tir.call_packed_lowered
, but create an attribute to indicate the choice between TVMBackendGetFuncFromEnv
and the mangled function name.
Next steps
I’ve replied to the PR thread on my opinion how best to push forward on the existing AOT PR. That would be: merging the TIR top-level function plus anything strictly needed to test that, and leaving out pieces such as memory management and calling convention changes. I am not proposing we drop that work; instead I propose we keep discussing either here or as separate RFC, and implement those in follow-on PRs.
If we are in agreement with those next steps for the PR, I see these follow-on discussions:
- Memory management: how best to implement embedded-friendly memory planning in place of GraphPlanMemory
- Calling convention: Shall we propose a compilation mode for use with AOT which selectively drops PackedFunc for internal operator implementations?
- Firmware-facing API: Which interface shall we expose to the firmware developer, and how? How should this impact the Module-based Model Runtime Interface?
- Initialization flow: How shall we initialize the SoC (and any hardware needed)? Which hooks shall we provide from the executor to integrate with SoC?
I think each of these should be its own RFC in a follow-on thread. Currently, my time’s occupied on the Project API implementation–would anyone here like to spearhead any of the RFCs? If so, could you reply here indicating so, and after posting the follow-on RFC, link it from this thread.
Thanks!
Andrew