[µTVM] Simplifying the Compiler interface

@leo-arm thanks for your reply! I agree we should find a way to avoid requiring flashing on each AutoTVM iteration. Let’s think this through a bit more on this thread, since it would impact one of the core use cases of a Project abstraction. I think ultimately we should propose the full design in a separate RFC.

First, let’s summarize the concerns:

C1. We cannot erase the flash too often (flash parts tolerate only a limited number of erase cycles).

C2. We need to configure the system to match production, performance-wise.

C3. The location of parameter tensors could impact performance measurements.

In service of C2, some component needs to live at the Reset vector in flash, and we should also ensure we control all IRQ handlers, particularly the unimplemented ones. Currently, the Zephyr runtime/main() startup code handles this, and I don’t think it needs to change for this proposal.
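
Just as a point of reference (Zephyr already provides this, so nothing here is new work), the usual pattern is a weak default handler that unimplemented IRQ vectors are aliased to, so a stray interrupt halts visibly instead of silently skewing a measurement:

    /* Typical Cortex-M-style default handler; unimplemented vector table entries
     * are weak-aliased to it by the startup code. Spinning here makes a stray
     * interrupt show up as a hang/RPC timeout rather than a timing anomaly. */
    void DefaultIrqHandler(void) {
      for (;;) {
      }
    }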

To work around C1, we could consider a solution like the following:

  1. Presume we have already flashed a minimal µTVM C runtime with an RPC server but no compiled operators.

  2. Either the runtime (via a PackedFunc RPC call) or the Project implementation provides additional information about the RAM available for executing code.

  3. When RAM is available, the Project implementation provides a function, e.g. CompileToRAM, which compiles the generated operator code with a modified linker script so that it is placed in (and executes from) RAM.

  4. Additionally, the Project implementation needs to include a small shim library which defines trampoline functions for the TVMBackend APIs (operator implementations can perfectly legally depend on these; see the trampoline sketch after this list). The shim library also defines a global pointer, _tvm_backend_functions, of type struct TVMBackendLinkTable*. Here is an example:

    // Requires <stdint.h> for uint64_t.
    #include <stdint.h>

    struct TVMBackendLinkTable {
        // Same signature as TVMBackendAllocWorkspace in c_backend_api.h (returns void*).
        void* (*TVMBackendAllocWorkspace)(int device_type, int device_id, uint64_t nbytes,
                                          int dtype_code_hint, int dtype_bits_hint);
        int (*TVMBackendFreeWorkspace)(int device_type, int device_id, void* ptr);
        // Additional TVMBackend functions...
    };
    
  5. The device is reset and a new transport is opened.

  6. Using a new upload_and_link RPC call, the compiled code is sent to the TVM C runtime. The call carries the following information:

    • The start address of the code.
    • The size of the code, in bytes.
    • The address of _tvm_backend_functions.
    • The address of the TVMFuncRegistry defined in the module (must not be NULL).
    • The code itself.

    The C runtime allocates a contiguous block of memory at the specified address, then stores the code in that block. When the upload finishes, the C runtime writes _tvm_backend_functions to point at its own internal implementation. Then it instantiates a new module in the global module table and sets the module’s TVMFuncRegistry pointer to the one given in the upload_and_link call. Finally, a TVMModuleHandle is returned. (A device-side sketch follows after this list.)

    From this point, the user can use the uploaded blob either according to the Module-based Model Runtime Interface (i.e. for experimentation) or by individually looking up functions (i.e. for autotuning).
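
To make item 4 a bit more concrete, here is a rough sketch of what the shim’s trampolines could look like. This is not existing code: the header name is made up, and only the two functions from the table above are shown.

    /* Hypothetical shim source; tvm_backend_link_table.h would hold the
     * struct TVMBackendLinkTable definition shown in item 4. */
    #include <stdint.h>
    #include "tvm_backend_link_table.h"

    /* Defined by the shim; the on-device C runtime rewrites this pointer during
     * upload_and_link to point at its own implementation table. */
    struct TVMBackendLinkTable* _tvm_backend_functions;

    /* Trampolines: the generated operator code links against these symbols as
     * usual, but each call is forwarded through the link table at runtime. */
    void* TVMBackendAllocWorkspace(int device_type, int device_id, uint64_t nbytes,
                                   int dtype_code_hint, int dtype_bits_hint) {
      return _tvm_backend_functions->TVMBackendAllocWorkspace(
          device_type, device_id, nbytes, dtype_code_hint, dtype_bits_hint);
    }

    int TVMBackendFreeWorkspace(int device_type, int device_id, void* ptr) {
      return _tvm_backend_functions->TVMBackendFreeWorkspace(device_type, device_id, ptr);
    }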
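
Similarly, here is a rough sketch of how the C runtime might service upload_and_link on the device. Error handling and the RPC framing are omitted, and HandleUploadAndLink, runtime_backend_table, and RegisterUploadedModule are illustrative names, not an existing API.

    #include <stdint.h>
    #include <string.h>
    #include "tvm_backend_link_table.h"  /* hypothetical header from the shim sketch */

    /* The runtime's own TVMBackend* implementations, defined elsewhere. */
    extern struct TVMBackendLinkTable runtime_backend_table;

    /* Adds a TVMFuncRegistry to the global module table; returns a module index. */
    extern int RegisterUploadedModule(const void* func_registry);

    int HandleUploadAndLink(uintptr_t start_addr, size_t code_size_bytes,
                            struct TVMBackendLinkTable** link_table_addr,
                            const void* func_registry, const uint8_t* code) {
      /* 1. Copy the uploaded code into the agreed-upon RAM block.
       *    NOTE: parts with an instruction cache would need cache maintenance here. */
      memcpy((void*)start_addr, code, code_size_bytes);

      /* 2. Point the shim's _tvm_backend_functions at the runtime's implementations. */
      *link_table_addr = &runtime_backend_table;

      /* 3. Register the module's TVMFuncRegistry in the global module table; the RPC
       *    layer wraps the returned index as a TVMModuleHandle for the host. */
      return RegisterUploadedModule(func_registry);
    }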

Finally, let’s address C3. In my mind, there are two aspects to C3: the physical memory block that holds the tensor, and the alignment of the tensor data relative to the cache line size. For the most part, the second aspect should not be a concern, because we are generally going to be tuning with large parameters. It could potentially impact the reproducibility of measurements with small tensors, e.g. kernels, though.

For the moment, let’s concern ourselves with the first aspect: we may still need to place parameters in flash for autotuning to match the production system. One easy optimization is to place one or two parameter tensors in flash along with the runtime, and provide a special PackedFunc akin to _lookup_linked_param (or perhaps we just reuse that one) to provide the autotuner with a DLTensor data handle. This approach should work across autotuning runs of a single kernel. We could consider generalizing it by computing all candidate input shapes up front, but this may be complex and perhaps unnecessary.
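
For illustration, the flash-resident parameter lookup could look something like the sketch below. The symbol names and the single hard-coded parameter are hypothetical; the real thing would presumably mirror _lookup_linked_param.

    #include <stddef.h>

    /* Hypothetical: a parameter tensor linked into flash alongside the runtime,
     * placed in .rodata by the linker script. */
    extern const float g_conv0_weight_data[];

    /* In the spirit of _lookup_linked_param: given a parameter id, return a data
     * pointer the RPC/autotuning layer can wrap as the `data` field of a DLTensor. */
    const void* LookupLinkedParam(int param_id) {
      switch (param_id) {
        case 0:
          return g_conv0_weight_data;
        default:
          return NULL;  /* unknown parameter: fall back to RAM allocation */
      }
    }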

I’d love to hear more thoughts or concerns from your side about an approach like this. We could also take a different route by pushing the RPC server onto the host, but it would be good to spell out the pros and cons of that approach more specifically.