[µTVM] Simplifying the Compiler interface

@leo-arm thanks for your reply! I agree we should find a way to avoid requiring flashing on each AutoTVM iteration. Let’s think this through a bit more on this thread, since it would impact one of the core use cases of a Project abstraction. I think ultimately we should propose the full design in a separate RFC.

First, let’s summarize the concerns:

C1. We cannot erase the flash too often (flash parts tolerate only a limited number of erase cycles).

C2. We need to configure the system to match production, performance-wise.

C3. The location of parameter tensors could impact performance measurements.

In service of C2, some component needs to live at the Reset vector in flash, and we should also ensure we control all IRQ handlers, particularly the unimplemented ones. Currently, the Zephyr runtime/main() startup code handles this, and I don’t think it needs to change for this proposal.
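
Just as a point of reference (Zephyr already provides this, so nothing here is new work), the usual pattern is a weak default handler that unimplemented IRQ vectors are aliased to, so a stray interrupt halts visibly instead of silently skewing a measurement:

    /* Typical Cortex-M-style default handler; unimplemented vector table entries
     * are weak-aliased to it by the startup code. Spinning here makes a stray
     * interrupt show up as a hang/RPC timeout rather than a timing anomaly. */
    void DefaultIrqHandler(void) {
      for (;;) {
      }
    }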

To work around C1, we could consider a solution like the following:

  1. Presume we have already flashed a minimal µTVM C runtime with an RPC server but no compiled operators.

  2. Either the runtime (via a PackedFunc RPC call) or the Project implementation provides additional information about the RAM available for executing code.

  3. When RAM is available, the Project implementation provides a function, e.g. CompileToRAM, which compiles the generated operator code with a modified linker script so that it is placed in (and executes from) RAM.

  4. Additionally, the Project implementation needs to include a small shim library which defines trampoline functions for the TVMBackend APIs (operator implementations can perfectly legally depend on these; see the trampoline sketch after this list). The shim library also defines a global pointer, _tvm_backend_functions, of type struct TVMBackendLinkTable*. Here is an example:

    // Requires <stdint.h> for uint64_t.
    #include <stdint.h>

    struct TVMBackendLinkTable {
        // Same signature as TVMBackendAllocWorkspace in c_backend_api.h (returns void*).
        void* (*TVMBackendAllocWorkspace)(int device_type, int device_id, uint64_t nbytes,
                                          int dtype_code_hint, int dtype_bits_hint);
        int (*TVMBackendFreeWorkspace)(int device_type, int device_id, void* ptr);
        // Additional TVMBackend functions...
    };
    
  5. The device is reset and a new transport is opened.

  6. Using a new upload_and_link RPC call, the compiled code is sent to the TVM C runtime. The call carries the following information:

    • The start address of the code.
    • The size of the code, in bytes.
    • The address of _tvm_backend_functions.
    • The address of the TVMFuncRegistry defined in the module (must not be NULL).
    • The code itself.

    The C runtime allocates a contiguous block of memory at the specified address, then stores the code in that block. When the upload finishes, the C runtime writes _tvm_backend_functions to point at its own internal implementation. Then it instantiates a new module in the global module table and sets the module’s TVMFuncRegistry pointer to the one given in the upload_and_link call. Finally, a TVMModuleHandle is returned. (A device-side sketch follows after this list.)

    From this point, the user can use the uploaded blob either according to the Module-based Model Runtime Interface (i.e. for experimentation) or by individually looking up functions (i.e. for autotuning).
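
To make item 4 a bit more concrete, here is a rough sketch of what the shim’s trampolines could look like. This is not existing code: the header name is made up, and only the two functions from the table above are shown.

    /* Hypothetical shim source; tvm_backend_link_table.h would hold the
     * struct TVMBackendLinkTable definition shown in item 4. */
    #include <stdint.h>
    #include "tvm_backend_link_table.h"

    /* Defined by the shim; the on-device C runtime rewrites this pointer during
     * upload_and_link to point at its own implementation table. */
    struct TVMBackendLinkTable* _tvm_backend_functions;

    /* Trampolines: the generated operator code links against these symbols as
     * usual, but each call is forwarded through the link table at runtime. */
    void* TVMBackendAllocWorkspace(int device_type, int device_id, uint64_t nbytes,
                                   int dtype_code_hint, int dtype_bits_hint) {
      return _tvm_backend_functions->TVMBackendAllocWorkspace(
          device_type, device_id, nbytes, dtype_code_hint, dtype_bits_hint);
    }

    int TVMBackendFreeWorkspace(int device_type, int device_id, void* ptr) {
      return _tvm_backend_functions->TVMBackendFreeWorkspace(device_type, device_id, ptr);
    }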
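
Similarly, here is a rough sketch of how the C runtime might service upload_and_link on the device. Error handling and the RPC framing are omitted, and HandleUploadAndLink, runtime_backend_table, and RegisterUploadedModule are illustrative names, not an existing API.

    #include <stdint.h>
    #include <string.h>
    #include "tvm_backend_link_table.h"  /* hypothetical header from the shim sketch */

    /* The runtime's own TVMBackend* implementations, defined elsewhere. */
    extern struct TVMBackendLinkTable runtime_backend_table;

    /* Adds a TVMFuncRegistry to the global module table; returns a module index. */
    extern int RegisterUploadedModule(const void* func_registry);

    int HandleUploadAndLink(uintptr_t start_addr, size_t code_size_bytes,
                            struct TVMBackendLinkTable** link_table_addr,
                            const void* func_registry, const uint8_t* code) {
      /* 1. Copy the uploaded code into the agreed-upon RAM block.
       *    NOTE: parts with an instruction cache would need cache maintenance here. */
      memcpy((void*)start_addr, code, code_size_bytes);

      /* 2. Point the shim's _tvm_backend_functions at the runtime's implementations. */
      *link_table_addr = &runtime_backend_table;

      /* 3. Register the module's TVMFuncRegistry in the global module table; the RPC
       *    layer wraps the returned index as a TVMModuleHandle for the host. */
      return RegisterUploadedModule(func_registry);
    }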

Finally, let’s address C3. In my mind, there are two aspects to C3: the physical memory block that holds the tensor, and the alignment of the tensor data relative to the cache line size. For the most part, the second aspect should not be a concern, because we are generally going to be tuning with large parameters. It could potentially impact the reproducibility of measurements with small tensors, e.g. kernels, though.

For the moment, let’s concern ourselves with the first aspect: we may still need to place parameters in flash for autotuning to match the production system. One easy optimization is to place one or two parameter tensors in flash along with the runtime, and provide a special PackedFunc akin to _lookup_linked_param (or perhaps we just reuse that one) to provide the autotuner with a DLTensor data handle. This approach should work across autotuning runs of a single kernel. We could consider generalizing it by computing all candidate input shapes up front, but this may be complex and perhaps unnecessary.
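
For illustration, the flash-resident parameter lookup could look something like the sketch below. The symbol names and the single hard-coded parameter are hypothetical; the real thing would presumably mirror _lookup_linked_param.

    #include <stddef.h>

    /* Hypothetical: a parameter tensor linked into flash alongside the runtime,
     * placed in .rodata by the linker script. */
    extern const float g_conv0_weight_data[];

    /* In the spirit of _lookup_linked_param: given a parameter id, return a data
     * pointer the RPC/autotuning layer can wrap as the `data` field of a DLTensor. */
    const void* LookupLinkedParam(int param_id) {
      switch (param_id) {
        case 0:
          return g_conv0_weight_data;
        default:
          return NULL;  /* unknown parameter: fall back to RAM allocation */
      }
    }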

I’d love to hear more thoughts or concerns from your side about an approach like this. We could also take a different route by pushing the RPC server onto the host, but it would be good to spell out the pros and cons of that approach more specifically.