[pre-RFC] C Device API

Summary

I want to write an RFC to provide an API which can be used by the C runtime to abstract the variety of driver APIs for different platforms. This specifically caters to RTOS abstractions for embedded device drivers.

Motivation

When using an accelerator, such as the Arm® Ethos™-U, an Embedded Real-Time Operating System (RTOS) will provide a device abstraction to access the device resource. When using these abstractions, TVM needs to understand how to interact with a device for a given platform.

Take the common example of a UART interface (imagining the accelerator is communicated with via this interface). In Zephyr, this would look similar to:

#include <zephyr.h>
#include <device.h>
#include <drivers/uart.h>

// Look up the device by name, then transmit over it
struct device *uart_dev = device_get_binding("USART0");

char data[] = "Hello World!\r\n";
uart_tx(uart_dev, data, sizeof(data), 100);

Whereas in CMSIS, this would look more similar to:

#include "Driver_USART.h"

extern ARM_DRIVER_USART Driver_USART0;

ARM_DRIVER_USART* uart_dev = &Driver_USART0;
uart_dev->Initialize(NULL);

char data[] = "Hello World!\r\n";
uart_dev->Send(data, sizeof(data)/sizeof(data[0]));

This example shows the diversity of RTOS driver implementations and why a flexible abstraction is required to pass devices for micro targets.

Guide-level explanation

User App

A tvm_device_t is implemented for each RTOS or platform required, and the user includes whichever is appropriate for their application. Notably, to avoid dynamic allocation, the user must provide and initialise the tvm_device_t struct rather than having it created and set up for them by the API.

#include <tvm/runtime/device.h>
#include <tvm/platform/zephyr.h>

tvm_device_t accelerator; // Opaque type for accelerator device
TVMDeviceInit(&accelerator);

// Platform specific call
TVMDevicePlatformBind(&accelerator, ...platform specific parameters);

struct tvmgen_mynetwork_devices devices = {
    .accelerator = &accelerator
};

int32_t ret = tvmgen_mynetwork_run(
    ...,
    &devices
);

TVMDeviceDestroy(&accelerator);

Platform Structures

Users can take implementations from src/runtime/crt/platform and headers from include/runtime/crt/platform that map to their platform's device implementation. In the case of a bare-metal environment, this defaults to a void pointer as there's no information available:

typedef void* tvm_device_t;

For RTOS implementations, a structure can be created such as this simple Zephyr wrapper (include/runtime/crt/platform/zephyr.h):

#include <device.h>

typedef struct {
    struct device* dev;
} tvm_device_t;

This gives the OS maximum control over the resources required and provides the opportunity to craft code in whichever way is most idiomatic for that platform, for example if an additional locking mechanism is required:

#include <device.h>
#include <kernel.h>

typedef struct {
    struct device* dev;
    struct k_mutex lock;
} tvm_device_t;

Generic Device API

The majority of the device API calls should be added to c_backend_api.h:

int32_t TVMDeviceInit(tvm_device_t* tvm_dev);
int32_t TVMDeviceOpen(tvm_device_t* tvm_dev);
int32_t TVMDeviceClose(tvm_device_t* tvm_dev);
int32_t TVMDeviceDestroy(tvm_device_t* tvm_dev);

These can all be implemented using the user-opaque context tvm_device_t, enabling the majority of TVM code to be portable between RTOS implementations; importantly, this applies to calls made within operator functions (see below). c_backend_api.h can then include the relevant platform/<PLATFORM>.h file where appropriate using #ifdef; if this becomes too unruly it can be moved to a c_device_api.h or similar.
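As a minimal sketch of that selection (the TVM_CRT_PLATFORM_* guard macros here are hypothetical, not an existing convention):

// Hypothetical platform selection inside c_backend_api.h
#if defined(TVM_CRT_PLATFORM_ZEPHYR)
#include <tvm/runtime/crt/platform/zephyr.h>
#elif defined(TVM_CRT_PLATFORM_CMSIS)
#include <tvm/runtime/crt/platform/cmsis.h>
#else
typedef void* tvm_device_t;  // bare-metal default: opaque handle
#endif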

Platform Device API

To allow setting of platform specifics into the opaque struct, these should be defined in the platform header. Alongside the header, an additional file will provide implementations (src/runtime/crt/platform/zephyr.c):

// include/runtime/crt/platform/zephyr.h
int32_t TVMDevicePlatformBind(tvm_device_t* tvm_dev, struct device* zephyr_dev);

// src/runtime/crt/platform/zephyr.c
int32_t TVMDevicePlatformBind(tvm_device_t* tvm_dev, struct device* zephyr_dev) {
    tvm_dev->dev = zephyr_dev;
    return 0;
}

This simple wrapper enables type checking of these functions and defines a clear translation boundary between the underlying OS implementation and TVM.

Reference-level explanation

Entrypoint

The entrypoint API defined in Embedded C Runtime Interface is augmented with the devices structure, which contains implemented tvm_device_t structs for each device used by the network. These are re-cast to void * when entering the AOT main function so they can be passed without TIR understanding the struct types.

int32_t tvmgen_mynetwork_run(
    ...,
    struct tvmgen_mynetwork_devices* devices
) {
    return tvmgen_mynetwork_run_model(
        ...,
        devices->host,
        devices->accelerator
    );
}
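For reference, a sketch of the generated header pairing this entrypoint with its devices struct (member names follow the example above):

// Hypothetical excerpt from a generated tvmgen_mynetwork.h
struct tvmgen_mynetwork_devices {
    tvm_device_t* host;
    tvm_device_t* accelerator;
};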

Executor Function

Each operator is provided with a single device object, which is passed as the opaque void* resource_handle. The main function calls into the device API to set up and tear down resources before and after each operator call.

int32_t tvmgen_mynetwork_run_model(..., device0, device1) {
    TVMDeviceOpen(device0); // Could reserve or enable certain circuitry
    operator(device0);
    TVMDeviceClose(device0);

    TVMDeviceOpen(device1);
    operator(device1);
    TVMDeviceClose(device1);
}

Device API Functions

In the example of Zephyr, devices are already a first-class concept, so many of the functions will be no-ops; but should synchronisation be required, an example implementation could be:

#include <device.h>
#include <kernel.h>

typedef struct {
    struct device* dev;
    struct k_mutex lock;
} tvm_device_t;

int32_t TVMDeviceInit(tvm_device_t* tvm_dev) {
    k_mutex_init(&tvm_dev->lock);
    return 0;
}

// Platform-specific
int32_t TVMDevicePlatformBind(tvm_device_t* tvm_dev, struct device* zephyr_dev) {
    tvm_dev->dev = zephyr_dev;
    return 0;
}

int32_t TVMDeviceOpen(tvm_device_t* tvm_dev) {
    return k_mutex_lock(&tvm_dev->lock, K_FOREVER);
}

int32_t TVMDeviceClose(tvm_device_t* tvm_dev) {
    k_mutex_unlock(&tvm_dev->lock);
    return 0;
}

int32_t TVMDeviceDestroy(tvm_device_t* tvm_dev) {
    tvm_dev->dev = NULL;
    return 0;
}

Whereas for CMSIS, you can use the platform-specific function to encapsulate the API to our imaginary UART-accessed accelerator:

#include "Driver_USART.h"

typedef struct {
    void* dev;
} tvm_device_t;

int32_t TVMDeviceInit(tvm_device_t* tvm_dev) { return 0; }

// Platform-specific
int32_t TVMDevicePlatformBindUart(tvm_device_t* tvm_dev, ARM_DRIVER_USART* uart_dev) {
    uart_dev->Initialize(NULL);
    tvm_dev->dev = uart_dev;
    return 0;
}

int32_t TVMDeviceOpen(tvm_device_t* tvm_dev) { return 0; }
int32_t TVMDeviceClose(tvm_device_t* tvm_dev) { return 0; }
int32_t TVMDeviceDestroy(tvm_device_t* tvm_dev) { return 0; }

Operator Usage

Each operator would be expected to utilise one device structure and be passed that as the resource_handle parameter, making the assumption that each operator or variant of an operator is only bound to one device at a time. In the following example it can be seen how an accelerator's interface is implemented per platform to take this void pointer and call the platform-specific driver code.

// Operator takes opaque resource_handle
int32_t my_operator(..., void* resource_handle) {
    if (TVMMyAcceleratorInvoke(resource_handle, ...ins,outs,params...) != 0) {
        return -1;
    }
    return 0;
}

// Platform implementation casts the handle back to the platform device
int32_t TVMMyAcceleratorInvoke(void* resource_handle, ...ins,outs,params...) {
    tvm_device_t* tvm_dev = (tvm_device_t*)resource_handle;
    return my_accelerator_invoke(
        tvm_dev->dev,
        ...ins,outs,params...
    );
}

PrimFunc Resource Handle

A tir::Var is added to PrimFunc in include/tvm/tir/function.h which enables a PrimFunc to track and use the resource_handle parameter. This will be used by both unpacked and packed APIs to pass the resource down as a void * rather than packing it into a TVMValue.

When this is packed in the lowering phase, the resource_handle will be assumed to exist as the last argument after being provided by the executor code generation. The eventual Call returned in lower_tvm_builtin.cc contains the resource_handle by removing this final argument:

auto arg_count = op->args.size() - 1;
resource_handle = op->args[arg_count];

// ... packing using arg_count reduced by one

return Call(
    op->dtype,
    call_cpacked_lowered(), 
    {
        op->args[0],
        scope.stack_value,
        scope.stack_tcode,
        ConstInt32(arg_stack_begin),
        ConstInt32(arg_stack_begin + op->args.size() - 1),
        resource_handle
    }
);

Device Discovery

Initially, devices will be defined by Target name or external compiler name. This means if you mark an operator as needing an external woofles compiler it would result in a devices struct such as:

struct tvmgen_my_model_devices {
    tvm_device_t* woofles;
};

Which would be passed down to the relevant operators via the executor. This applies similarly to Target defined devices.
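As a sketch of that routing (the operator and function names here are illustrative):

// Hypothetical operator taking the device via its resource_handle argument
extern int32_t my_woofles_operator(void* resource_handle);

// Sketch of generated executor code routing the named device to the operator
int32_t tvmgen_my_model_run_model(void* woofles_device) {
    TVMDeviceOpen((tvm_device_t*)woofles_device);
    int32_t ret = my_woofles_operator(woofles_device);
    TVMDeviceClose((tvm_device_t*)woofles_device);
    return ret;
}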

Drawbacks

  • Current limitations with Target and external compilers mean that only one of each name can occur at once using this system; addressing this could equally be future work.
  • The initial assumption is that each operator will be mapped to a single device; this design choice means that fusion across devices will not be possible.

Rationale and alternatives

We could leverage more code generation to generate device structures. It is the author's belief that being able to write small self-contained platform implementations will be easier to understand for both users and developers of TVM.

Another route to take is to treat RTOSes as entirely separate from TVM, requiring them to fully configure resources before passing in the void*. This removes TVM's ability to add hooks for resource management, such as open and close, which could be used to enable or disable entire pieces of circuitry between operators.

Prior art

  • Uses the existing resource_handle in the TVM code which isn’t currently propagated
  • Extends the C Interface API to add support for devices
  • Resource management using open/close and init/destroy alongside opaque handles is a common pattern in C libraries

Unresolved questions

Future possibilities

This RFC aims to put in place the foundation of the Device API to start abstracting the various RTOS drivers. There are other flows that have been considered as extensions to this.

Memory Copies

Movement of memory between additional devices which may be unable to communicate directly could take the form of simply:

// Copy from/to
int32_t TVMDeviceCopyFrom(tvm_device_t* source, void* destination);
int32_t TVMDeviceCopyTo(void* source, tvm_device_t* destination);

And be integrated into the flow as follows:

TVMDeviceOpen(device1);
operator(..., device1) {
    // some work where device1 can read from memory directly
    // then the result is copied back
    TVMDeviceCopyFrom(device1, &buffer);
}
TVMDeviceClose(device1);

TVMDeviceOpen(device2);
operator(..., device2) {
    TVMDeviceCopyTo(&buffer, device2);
    // some work which only device2 can see
    TVMDeviceCopyFrom(device2, &output);
}
TVMDeviceClose(device2);

The additional operations here require further thought, but the Open/Close API wrapper demonstrated supports them as an extension. Moving some of these calls into the executor may also enable asynchronous memory copies from within TVM.


CC: @manupa-arm @grant-arm @areusch @stoa @MJKlaiber

@mousius Thanks for this important RFC. I’d like to approach this from the other way around: what parts of device configuration does it make sense for TVM to involve itself with, and which parts don’t necessitate any binding between TVM runtime and device control? In this framework you can almost think of TVM providing a set of “device callbacks” which it invokes when the user requests it to do something that might necessitate a change in device state.

In the Module-based Model Runtime Interface, there is a lifecycle roughly as follows:

        +---------------+
        | uninitialized |    <-----------------+
        +---------------+                      |
                ↓  (instantiate Executor)      |    (destruct Executor)
        +---------------+    ------------------+
   +--> |  initialized  |      <- device memories available for preloading; constants loaded
   |    +---------------+
   |            ↓  (start of run())
   |    +---------------+
   |    |   executing   |      <- device available to launch compute tasks with low latency
   |    +---------------+
   |            ↓  (end of run())
   +------------+

I can appreciate we aren’t necessarily keeping compatibility down to the letter with Module-based Model Runtime in microTVM. However, internally the compiler needs some model of the executor strategy to work with. I don’t think we’ve conceptually gone away from the MBMR model yet, and prefer to keep with this even if the specific APIs used on microcontrollers don’t exactly replicate the C++ executors.

With this in mind, it seems like we may like to have a callback for each transition in the graph above. I could see this as:

  • TVMDeviceOpen – to match instantiating the executor. Contract is that the device memory becomes available for use.
  • TVMDeviceActivate – to match starting the run function. Contract is that the device exits any low power state which may impact inference latency.
  • TVMDeviceDeactivate – to match ending the run function. Contract is that the device may re-enter any low-power state left in Activate, but must maintain device memory state.
  • TVMDeviceClose – to match destructing the executor. Contract is that the device may be released for others to use.
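As a rough sketch of where these callbacks would land in an executor's lifetime (using the names proposed above, which are not an agreed API):

TVMDeviceOpen(&dev);       // instantiate executor: device memory becomes available
TVMDeviceActivate(&dev);   // start of run(): exit any low-power state

/* ... launch compute tasks ... */

TVMDeviceDeactivate(&dev); // end of run(): may re-enter low power, memory retained
TVMDeviceClose(&dev);      // destruct executor: device released for others to use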

Now these look suspiciously similar to your Open/Close and Init/Destroy–so forgive me if I’ve written a bunch of text only to agree with you. I’m not attached to the names I’ve used; but let’s make sure we write down the contracts for these functions. I think your function signatures look fine to me.

Type of tvm_device_t

I like the idea of making this platform-specific, but I wonder if there will be device-specific state that may be unnecessarily replicated across multiple accelerators (e.g. tvm_device_t is sort of forced to be a union struct if it is only platform-specific and not device-specific). Should we further narrow this to e.g. tvm_device_woofles_t?

Device API functions

It would be best to assume we’ll need to implement the full C++ Device API even if most of the functions are no-ops.

Follow-ups

When this is packed in the lowering phase, the resource_handle will be assumed to exist as the last argument after being provided by the executor code generation. The eventual Call returned in lower_tvm_builtin.cc contains the resource_handle by removing this final argument:

Is this specific to the AOT main function’s TIR? It seems like it may be hard to verify that a TIR Call node has resource_handle included correctly with the args. Should we track resource_handle separately from the ins and outs? (I realize this may have been the subject of another PR which I pushed back on–so now that I have context we could probably reconsider).

Initially, devices will be defined by Target name or external compiler name. This means if you mark an operator as needing an external woofles compiler it would result in a devices struct such as:

It would be great to note that this applies to the Target string.

Finally, it would be great to spell out the full Device API somewhere so it’s clear the full extent of this proposal.

I’m glad this does align with the outline for the Module-based Model Runtime, hypothetically we should be able to wrap the concepts in the C API with the C++ API rather than have both.

I think we’re arriving at similar hooks with slightly different use cases; this is one of the reasons I used slightly more generic language rather than Activate and Deactivate. It’s likely that Open may actually just lock a resource and Close unlocks it, so that we can run multiple threads at the same time without causing collisions in the driver (alongside providing memory for copies or enabling circuitry).

One thing I’m conscious of is that I’d rather keep the majority of the API as opaque to the user as possible to reduce the complexity when choosing which function to call. In that sense I’d rather have the non-device-specific signatures of:

TVMDeviceInit(tvm_device_t* woof);
TVMDeviceOpen(tvm_device_t* woof);
TVMDeviceClose(tvm_device_t* woof);
TVMDeviceDestroy(tvm_device_t* woof);

Rather than implementing full life cycles for each device as a collection:

TVMDeviceWooflesInit(tvm_device_woofles_t* woof);
TVMDeviceWooflesOpen(tvm_device_woofles_t* woof);
TVMDeviceWooflesClose(tvm_device_woofles_t* woof);
TVMDeviceWooflesDestroy(tvm_device_woofles_t* woof);

My suggestion is therefore to allow whatever platform abstraction makes sense, i.e. in the case of Zephyr a tvm_device_t can be constructed as:

typedef int (*device_open)(struct device *dev);
typedef struct {
    struct device* dev;
    struct k_mutex lock;
    device_open open;
} tvm_device_t;

Wherein the actual device driver can configure default hooks and bind an appropriate open call when the user calls TVMDeviceWooflesBind(tvm_device_t* dev, struct device* actual_dev); as a platform-specific call. This may be different for a C++ RTOS, which would instead store a class Device in the tvm_device_t as a void* or similar, with the TVMDeviceOpen layer calling into that class.
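As a sketch, the generic entrypoint could then dispatch through whichever hook the platform bound (assuming the struct above, with a no-op fallback when no hook is configured):

int32_t TVMDeviceOpen(tvm_device_t* tvm_dev) {
    // Defer to the platform-bound hook if one was configured
    if (tvm_dev->open != NULL) {
        return tvm_dev->open(tvm_dev->dev);
    }
    return 0;
}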

I have mixed feelings about listing out every C++ Device API call here and designing everything up front, I can see that we’ll eventually get there but I also think there’s a small piece we can put in place first and see if our assumptions hold. This RFC is more about designing the basic signatures for the C calls and we can grow it with additional RFCs as we introduce additional requirements.

This would apply to all operators, so they can use the resource_handle of the device specific to that call of the operator. This would mean you’d have the signature for a packed function:

typedef int (*TVMBackendPackedCFunc)(TVMValue* args, int* type_codes, int num_args,
                                     TVMValue* out_ret_value, int* out_ret_tcode,
                                     void* resource_handle);

Which would initially be represented by an intrinsic similar to:

tir.Call(tvm::tir::builtin::tvm_call_packed(), { "operator_woofles", input, output, resource_handle });

That when lowered would pluck that resource_handle and use it as the last argument, similar to:

tir.Call(tvm::tir::builtin::tvm_call_packed_lowered(), { "operator_woofles", stack_value, stack_tcode, stack_begin, stack_end, resource_handle });

Which would result in the correct call to that operator with the correct resource_handle. This can be seen in a PR for the unpacked API (Pass resource_handle to operators with unpacked API by Mousius · Pull Request #8452 · apache/tvm · GitHub), which demonstrates the addition of the resource_handle to the PrimFunc and plucking the last argument as the resource_handle.
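For illustration, a sketch of an operator implemented against the TVMBackendPackedCFunc signature above, reading its device from resource_handle (the operator name and the unpacking step are hypothetical):

#include <tvm/runtime/c_backend_api.h>

// Hypothetical generated operator; the device bound to this call arrives as
// resource_handle rather than being packed into args
int32_t operator_woofles(TVMValue* args, int* type_codes, int num_args,
                         TVMValue* out_ret_value, int* out_ret_tcode,
                         void* resource_handle) {
    tvm_device_t* tvm_dev = (tvm_device_t*)resource_handle;
    // ... unpack input/output tensors from args and invoke the device ...
    return 0;
}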

hi @mousius,

One thing I’m conscious of is that I’d rather keep the majority of the API as opaque to the user as possible to reduce the complexity when choosing which function to call

What do you mean exactly? I think it should be very clear what an implementer of these functions should do. My concern with the generic TVMDeviceInit is that it presumes the existence of a director function which routes the call to the correct device. When you are writing code that runs on top of a complex RTOS, you are likely to have that. When you aren’t, this introduces an additional step to the application developer: defining the Device API for the platform. That is to say, before this change, a user would need to do the following to integrate a model with firmware:

  1. Run tvmc compile path/to/model.tflite
  2. Take the resulting Model Library Format tar, unpack and copy into project
  3. Allocate memory for the computation.
  4. Instantiate executor, fill the input tensor, invoke the executor.

If a single function is used for all accelerated devices, then the user needs to accomplish an additional step: write a “director function” for each function in the device API like so:

int32_t TVMDeviceInit(void* opaque_pointer) {
  if (opaque_pointer == &global_accelerator_context) {
    accelerator_init(&global_accelerator_context);
  } else if (opaque_pointer == &cpu1_context) {
    init_cpu1_thread_stack();
  } else {
    CHECK(FALSE);
  }
  return 0;
}

I was sort of expecting accelerator vendors to supply some implementation of the Device API which could be incorporated into firmware projects to bridge between TVM and the accelerator. With the director function, the user sort of gets in the middle between TVM and the library. I’m not sure I see an advantage of that (but happy to be argued otherwise). It seems like it burdens the user a bit more than we’d like.

The downside of the approach I just proposed is that then we are creating quite a few separate dispatching functions. However, in practice that’s the same thing the Device API does–just under a cleaner C++ class namespace–and I also think that it’s unlikely a deployment would have more than 2 or 3 devices (and if so, could likely afford the flash to include them). What do you think?

Finally, if we do stick with the one-Device-API-function-per-platform approach, I think we should prefix the functions invoked from TVM with TVMPlatform to place them under that namespace.

This would apply to all operators, so they can use the resource_handle of the device specific to that call of the operator. This would mean you’d have the signature for a packed function:

My one concern with this is that we currently have sort of reserved the use of the resource_handle in the C++ runtime for capturing a this pointer. The device object is like this–it’s essentially a context pointer used in implementing the device API. However, at present resource_handle is essentially expected to be nullptr, and only the class of PackedFunc encompassing the BYOC functions on the C runtime has a clear definition of resource_handle under this model. So I’m okay with treating the tvm_device_t* as resource_handle for those functions, but I don’t want to make it the convention for all TVM-generated code. Perhaps we should look up kCompiler and use this attribute when generating the top-level AOT function (and in GraphExecutor) to decide how to pass resource_handle?

I have mixed feelings about listing out every C++ Device API call here and designing everything up front, I can see that we’ll eventually get there but I also think there’s a small piece we can put in place first and see if our assumptions hold. This RFC is more about designing the basic signatures for the C calls and we can grow it with additional RFCs as we introduce additional requirements.

Okay that’s understandable, but can we at least list out all of the functions used in the C AOT and Graph executors plus those used from c_backend_api.h? Those are the core functionality we will need to implement with this proposal.

Hi @areusch, it’s worth noting that my reply here predates the conversation we had in the microTVM meetup, I’ve understood your concerns and agree to adopt the mangled symbol implementation - I don’t want to tread that ground again unnecessarily but I will reply to help provide clarity where I can :smile_cat:

The concern is more about having to produce name mangled versions of each function rather than a fixed stable entry API, but I think we agreed we can work around it if we fix some of the issues we have with name mangling as is (doesn’t match the C style and is re-implemented in both C and Python)?

This was illustrated above as functioning with pointer tables, which we agreed weren’t portable enough to use for implementing this; they just introduced the same basic signatures within a pointer table rather than directly, to avoid having to use direct symbol names.

It’s worth noting this will be extended by https://github.com/apache/tvm-rfcs/pull/10 to mean that there’ll be official Targets that can be marked for this to occur as well. I’d prefer to use the Target registry to enable/disable the generation and allow the inspection from the executor, potentially something similar to:

.set_attr<Bool>("device_api", True)

Which would allow some customisability. I think this will have to be implemented this way to ensure other BYOC targets don’t get calls generated unnecessarily? As the functions are now annotated by Target in LowerTE, we should eventually get a way to pull this directly from a Target at the executor level?

Primarily, I think the focus here is providing a way for an executor to pass something to the operator, whereas with the current implementations they can’t. The defined behaviour here is only scoped to this RFC and Targets (BYOC or otherwise) which have support for this.

The scope of this proposal is to introduce the framework and some of the hooks necessary to lay down the foundations which can be used for future iterations of the C Device API. I’d suggest not trying to bundle every Device API interaction into a single RFC but considering each case in isolation; in this case we’re looking at activation and resource acquisition hooks which are reasonably simple to define alongside the infrastructure to generate them. I don’t think that there’s anything here that would block that future expansion?

Given this seems to be moving in a positive direction I’ve raised a PR against tvm-rfcs, let’s move the conversation there: https://github.com/apache/tvm-rfcs/pull/31 :smile_cat:

The concern is more about having to produce name mangled versions of each function rather than a fixed stable entry API, but I think we agreed we can work around it if we fix some of the issues we have with name mangling as is (doesn’t match the C style and is re-implemented in both C and Python)?

For the Device API, are you saying we’d need to mangle the compiler name? e.g. someone could define a compiler named “ARM (R) Ethos-U” which would need to become TVMARM__R__Ethos_UDeviceInit()? I can see an argument for doing this mangling, though I do think we could also just require that kCompiler be a piece of a valid C symbol identifier.
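For illustration, a sketch of such a mangling rule (a hypothetical helper, not existing TVM code) that reproduces the example above:

#include <ctype.h>

// Replace anything that isn't a valid C identifier character with '_';
// "ARM (R) Ethos-U" becomes "ARM__R__Ethos_U"
void MangleCompilerName(const char* name, char* out) {
    for (; *name != '\0'; ++name) {
        *out++ = isalnum((unsigned char)*name) ? *name : '_';
    }
    *out = '\0';
}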

It’s worth noting this will be extended by https://github.com/apache/tvm-rfcs/pull/10 to mean that there’ll be official Targets that can be marked for this to occur as well. I’d prefer to use the Target registry to enable/disable the generation and allow the inspection from the executor, potentially something similar to:

.set_attr<Bool>("device_api", True)

Marking this as a property of the Target or Compiler using that infrastructure makes sense to me. However, I want to consider what properties we may apply to “builtin” TVM targets (e.g. CUDA, Hexagon, etc). In the C++ runtime, these targets already have a globally-registered instance of the Device API. In TIR, we need to ensure we model things properly so that codegen can emit a call to either the C Device API or C++ Device API. I think in general I’d suggest two things:

  1. Let’s pick a better property name than "device_api" since there is already that DeviceAPI above
  2. If a Target doesn’t have a C device API and we are targeting the C runtime, can we proceed with codegen? Do we just presume to use the same device API (e.g. memcpy for CopyFromTo) as in llvm? Should that be a formal default implementation of Device API, or modeled differently in TIR? I sort of think it should be a formal default implementation (a minimal sketch follows).
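A minimal sketch of what such a formal default could look like, assuming hypothetical TVMDeviceDefault* names (memcpy for CopyFromTo, no-ops for the lifecycle hooks):

#include <string.h>

// Hypothetical default C Device API for targets without their own:
// host-memory copies plus no-op lifecycle hooks
int32_t TVMDeviceDefaultCopyFromTo(const void* src, void* dst, size_t num_bytes) {
    memcpy(dst, src, num_bytes);
    return 0;
}

int32_t TVMDeviceDefaultOpen(tvm_device_t* tvm_dev) { return 0; }
int32_t TVMDeviceDefaultClose(tvm_device_t* tvm_dev) { return 0; }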

Primarily, I think the focus here is providing a way for an executor to pass something to the operator, whereas with the current implementations they can’t. The defined behaviour here is only scoped to this RFC and Targets (BYOC or otherwise) which have support for this.

I agree; I just want to make sure we have a way to account for the C++ DeviceAPI already having a this pointer and thereby not needing anything passed in.