Clarification on BYOC for new hardware and the device API

slai-nick · June 12, 2023, 1:33pm

I am currently working on integrating new hardware with TVM and I am confused by how the device API, BYOC and memory planning fit together. I don’t know exactly how I should proceed and would really appreciate some guidance.

The device API gives me a “dynamic runtime” feel, if that makes sense, while I was planning to generate more “static code”. I should mention that I am particularly interested in the AOT executor.

Here are my questions:

1. In what cases do I need to provide a device API for BYOC on new hardware?
2. If I need to provide a device API, how does it interact with the codegen? Do I need to generate calls to the device API? (I might as well just generate the code that the device API implements.)
3. What if we add USMP to the pipeline?

PS: I am developing on the unity branch, hence the tag. I can move the post to a more general topic if this is inappropriate.

tqchen · June 13, 2023, 1:23pm

Usually the device API is useful when you have a runtime function that allocates special device memory (e.g. CUDA memory) on the device. If you only want to do BYOC on existing memory, then the device API is not strictly necessary.

Under unity, BYOC allows you to translate part of the code to extern function calls (via call_dps_packed), which can then be hooked up to a TIR function that is more static.
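To make the destination-passing style (DPS) behind call_dps_packed more concrete, here is a minimal sketch in plain Python. It does not use TVM at all: the registry `EXTERN_FUNCS`, the name `"my_accel.add"` and the helper `call_dps` are hypothetical stand-ins, and the only point is the calling convention, where the caller allocates the output and the extern function writes into it.

```python
from typing import Callable, Dict, List

# Hypothetical registry standing in for a table of extern (packed) functions.
# This is a toy model, not TVM's actual registry.
EXTERN_FUNCS: Dict[str, Callable[..., None]] = {}


def register(name: str):
    """Register a function under a string name (illustrative only)."""
    def decorator(fn: Callable[..., None]) -> Callable[..., None]:
        EXTERN_FUNCS[name] = fn
        return fn
    return decorator


@register("my_accel.add")  # hypothetical name for an offloaded kernel
def accel_add(a: List[float], b: List[float], out: List[float]) -> None:
    # DPS: the callee fills a caller-provided output buffer and returns nothing.
    for i in range(len(out)):
        out[i] = a[i] + b[i]


def call_dps(name: str, inputs: List[List[float]], out_len: int) -> List[float]:
    # The caller owns the allocation, so it can decide statically where
    # the output lives; the extern function never allocates.
    out = [0.0] * out_len
    EXTERN_FUNCS[name](*inputs, out)
    return out


result = call_dps("my_accel.add", [[1.0, 2.0], [3.0, 4.0]], 2)
```

Because the output buffer is supplied by the caller, the allocation decision sits outside the offloaded function, which is what leaves room for a more static treatment of memory.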

slai-nick · June 14, 2023, 10:49am

What do you mean by doing BYOC on existing memory? The hardware resembles a GPU in terms of memory architecture: we have a storage that the host allocates in, and other memories closer to the different cores. The storage is the entry point for data on our chip and everything moves from there. There are 2 cores and each of them has very different functionality (a matrix multiplier and another core for other operations, both supporting tensor operations).

With AOT I was under the impression that it was possible to just codegen the necessary memory copies at the boundaries of the offloaded functions, plus any memory movement within these functions, and thus not need to rely on the device API.

I think some of my misunderstanding comes from the fact that I haven’t yet understood where the calls to the device API originate. Are they there because I generated calls to the device API with BYOC, or are these calls generated by TVM at other stages of the build?

Thank you for taking the time to answer.

tqchen · June 14, 2023, 12:58pm

In this case, there are two choices:

1. Try not to make the device memory visible to the host, and leverage host memory. That means the BYOC function always takes host memory as its starting point, copies to the device, and then copies back to host memory. This is likely related to what you mean by “codegen the necessary memory copies at the offloaded functions boundaries”. DeviceAPI is not needed in this case, as the memory is opaque to the TVM runtime itself.
2. As a next step, if we want to start reusing things like memory planning, or TensorIR codegen/fusion that leverages the existing memory, then DeviceAPI is needed (more specifically the functions in c_runtime_api.h), so that the specific runtime functions can be called. The DeviceAPI can be called from the runtime side, as well as from the codegen part (via code like TVMArrayAlloc). This also applies to the case where we codegen functions that need to take these device arrays as input (as a result, we need ways to allocate them).
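The difference between the two choices can be sketched with a toy model in plain Python. Nothing here is TVM code: `ToyDevice` and its methods are hypothetical stand-ins for an accelerator's storage, and the example only assumes that the device can allocate, free, copy and run one kernel. In the first variant each offloaded function starts from host memory and hides its device buffers; in the second the caller holds device handles, so an intermediate result can stay on the device and a buffer can be reused.

```python
from typing import Dict, List, Tuple


class ToyDevice:
    """Hypothetical accelerator storage; not a TVM class."""

    def __init__(self) -> None:
        self._mem: Dict[int, List[float]] = {}
        self._next_handle = 0
        self.copies = 0  # number of host<->device transfers

    def alloc(self, n: int) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self._mem[handle] = [0.0] * n
        return handle

    def free(self, handle: int) -> None:
        del self._mem[handle]

    def copy_to_device(self, host: List[float], handle: int) -> None:
        self._mem[handle][:] = host
        self.copies += 1

    def copy_to_host(self, handle: int) -> List[float]:
        self.copies += 1
        return list(self._mem[handle])

    def double(self, src: int, dst: int) -> None:
        # Stand-in for a kernel running on device memory.
        self._mem[dst][:] = [2.0 * x for x in self._mem[src]]


def offloaded_double(dev: ToyDevice, host_in: List[float]) -> List[float]:
    # Choice 1: host memory in, host memory out. Device buffers are
    # private to this function, so the caller needs no device API.
    src, dst = dev.alloc(len(host_in)), dev.alloc(len(host_in))
    dev.copy_to_device(host_in, src)
    dev.double(src, dst)
    out = dev.copy_to_host(dst)
    dev.free(src)
    dev.free(dst)
    return out


def run_choice_1(data: List[float]) -> Tuple[List[float], int]:
    dev = ToyDevice()
    out = offloaded_double(dev, offloaded_double(dev, data))
    return out, dev.copies


def run_choice_2(data: List[float]) -> Tuple[List[float], int]:
    # Choice 2: the caller manages device buffers itself, which is the
    # role a device API plays. The intermediate never leaves the device,
    # and buffer `a` is reused for the final result.
    dev = ToyDevice()
    a, b = dev.alloc(len(data)), dev.alloc(len(data))
    dev.copy_to_device(data, a)
    dev.double(a, b)
    dev.double(b, a)
    out = dev.copy_to_host(a)
    dev.free(a)
    dev.free(b)
    return out, dev.copies
```

In this toy setup, chaining two kernels costs four transfers under the first choice and two under the second. That gap is the kind of saving that makes exposing device memory through a device API worthwhile once memory planning or fusion is involved.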

slai-nick · June 15, 2023, 2:23pm

Thank you for your answer.

I studied the runtime sources in the meantime and was able to trace the calls to the device API, and it makes much more sense now. I no longer see much value in avoiding implementing it.