[AOT] Module-based Model Runtime Interface for AOT

Hi @manupa-arm,

Thanks for your reply! I'll answer your questions below:

a) Would this be thread-safe?

No. The current GraphExecutor implementation is also not thread-safe. In general, the TVM C++ runtime presumes that only a single frontend is using libtvm_runtime.so at a time. I suspect many things would break if this assumption were violated; this is why, for example, we propose to use pytest-xdist to parallelize the unit tests, as it spawns subprocesses.

b) Do we need stateful setting of inputs, run, and outputs? (As opposed to something like the following:

aot_executor : AotExecutor = factory["my_mod"](tvm.cpu(0)) -- (Here the init is also run that creates the runtime.Modules using SaveToBinary artifacts)
output = aot_executor.run(a, b) 

AOTExecutor certainly does not require this; however, the Module-based Model Runtime Interface implies it due to the semantics of set_input and get_output. In fact, run is internally implemented in Python-land to accept parameters (well, keyword args) as you gave them and then call set_input first.
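For illustration, here is a minimal Python sketch (a hypothetical mock, not the actual TVM implementation) of how a keyword-argument run() can be layered on top of stateful set_input:

```python
# Hypothetical sketch (not the actual TVM source) of how a Python-side
# run(**kwargs) can be implemented on top of stateful set_input, mirroring
# the Module-based Model Runtime Interface semantics.

class MockExecutor:
    """Stands in for AotExecutor/GraphExecutor; the real ones wrap PackedFuncs."""

    def __init__(self):
        self._inputs = {}

    def set_input(self, name, value):
        # In TVM this copies into a pre-allocated NDArray; here we just store.
        self._inputs[name] = value

    def run(self, **kwargs):
        # run() first forwards keyword args to set_input, then invokes the
        # underlying run function (modeled here as a trivial sum).
        for name, value in kwargs.items():
            self.set_input(name, value)
        return self._run_packed_func()

    def _run_packed_func(self):
        # Placeholder for tvmgen_<model_name>_run_model.
        return sum(self._inputs.values())


executor = MockExecutor()
print(executor.run(a=1, b=2))  # prints 3; set_input happens under the hood
```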

I believe there are runtime settings, particularly when using accelerators from a traditional OS such as Linux or Windows, in which allocation operations are unpredictably expensive (e.g. due to context-switching latency incurred in the malloc-like call). In these cases, it’s perceived to be better to do the allocation in advance, so that the steady-state inference latency is more predictable. cc @tqchen @junrushao if they have more background here.
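As a toy illustration of that trade-off (plain Python, not TVM code; the buffer size is arbitrary), moving allocation into construction keeps the steady-state inference path allocation-free:

```python
# Toy sketch of why up-front allocation makes steady-state latency more
# predictable: the malloc-like cost is paid once, not on every inference.

class AllocatingRunner:
    """Pays an allocation cost on every inference call."""
    def infer(self, nbytes):
        out = bytearray(nbytes)       # malloc-like call in the hot path
        return len(out)

class PreallocatedRunner:
    """Allocates once up front (analogous to set_input-time allocation)."""
    def __init__(self, nbytes):
        self._out = bytearray(nbytes)  # cost paid once, at init
    def infer(self):
        return len(self._out)          # hot path performs no allocation

runner = PreallocatedRunner(1 << 20)
assert runner.infer() == AllocatingRunner().infer(1 << 20)  # same result
```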

Anyhow, this RFC doesn’t seek to propose a change to MBMR–it is the standard interface exported through the PackedFunc ABI from libtvm_runtime.so.
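For context, the factory-lookup pattern that this interface standardizes can be mocked like so (a hypothetical Python sketch; in reality the factory is a runtime.Module whose functions are exported through the PackedFunc ABI from libtvm_runtime.so):

```python
# Hypothetical mock of the Module-based Model Runtime Interface factory
# pattern. The callable returned by __getitem__ stands in for a PackedFunc
# looked up on a runtime.Module.

class MockFactoryModule:
    def __getitem__(self, model_name):
        # Indexing by model name yields a PackedFunc-like callable that,
        # given a device, constructs an executor bound to that device.
        def create_executor(device):
            return {"model": model_name, "device": device}
        return create_executor

factory = MockFactoryModule()
executor = factory["my_mod"]("cpu(0)")  # mirrors factory["my_mod"](tvm.cpu(0))
assert executor["model"] == "my_mod"
```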

Q3. Why would we want to allocate space for params?

Module initialization will:

  1. Load the ModelMetadata using the get_model_metadata PackedFunc.
  2. Allocate space for the parameters to tvmgen_<model_name>_run_model.
  3. Look up and load any linked parameters using the --link-params mechanism.

set_input, get_input, and get_output all work as they do in GraphExecutor.

run assembles TVMArgs containing inputs + outputs and invokes tvmgen_<model_name>_run_model.

time_evaluator is implemented in the same way as it is in GraphExecutor. Timing run_model is done using the CPU timer.
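The lifecycle above can be sketched with a self-contained Python mock (get_model_metadata, the byte buffers, and run_model here are illustrative stand-ins, not the real ModelMetadata, NDArray allocation, or tvmgen_<model_name>_run_model):

```python
# Hypothetical mock of the executor lifecycle described above; metadata
# sizes, buffers, and run_model are illustrative stand-ins for the real
# TVM pieces.

def get_model_metadata():
    # Stand-in for the get_model_metadata PackedFunc: name -> size in bytes.
    return {"inputs": {"a": 4, "b": 4}, "outputs": {"out": 4}}

def run_model(args):
    # Stand-in for tvmgen_<model_name>_run_model; trailing args are outputs.
    # Here it just copies input "a" to the output.
    *inputs, output = args
    output[:] = inputs[0]

class MockAotExecutor:
    def __init__(self):
        # Module initialization: load metadata, then allocate I/O space.
        md = get_model_metadata()
        self._inputs = {k: bytearray(n) for k, n in md["inputs"].items()}
        self._outputs = {k: bytearray(n) for k, n in md["outputs"].items()}
        # (Linked-parameter lookup via --link-params would also happen here.)

    def set_input(self, name, data):
        self._inputs[name][:] = data

    def get_output(self, name):
        return bytes(self._outputs[name])

    def run(self):
        # Assemble "TVMArgs": inputs followed by outputs, then invoke.
        run_model(list(self._inputs.values()) + list(self._outputs.values()))

ex = MockAotExecutor()
ex.set_input("a", b"\x01\x02\x03\x04")
ex.run()
assert ex.get_output("out") == b"\x01\x02\x03\x04"
```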

I think we should not be allocating space for params by default unless we have a good reason. A user-override might be an acceptable solution.

Thanks–this was a typo. I meant “model inputs.” In some cases, these could include parameters e.g. if parameters weren’t linked or if the old-style --link-params is somehow in use.

Q4. Why are we loading metadata in the runtime (it feels like it goes against the concepts of AoT)?

In general, the traditional execution flow in TVM uses metadata fairly extensively at runtime. As you said, it should be possible to push in a direction where all of the metadata required inside the run_model call is encoded in the generated code. Broadly, we could group runtime metadata uses into two buckets:

  1. To help the user integrate the application code against the model. In the C runtime, such integration is always presumed to happen ahead-of-time. In the C++ runtime, this assumption isn’t made, and that’s why targeting micros was so much of a lift from the existing TVM codebase. This is also why the Artifact refactor is more obvious for the micro use case: the code-loading process is more exposed there, and such uses of metadata at runtime are more obvious. Metadata items 1 and 3 are used for this, as are the Graph JSON and the new metadata needed by AOTExecutor.
  2. For use during inference. I’d say so far only metadata item 2 above is used in this way. There is a case to be made for item 1, but that use is more at module load time, so I’d consider it separate.

Loading metadata as a runtime activity feels like we are conceptually going against the requirements of “Ahead-of-Time” compilation flow. The metadata should be presented to a user to integrate the application rather than being used in a runtime flow.

I agree with this entirely. I think that the metadata generated by the compiler should be considered a first-class participant in the Artifact code-loading process (which is designed both to be compatible with the existing C++ code-loading flow and to be consumable in C/microTVM land by a firmware engineer).

Therefore, I believe any generated metadata should not be used from a special packed function; however, it could be used between an entry point (e.g. run()) and tvmgen_<model_name>_run_model.

I think this is generally analogous to how metadata is used today (the one case above excepted). I think there are also some other cases, even in microTVM land, where metadata can be useful at runtime. For instance, the ST port defines a fairly comprehensive set of flash-based model metadata, mainly for use by the application. While I don’t think the C runtime code-loading flow should require incorporating such metadata as e.g. a flash-based struct, I do think we should allow frameworks to provide it as such if needed.