[AOT] Module-based Model Runtime Interface for AOT

areusch · September 17, 2021, 11:01pm

@tqchen @junrushao @comaniac @jroesch @mousius @manupa-arm @csullivan @mbs-octoml @kparzysz

Summary

This RFC describes a Module-based Model Runtime interface for the Ahead-of-Time Executor, thereby enabling its use from the TVM C++ Runtime.

Motivation

The microTVM project has made significant progress towards an Ahead-of-Time Executor for compiled Relay models. At the time of writing, it’s now possible to codegen a TIR function which executes Relay models that have known shapes, don’t have graph-level control flow, and execute only on the CPU device. Right now, the C runtime is the only such runtime environment which can interact with this generated code. However, significant interest exists in enabling the C++ runtime to use the Ahead-of-Time executor.

Guide-level explanation

Users select the AOT executor at compile time through the traditional GraphExecutor compilation flow (e.g. tvm.relay.build) by including --executor=aot in the Target [1]. The return value of tvm.relay.build in this case is an AotExecutorFactory Module object. Users instantiate the AOT executor via AotExecutorFactory as they do with GraphExecutor:

ir_mod = tvm.parser.fromtext("""\
      #[version = "0.0.5"]
      def @main(%a : Tensor[(1, 2), uint8], %b : Tensor[(1, 2), uint8]) {
          %0 = %a + %b;
          %0
      }"""
    )

with tvm.transform.PassContext(opt_level=3):
  factory : AotExecutorFactory = tvm.relay.build(
       ir_mod, "llvm -executor=aot", module_name="my_mod")

aot_executor : AotExecutor = factory["my_mod"](tvm.cpu(0))

AotExecutor supports the traditional Module-Based Model Runtime Interface and can be used just as GraphExecutor is:

aot_executor.set_input("a", tvm.nd.array(np.array([[1, 2]], dtype="uint8")))
aot_executor.set_input("b", tvm.nd.array(np.array([[3, 5]], dtype="uint8")))
aot_executor.run()
output = aot_executor.get_output(0)
assert (output.asnumpy() == np.array([[4, 7]], dtype="uint8")).all()

[1] NOTE: The target string is not the final place this customization should be made. However, it’s been the place where we’ve been putting runtime-related stuff. A separate RFC will split the Target string into Target options (which affect tuning) and runtime options.

Reference-level explanation

Already committed to TVM is the AotExecutorCodegen. This module produces a TIR top-level function which invokes the Relay operators (implemented in TIR) in a correct order. An example is given below:

PrimFunc([input1, input2, output]) attrs={"global_symbol": "tvmgen_my_mod_run_model", "runner_function": (bool)1} {
  // attr [(nullptr)] device_id = 0
  // attr [(nullptr)] device_type = 1
  tir.tvm_call_packed("tvmgen_my_mod_fused_add", input1, input2, output)
}

The AotExecutor then needs to accomplish the following to meet the Module-based Model Runtime Interface:

1. Allocate input and output tensors as defined in the run_model function using the correct Device API.
2. Provide a mapping from Relay parameter name to positional argument.
3. Invoke the generated TIR function and provide profiling.
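The three responsibilities above can be sketched in plain Python. This is a toy model rather than TVM code: `ToyAotExecutor`, the Python `run_model` function, and the list-based buffers are hypothetical stand-ins for the real executor, the generated TIR function, and Device API allocations.

```python
from typing import Callable, Dict, List, Sequence


def run_model(a: List[int], b: List[int], out: List[int]) -> int:
    """Hypothetical stand-in for the generated run_model: elementwise add."""
    for i in range(len(out)):
        out[i] = a[i] + b[i]
    return 0


class ToyAotExecutor:
    """Toy executor illustrating allocation, name mapping, and invocation."""

    def __init__(self, func: Callable[..., int], input_names: Sequence[str],
                 num_outputs: int, size: int) -> None:
        self._func = func
        # Map each parameter name to its positional argument index.
        self._index: Dict[str, int] = {n: i for i, n in enumerate(input_names)}
        # Allocate input and output buffers once, up front.
        self._inputs = [[0] * size for _ in input_names]
        self._outputs = [[0] * size for _ in range(num_outputs)]

    def set_input(self, name: str, values: Sequence[int]) -> None:
        # Copy the caller's data into the pre-allocated buffer.
        self._inputs[self._index[name]][:] = values

    def run(self) -> None:
        # Invoke the generated function with inputs followed by outputs.
        rc = self._func(*self._inputs, *self._outputs)
        if rc != 0:
            raise RuntimeError(f"run_model failed with code {rc}")

    def get_output(self, index: int) -> List[int]:
        return self._outputs[index]


executor = ToyAotExecutor(run_model, ["a", "b"], num_outputs=1, size=2)
executor.set_input("a", [1, 2])
executor.set_input("b", [3, 5])
executor.run()
print(executor.get_output(0))  # [4, 7]
```

Everything the toy constructor takes as arguments (names, counts, sizes) is exactly the kind of information that has to reach the runtime as metadata.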

Compiler ↔ Runtime Metadata

In order to implement (1) and (2) above, additional metadata about the run_model function needs to be communicated from Compiler to Runtime:

The mapping between Relay parameter name and TIR argument position
The number of inputs and outputs
The type of each parameter
Information sufficient to choose a Device API to allocate memory for that data.

At present, Metadata is passed from Compiler to Runtime in several different ways:

Constant DLTensor can be bundled with code and supplied to runtime::Module via runtime::MetadataModule
Many non-DSO-exportable backends (cuda, hexagon, metal, opencl, sdaccel, rocm, vulkan) have adopted the convention of including a runtime::FunctionInfo (NOTE: distinct from tvm::relay::transform::FunctionInfo) in their serialization:
```
/*! \brief function information needed by device */
struct FunctionInfo {
  std::string name;
  std::vector<DLDataType> arg_types;
  std::vector<std::string> launch_param_tags;
};
```

AotExecutorCodegen and GraphExecutorCodegen have adopted the practice of producing the graph-level runtime::MetadataNode:

/*!
 * \brief Structure that can be optionally used by the executor codegen
 */
class MetadataNode : public Object {
 public:
  /*! \brief input information for the main function */
  Array<String> inputs;
  /*! \brief number of outputs of the main function */
  int num_outputs = 1;
  /*! \brief the executor to be used to run the model */
  String executor = kTvmExecutorGraph;

  String mod_name = "";
};

The recent AOTExecutor implementation has created tvm::relay::transform::FunctionInfo which communicates statistics about memory usage and I/O operation for each TIR operator and aggregate statistics for the top-level AOT function:

struct FunctionInfoNode : public Object {
  Map<Target, Integer> workspace_sizes;
  Map<Target, Integer> io_sizes;
  Map<Target, Integer> constant_sizes;
  Map<Target, tir::PrimFunc> tir_primfuncs;
  Map<Target, Function> relay_primfuncs;
};

Some duplication of information is already present. This is likely due in part to the existing middle-end compiler design, in which a separate IRModule is produced for each backend. Another factor may be that, since each runtime::Module is responsible for its own serialization and passing a Node across PackedFunc requires a cast, the lack of a centralized facility for runtime::Modules to obtain module-level Metadata has led backend authors to roll their own. This pattern makes it very difficult to assess the full scope of metadata handed to the runtime, particularly across all backends.

Work is currently ongoing to unify the pre-codegen IRModule into a single instance. After this work is completed, it will be much easier to produce a centralized module-level Metadata. This RFC argues for the expansion of runtime::MetadataNode in the following ways:

Rename runtime::MetadataModule to runtime::ConstLoaderModule to disambiguate the two and make its purpose in life clearer.
Expand input_args in the existing runtime::Metadata to parity with runtime::FunctionInfo, plus include _sizes from tvm::relay::transform::FunctionInfoNode and the required shape and dtype information from the beginning of this section.

Introduce ModelMetadataModule to contain this information for use with the C++ runtime.

class ModelMetadataModule {
  virtual PackedFunc GetFunction(const std::string& name, const ObjectPtr<Object>& sptr_to_self) {
    if (name == "get_model_metadata") {
      // Capture sptr_to_self so the module stays alive while the PackedFunc exists.
      return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) {
        *rv = ModelMetadata(metadata_);
      });
    } else {
      return PackedFunc();
    }
  }

  const struct ModelMetadata* metadata_;
};

Introduce an optional implementation for the C runtime.
Export runtime::Metadata to Model Library Format.

The newly proposed definition of runtime::Metadata is as follows. NOTE that this is a C definition because it will be made available to both the C and C++ runtimes. A C++ wrapper will be written.

struct ParameterInfo {
  const char* relay_name_hint;
  const char* tir_name_hint;
  int64_t* shape;
  int64_t ndim;
  DLDataType dtype;
  TargetDevice target_device;  // NOTE: future addition; not covered in this RFC.
};

struct FunctionInfo {
  const char* function_name;
  struct ParameterInfo* params;
  int num_inputs;
  int num_outputs;
  int64_t workspace_size_bytes;
  int64_t io_size_bytes;
  int64_t constant_size_bytes;
};

struct Metadata {
  int version;
  struct FunctionInfo* functions;
  const char* module_name;
};
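To make the layout concrete, here is a minimal sketch that mirrors the proposed structs with Python's `ctypes` and derives the name-to-position mapping from them. Several things here are assumptions rather than part of the RFC: `target_device` is left out (the RFC marks it as a future addition), `params` is assumed to hold inputs followed by outputs, the size fields are filled with placeholder zeros, and the helper `name_to_position` is hypothetical.

```python
import ctypes


class DLDataType(ctypes.Structure):
    # DLPack layout: type code, bit width, number of lanes.
    _fields_ = [("code", ctypes.c_uint8),
                ("bits", ctypes.c_uint8),
                ("lanes", ctypes.c_uint16)]


class ParameterInfo(ctypes.Structure):
    # target_device is omitted in this sketch.
    _fields_ = [("relay_name_hint", ctypes.c_char_p),
                ("tir_name_hint", ctypes.c_char_p),
                ("shape", ctypes.POINTER(ctypes.c_int64)),
                ("ndim", ctypes.c_int64),
                ("dtype", DLDataType)]


class FunctionInfo(ctypes.Structure):
    _fields_ = [("function_name", ctypes.c_char_p),
                ("params", ctypes.POINTER(ParameterInfo)),
                ("num_inputs", ctypes.c_int),
                ("num_outputs", ctypes.c_int),
                ("workspace_size_bytes", ctypes.c_int64),
                ("io_size_bytes", ctypes.c_int64),
                ("constant_size_bytes", ctypes.c_int64)]


class Metadata(ctypes.Structure):
    _fields_ = [("version", ctypes.c_int),
                ("functions", ctypes.POINTER(FunctionInfo)),
                ("module_name", ctypes.c_char_p)]


def name_to_position(fn: FunctionInfo) -> dict:
    """Assumes params lists inputs first, then outputs."""
    count = fn.num_inputs + fn.num_outputs
    return {fn.params[i].relay_name_hint.decode(): i for i in range(count)}


shape = (ctypes.c_int64 * 2)(1, 2)
uint8 = DLDataType(code=1, bits=8, lanes=1)  # kDLUInt, 8 bits, 1 lane
params = (ParameterInfo * 3)(
    ParameterInfo(b"a", b"input1", shape, 2, uint8),
    ParameterInfo(b"b", b"input2", shape, 2, uint8),
    ParameterInfo(b"output", b"output", shape, 2, uint8),
)
# Size fields are placeholder zeros in this sketch.
function = FunctionInfo(b"tvmgen_my_mod_run_model", params, 2, 1, 0, 0, 0)
metadata = Metadata(1, ctypes.pointer(function), b"my_mod")

index = name_to_position(metadata.functions[0])
print(index)  # {'a': 0, 'b': 1, 'output': 2}
```

Note that the structs as proposed carry no element counts for `functions` or `params`, so a consumer has to get those lengths from somewhere else; the sketch derives the parameter count from `num_inputs + num_outputs`.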

Internal workings of AotExecutor (`--runtime=c++ --interface-api=packed`)

Given the above, we can now sketch out the way AotExecutor should behave (for C++ runtime).

Module initialization will:

Load the ModelMetadata using get_model_metadata PackedFunc.
Allocate space for the parameters to tvmgen_<model_name>_run_model.
Lookup and load any linked parameters using the --link-params mechanism.

set_input, get_input, get_output all work as they do in GraphExecutor.
run assembles TVMArgs containing inputs + outputs and invokes tvmgen_<model_name>_run_model.
time_evaluator is implemented in the same way as it is in GraphExecutor. Timing run_model is done using the CPU timer.
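As an illustration of the CPU-timer approach, the sketch below times a callable with `time.perf_counter`. The `number` and `repeat` parameters are modeled loosely on GraphExecutor's time_evaluator; the function itself and the dummy workload are hypothetical and do not reproduce TVM's implementation.

```python
import time
from typing import Callable, List


def time_evaluator(func: Callable[[], None], number: int = 10,
                   repeat: int = 3) -> List[float]:
    """Return `repeat` measurements of mean seconds per call over `number` calls."""
    results = []
    for _ in range(repeat):
        start = time.perf_counter()
        for _ in range(number):
            func()
        elapsed = time.perf_counter() - start
        results.append(elapsed / number)
    return results


def dummy_run() -> None:
    # Stand-in for invoking run_model.
    sum(range(1000))


results = time_evaluator(dummy_run, number=5, repeat=3)
print(len(results))  # 3
```

A host-side timer like this only measures wall-clock time around the call, which is adequate while the executor is CPU-only.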

Internal workings of AotExecutor (`--runtime=c --interface-api=packed`)

The C runtime version works in a very similar way with C accessor functions for the ModelMetadata.

No AotExecutor implementation planned (`--runtime=c --interface-api=c`)

When --interface-api=c is present in the Target string, the run_model function no longer uses the PackedFunc interface and instead accepts the argument values directly as positional args:

TVM_DLL int32_t tvmgen_default_run_model(void* arg0, void* arg1, void* arg2) {
  void* input = arg0;
  void* input1 = arg1;
  void* output = arg2;
  (void)tvmgen_default_fused_multiply(input, input1, output);
  return 0;
}

Additional work is underway to wrap this in a firmware-friendly interface. A core design goal of this interface is to offload all memory management tasks to the calling code to facilitate integration with bare-metal embedded devices.

Therefore, it would go against the goals of the C interface to introduce a generic runtime wrapper compatible with PackedFunc calling convention. It may be possible to do so in the future, but it would be great to motivate such an implementation with rationale more related to the embedded runtime setting.
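The caller-owns-all-memory goal can be illustrated with a small Python analogue. This is only a sketch: the Python function below is a hypothetical stand-in for the generated C entry point, and `array` buffers stand in for memory the firmware would place itself.

```python
from array import array


def tvmgen_default_run_model(arg0: array, arg1: array, arg2: array) -> int:
    """Toy stand-in for the generated entry point: elementwise multiply."""
    for i in range(len(arg2)):
        arg2[i] = arg0[i] * arg1[i]
    return 0


# The calling code owns every buffer; the model function allocates nothing.
input0 = array("B", [1, 2])
input1 = array("B", [3, 5])
output = array("B", [0, 0])

status = tvmgen_default_run_model(input0, input1, output)
print(status, output.tolist())  # 0 [3, 10]
```

There is no executor object and no retained state here, which is why wrapping this interface in set_input/run/get_output would work against its purpose.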

Operator Calling Convention

TVM uses 3 internal calling conventions:

call_packed - the traditional calling convention used in the C++ runtime
call_cpacked - similar to call_packed, but TVM presumes a symbol is linked into the binary containing that function name (e.g. TVMBackendGetFuncFromEnv is not used to lookup the PackedFunc)
unpacked - used with microTVM to avoid overhead of PackedFunc calls in statically-linked binaries. See AOT optimisations for Embedded Targets RFC.

The AOT run_func can use a different calling convention externally (e.g. --interface-api) than that used internally with Implemented Operators (--unpacked-args). However, there are some circumstances under which not all choices can be used:

When targeting the C++ runtime: call_packed must be used when non-DSO-exportable modules exist; otherwise call_cpacked may be used. unpacked may not be used with AOT Executor as the interface has not settled.
When targeting the C runtime: any calling convention may be selected for either the interface API or the operator calling convention. However, when using --interface-api=c (e.g. unpacked run_func calling convention), you must also use the unpacked calling convention with Implemented Operators.
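The constraints above amount to a small decision table. The sketch below encodes only the rules stated in this section; the function name, argument names, and string values are hypothetical and are not TVM options.

```python
from typing import List

OPERATOR_CONVENTIONS = {"call_packed", "call_cpacked", "unpacked"}


def check_calling_conventions(runtime: str, interface_api: str, operator_cc: str,
                              has_non_dso_modules: bool = False) -> List[str]:
    """Return rule violations; an empty list means the combination is allowed."""
    if runtime not in {"c", "c++"}:
        raise ValueError(f"unknown runtime: {runtime}")
    if interface_api not in {"packed", "c"}:
        raise ValueError(f"unknown interface API: {interface_api}")
    if operator_cc not in OPERATOR_CONVENTIONS:
        raise ValueError(f"unknown operator calling convention: {operator_cc}")

    problems = []
    if runtime == "c++":
        if operator_cc == "unpacked":
            problems.append("unpacked operators are not supported with the C++ runtime")
        if has_non_dso_modules and operator_cc != "call_packed":
            problems.append("non-DSO-exportable modules require call_packed")
    else:
        if interface_api == "c" and operator_cc != "unpacked":
            problems.append("--interface-api=c requires unpacked operators")
    return problems


print(check_calling_conventions("c++", "packed", "call_cpacked"))  # []
print(check_calling_conventions("c", "c", "call_packed"))
```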

Drawbacks

Why should we not do this?

This requires quite a bit of rework of the Metadata-passing mechanism, with potential for breakage.
It also introduces yet another Executor to the runtime to maintain.
It may introduce additional constraints on the <C-runtime, C-interface> implementation, which may make it more difficult to make progress on microTVM.

Rationale and alternatives

Why is this design the best in the space of possible designs?
What other designs have been considered and what is the rationale for not choosing them?
What is the impact of not doing this?

This RFC doesn’t address the question of “why add an AOT executor?” The RFC which added it in the first place is a better location to look for rationale to motivate that. In general, not following through with this RFC would relegate the AOT executor to a C-runtime-only component. There is significant interest in AOT from C++ runtime users, and maintaining compatibility with both increases the chances that AOT executor will support all TVM runtime features.

The controversial questions this RFC addresses are as follows:

Should we maintain a unified approach to code-generating the AOT executor?

An alternative approach could introduce an additional e.g. aot_cpp_executor_codegen.cc and create a third pathway (in the Graph/AOT build flow). Doing this allows us to implement runtime-specific compiler primitives, which may simplify both pipelines. However, soon those pipelines will grow more complicated as features are added to leverage AOT, such as Unified Static Memory Planning. The burden of double-maintenance of those features outweighs the advantage of a simplified implementation. A unified pipeline also makes it easier for newcomers to understand the compiler.

Should we attempt to unify the Metadata?

Metadata could be left in the scattered form it is now. It may be that the implementation of this RFC prioritizes expansion of ModelMetadata over propagating it to the various non-DSO-exportable runtime::Module. Ultimately though, maintaining separate function-level metadata adds confusion and code bloat. It also makes it harder to reason about the compiler as a whole. For these reasons, this RFC advocates for centralizing the Metadata.

Prior art

There is no known prior art of a C++-runtime-compatible AOT implementation.

Unresolved questions

Who will we break if we unify Model metadata?
Will this play nicely with the VM compilation flow when it is unified?
How will TargetDevice come into play here?

Future possibilities

Not covered in this RFC, but particularly useful with the C++ runtime, is heterogeneous execution. In the present PoC (forthcoming), AotExecutor will CHECK-fail if a non-cpu device is given. A future implementation will annotate the parameters with one of:

A device_type — in which case mapping from device_type to tvm::Device will be done in the same way as the GraphExecutor
A target_device — in which case a new mapping will be defined

Aside from that, the larger unresolved bit which makes it difficult to add heterogeneous execution is:

How should AOT codegen invoke the Device API?

Before this question can be answered, some progress needs to be made on the C device API and we need to define TIR bindings.

manupa-arm · September 27, 2021, 6:53am

Hi @areusch ,

Thanks for taking on the expansion of AoT into the C++ runtime. It is indeed a much-needed expansion.

I have a few questions/concerns, as follows:

Q1. Regarding the execution interface

aot_executor.set_input("a", tvm.nd.array(np.ndarray([1, 2], dtype="uint8")))
aot_executor.set_input("b", tvm.nd.array(np.ndarray([3, 5], dtype="uint8")))
aot_exec.run()
output = aot_exec.get_output(0)

a) Would this be thread safe ?

b) Do we need stateful setting of inputs, run and outputs? As opposed to something like the following:

aot_executor : AotExecutor = factory["my_mod"](tvm.cpu(0)) -- (Here the init is also run that creates the runtime.Modules using SaveToBinary artifacts)
output = aot_executor.run(a, b)

c) Would you be able to produce a non-Python example of using the C++ runtime? (showcasing the deployment flow that could start with the .so)

Q2. Why would we mandate the executor needs to allocate input and output tensors ?

Allocate input and output tensors as defined in the run_model function using the correct Device API

I could appreciate the fact that some of the tensors might need to be able to be placed in the private memories of the device prior to execution of fused operators on specific devices. However, that could happen just before that PrimFunc is executed while we could pass in pointers for the tensors in the host_target. Therefore, I am not sure we would want to do this always – maybe could be optional ?

Q3. Why would we want to allocate space for params ?

Module initialization will:

Load the ModelMetadata using get_model_metadata PackedFunc.

Allocate space for the parameters to tvmgen_<model_name>_run_model.

Lookup and load any linked parameters using the --link-params mechanism.

set_input, get_input, get_output all work as they do in GraphExecutor.

run assembles TVMArgs containing inputs + outputs and invokes tvmgen_<model_name>_run_model.

time_evaluator is implemented in the same way as it is in GraphExecutor. Timing run_model is done using the CPU timer.

I think we should not be allocating space for params by default unless we have a good reason. A user-override might be an acceptable solution.

Q4. Why are we loading metadata in the runtime (it feels like it goes against the concept of AoT)?

Loading metadata as a runtime activity feels like we are conceptually going against the requirements of an “Ahead-of-Time” compilation flow. The metadata should be presented to a user to integrate the application rather than being used in a runtime flow. Therefore, I believe any generated metadata should not be accessed through a special packed function; however, it could be used between an entry_point (e.g. run()) and tvmgen_<model_name>_run_model.

I’ll do another pass regarding the usage of non-DSO-exportable runtime.Modules.

areusch · September 27, 2021, 5:09pm

hi @manupa-arm,

thanks for your reply! i’ll answer your questions below:

a) Would this be thread safe ?

No. The current GraphExecutor implementation is also not thread-safe. In general from TVM C++ runtime we presume only a single frontend to be using libtvm_runtime.so at a time. I suspect many things will break if this assumption is violated–this is why e.g. to parallelize the unit tests, we propose to use pytest-xdist, as it spawns subprocesses.

b) Do we need stateful setting of inputs, run and outputs ? (As opposed to something like follows :
aot_executor : AotExecutor = factory["my_mod"](tvm.cpu(0)) -- (Here the init is also run that creates the runtime.Modules using SaveToBinary artifacts)
output = aot_executor.run(a, b) 

AOTExecutor certainly does not require this; however, the Module-based Model Runtime Interface implies it through the semantics of set_input and get_output. In fact, run is internally implemented in Python-land to accept parameters (well, keyword args) like you gave them and then first call set_input.
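A minimal sketch of that layering, assuming nothing about TVM's actual wrapper: `ToyModule` is a hypothetical stateful module, and `run_once` shows how a single-call convenience can sit on top of set_input, run, and get_output.

```python
from typing import Dict, List


class ToyModule:
    """Hypothetical stateful module exposing set_input/run/get_output."""

    def __init__(self) -> None:
        self._inputs: Dict[str, List[int]] = {}
        self._outputs: List[List[int]] = [[]]

    def set_input(self, name: str, value: List[int]) -> None:
        self._inputs[name] = list(value)

    def run(self) -> None:
        a, b = self._inputs["a"], self._inputs["b"]
        self._outputs[0] = [x + y for x, y in zip(a, b)]

    def get_output(self, index: int) -> List[int]:
        return self._outputs[index]


def run_once(module: ToyModule, **inputs: List[int]) -> List[int]:
    """Single-call convenience built on the stateful interface."""
    for name, value in inputs.items():
        module.set_input(name, value)
    module.run()
    return module.get_output(0)


result = run_once(ToyModule(), a=[1, 2], b=[3, 5])
print(result)  # [4, 7]
```

The convenience call is stateless only from the caller's point of view; the module underneath still holds the inputs and outputs between the three steps.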

I believe there are runtime settings, particularly when using accelerators from a traditional OS such as Linux or Windows, in which allocation operations are unpredictably expensive (e.g. due to context-switching latency incurred in the malloc-like call). In these cases, it’s perceived to be better to do the allocation in advance, so that the steady-state inference latency is more predictable. cc @tqchen @junrushao if they have more background here.

Anyhow, this RFC doesn’t seek to propose a change to MBMR–it is the standard interface exported through the PackedFunc ABI from libtvm_runtime.so.

Q3. Why would we want to allocate space for params ?

Module initialization will:

Load the ModelMetadata using get_model_metadata PackedFunc.

Allocate space for the parameters to tvmgen_<model_name>_run_model.

Lookup and load any linked parameters using the --link-params mechanism.

set_input, get_input, get_output all work as they do in GraphExecutor.

run assembles TVMArgs containing inputs + outputs and invokes tvmgen_<model_name>_run_model.

time_evaluator is implemented in the same way as it is in GraphExecutor. Timing run_model is done using the CPU timer.

I think we should not be allocating space for params by default unless we have a good reason. A user-override might be an acceptable solution.

Thanks–this was a typo. I meant “model inputs.” In some cases, these could include parameters e.g. if parameters weren’t linked or if the old-style --link-params is somehow in use.

Q4. Why are we loading metadata in the runtime (it feels like goes against the concepts of AoT) ?

In general, the traditional execution flow in TVM uses metadata at runtime fairly extensively. As you said, it should be possible to push in a direction where all of the metadata required inside the run_model call is encoded in the generated code. We could broadly group runtime metadata uses into two buckets:

To help the user integrate the application code against the model. In the C runtime, such integration is always presumed to happen ahead-of-time. In the C++ runtime, this assumption isn’t common, and that’s why targeting micros was so much of a lift from the existing TVM codebase. This is also why the Artifact refactor is more obvious for the micro use case: the code-loading process is more exposed there, and such uses of metadata at runtime are more obvious. Metadatas 1 and 3 are used for this, as are the Graph JSON and the new metadata needed by AOTExecutor.
For use during inference. I’d say so far only metadata 2 above is used in this way. There is a case to be made for 1, but it’s more at module load time so I’d consider it to be separate.

Loading metadata as a runtime activity feels like we are conceptually going against the requirements of “Ahead-of-Time” compilation flow. The metadata should be presented to a user to integrate the application rather than being used in a runtime flow.

I agree with this entirely. I think that the metadata generated by the compiler should be considered a first-class participant in the Artifact code-loading process (which is designed to be compatible with both the existing C++ code-loading flow as well as be consumable in the C/microTVM land by a firmware engineer).

Therefore, I believe any generated metadata should not be used from a special packed function however could be used between an entry_point (e.g. run()) and tvmgen_<model_name>_run_model .

I think this is generally analogous to how metadata is being used today (the one case above excepted). I think there are also some other cases even in microTVM land where metadata can be useful at runtime. For instance, the ST port defines a fairly comprehensive amount of flash-based model metadata, mainly for use by the application. While I don’t think that the C runtime code-loading flow should require incorporating such metadata as e.g. a flash-based struct, I do think we should allow frameworks to provide it as such if needed.

manupa-arm · September 30, 2021, 7:06am

Hi @areusch ,

It would be great if we could outline which C++ runtime APIs the AoT executor would use and why they are not thread-safe. It would also be great to avoid FFI registration of SystemLib (which seems to be the magic entry point GraphExecutor uses); I think it is an artificial constraint that blocks this feature, especially in AoT, because we could cleanly define an entry point to the inference, unlike GraphExecutor.

I believe we could do better than GraphExecutor’s usage of C++ runtime APIs. As I noted before, this is a great expansion of AoT. Therefore, with this change, would we move towards a world where the GraphExecutor becomes more like an RPCExecutor, a standalone app that loads AoT artifacts? That would remove a separate compilation flow (GraphExecutorCodegen) from the core compiler and leave us with just two runtime flows.

cc : @jroesch

Are you saying that there would be a ‘run’ interface (possibly not stateful) that could simply take inputs and return outputs, in addition to MBMR?

Is the idea that in the application you call set_input well before run is called? If so, will set_input be a non-blocking call?

In our experience, AoT could also benefit from a simpler interface that has access to the full-featured C++ runtime APIs without necessarily dragging in the FFI registrations. It would be great to see the motivation behind using MBMR over a simpler interface, especially if both interfaces are not going to be accessible by the user.

I think it is fair to say these are not strictly requirements of MBMR and might also be requirements of a simple runtime API (just run).

Regarding (1), when you say the correct Device API, is it the host_target Device API ? Also, if the host target is going to execute the run, what about the alternative of application performing the allocation for inputs and outputs and passing in a pointer ?

Agree with 2) and 3).

This is indeed a good move!

areusch:

Introduce ModelMetadataModule to contain this information for use with the C++ runtime.

class ModelMetadataModule {
  virtual GetFunction(const std::string& name, ObjectPtr<Object>& sptr_to_self) {
    if (name == "get_model_metadata") {
       return PackedFunc([](TVMArgs args, TVMRetValue* rv) {
          *rv = ModelMetadata(metadata_);
       });
    } else {
      return PackedFunc();
    }
  }

  const struct ModelMetadata* metadata_;
};

Is the ModelMetadata going to include contents that would otherwise go to ModulePackImportsToC/LLVM ? – which are the contents that are generated via SaveToBinary(…).

Also is this the phase where non-DSO exportable runtime::Modules are created ?

areusch · September 30, 2021, 11:24pm

@manupa-arm thanks for your reply!

In general many of the TVM core APIs are not thread safe. Even the TIR implementation of the AOT run_func is not thread safe. Could you say more as to why you want to catalogue this now?

So here I do want to point out that although the C runtime is a parallel implementation of the C++ runtime (and this is no accident: it’s intended to support the TVM RPC layer), the MBMR implementation here is primarily intended to support the C++ runtime. In the C++ runtime, runtime.SystemLib is indeed one way to obtain a reference to a runtime::Module, but tvm.runtime.load_module() is the far more common way (this calls dlopen under the covers to recover a tvm::runtime::GraphExecutor instance from a .so produced by tvm.relay.build().export_library() in the C++ build flow). Right now, this mechanism is intended to provide an interface agnostic to the frontend language used with libtvm_runtime.so. I don’t disagree with you that a more efficient binding to the AOT entry point is possible to build. However, the benefits of using such higher-order languages (Python, Java, Go, Rust) often outweigh the efficiency losses, particularly when considering e.g. datacenter/microservice applications with more rapid deployment schedules and which may require libraries in those languages.

I do think that should the PackedFunc interface present too many inefficiencies, it is possible to either a) invoke the C interface being developed for microTVM, b) wrap said interface to be C++-friendly, or c) develop a similar C++ wrapping interface to run_func. However, initially I want to support the MBMR over PackedFunc to make it as easy as possible to use AOT; then we can focus on optimizing the speed.

Currently the Python wrapper provides a way to statefully set_params and run using a single Python-side call. While the frontend interface is built on top of MBMR and therefore a bit incidental to this RFC, it is a common interface known to the C++ user base.

I agree statefully setting inputs and outputs isn’t strictly necessary and is potentially inefficient, but I guess I don’t think it’s necessary to tackle this in this RFC.

Well, MBMR has set_input, run, and get_output, so that implies some statefulness and internal memory, no? How do you get run to work without it?

Great question. AOT MBMR is not required to maintain state for input tensors in target_host’s Device API (e.g. CPU). If the first layer was run on a GPU, it seems like the AOT MBMR should just, inside set_input, copy the input tensor into the GPU. Note that I am however dodging the answer here by a) implementing this only for the CPU Device API and b) not taking a position on the word “correct” here.

I don’t disagree with this–but for the short term purpose of exposing AOT to C++, I want to meet the existing API supported by VM and Graph Executors so that it’s easy for people to try out and work with. We do have set_input_zero_copy in GraphExecutor and there’s no reason we couldn’t support this in AOT.
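The difference between copying and zero-copy input setting can be shown with a toy example. This is only a sketch of the aliasing behaviour: `ToyInputs` is hypothetical, and GraphExecutor’s real set_input_zero_copy operates on tensor handles rather than a Python `bytearray`.

```python
class ToyInputs:
    """Hypothetical single-input holder contrasting copy and zero-copy."""

    def __init__(self, size: int) -> None:
        # Buffer owned by the executor, allocated up front.
        self._buf = bytearray(size)

    def set_input(self, data: bytes) -> None:
        # Copy the caller's bytes into the executor-owned buffer.
        self._buf[:] = data

    def set_input_zero_copy(self, data: bytearray) -> None:
        # Alias the caller's buffer instead of copying it.
        self._buf = data

    def first(self) -> int:
        return self._buf[0]


caller_buf = bytearray([1, 2])

copied = ToyInputs(2)
copied.set_input(caller_buf)

aliased = ToyInputs(2)
aliased.set_input_zero_copy(caller_buf)

caller_buf[0] = 9  # the caller mutates its buffer afterwards
print(copied.first(), aliased.first())  # 1 9
```

With the zero-copy path the application owns the allocation, which is close to the pass-a-pointer alternative raised earlier in the thread.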

manupa-arm:

areusch:
Introduce ModelMetadataModule to contain this information for use with the C++ runtime.
class ModelMetadataModule {
  virtual GetFunction(const std::string& name, ObjectPtr<Object>& sptr_to_self) {
    if (name == "get_model_metadata") {
       return PackedFunc([](TVMArgs args, TVMRetValue* rv) {
          *rv = ModelMetadata(metadata_);
       });
    } else {
      return PackedFunc();
    }
  }

  const struct ModelMetadata* metadata_;
};
Is the ModelMetadata going to include contents that would otherwise go to ModulePackImportsToC/LLVM ? – which are the contents that are generated via SaveToBinary(…).

Also great question. I haven’t quite settled this yet but I am now attempting to do a parallel implementation of CreateCSourceCrtMetadataModule and CreateLLVMCrtMetadataModule (but for the C++ runtime). Since TIR doesn’t support structs, we have to manually codegen this for now. I’ll come back with a better answer here in the tvm-rfcs PR.

Actually these are loaded when runtime.SystemLib or tvm.runtime.load_module is called.

areusch · October 6, 2021, 5:36am

areusch:

manupa-arm:

areusch:
Introduce ModelMetadataModule to contain this information for use with the C++ runtime.
class ModelMetadataModule {
  virtual GetFunction(const std::string& name, ObjectPtr<Object>& sptr_to_self) {
    if (name == "get_model_metadata") {
       return PackedFunc([](TVMArgs args, TVMRetValue* rv) {
          *rv = ModelMetadata(metadata_);
       });
    } else {
      return PackedFunc();
    }
  }

  const struct ModelMetadata* metadata_;
};

Is the ModelMetadata going to include contents that would otherwise go to ModulePackImportsToC/LLVM ? – which are the contents that are generated via SaveToBinary(…).
Also great question. I haven’t quite settled this yet but I am now attempting to do a parallel implementation of CreateCSourceCrtMetadataModule and CreateLLVMCrtMetadataModule (but for the C++ runtime). Since TIR doesn’t support structs, we have to manually codegen this for now. I’ll come back with a better answer here in the tvm-rfcs PR.

In prototyping this a bit more, I’ve realized that the following two goals are in tension:

The desire to consolidate all the metadata being passed from compiler to runtime
The desire to keep as much code and especially RAM bloat out of the C runtime in the deployment setting

These two come into tension because consolidating metadata creates a pattern which newcomers are likely to follow: they would add metadata to the central struct. This could then become a source of friction in code review. In reality, we only care about consolidating the metadata at the tail end of the compiler to reduce duplication. If, after doing this, we inline metadata into the TIR where it makes sense to, that doesn’t seem like a big deal.

Then I realized we had in fact already done this: the linked_params (that is, the void* data portion) is a sort of metadata. PR 7988 introduced a way to inline that metadata. So I would then propose the following:

We continue down the path of consolidating metadata as specified in this pre-RFC. This RFC doesn’t impact the C runtime because none of the metadata being consolidated is really used there.
We generalize the mechanism adopted for linked_params, potentially by introducing a pass which inlines metadata where possible right at the end of the compilation flow, perhaps as tir.Constant.
Should metadata be required in a location which can’t be inlined (e.g. BYOC which produces a runtime::Module requiring metadata, or the current CUDAModule implementation), we fall back to the approach specified in this pre-RFC, except that in the future we can implement get_model_metadata in TIR, using a condensed copy of the metadata struct which removes unnecessary metadata. Doing this optimally may require user-defined structs in TIR; the point here is mainly to ensure there is a path forward that fits with the rest of the compiler.

@manupa-arm what do you think about this approach?