In collaboration with @tqchen
See also: PoC
Overview
In RAM-limited deployment scenarios (e.g. µTVM), it's desirable to place as much constant data as possible in a separate binary section and use it directly from that section. To that end, this RFC proposes a way for TVM to include pre-linked parameters in the generated `runtime::Module`.
Depending on the target and available codegen, the solution to this problem could be quite expansive. For example, some architectures could benefit from a specific way of encoding parameters, while others may prefer to encode parameters for consumption by specific hardware accelerators. This RFC doesn't aim to preclude future work in those directions, but in the interest of forward progress, we constrain our goal to simply removing the need for `GraphRuntime` to allocate RAM for the parameter tensors used by the `tvm.cpu()` context. Only the `c` and `llvm` codegens are considered here. At the end, some future directions are discussed.
Challenges
There are several challenges to be solved here:
C1. Indicating to the Relay compiler that the user wants to enable this feature.
C2. Passing the set of parameters from `GraphRuntimeCodegen` to the target-specific codegen.
C3. Loading linked parameters at runtime.
We start from the end and work backwards.
C3. Loading Linked Parameters at Runtime
Parameters can be stored either separately or as a single binary blob. The following storage schemes were considered:
S1. The `data` field of each parameter's `DLTensor` is stored as a symbol named `__tvm_param__pN`, where `pN` corresponds to the parameter's name after passing through `GraphRuntimeCodegen`.
S2. Similar to S1, but also include the `DLTensor` itself.
S3. Place parameters in the module metadata.
S3 is most compatible with the existing codegen, but it has these disadvantages:
- Since parameters are encoded as a single metadata blob, traditional binary size analysis (e.g. objdump, nm) will just report the size of the metadata blob instead of the size of each parameter.
- Parameters can’t be pinned in memory or assigned to specific sections (unless the entire metadata blob fits in the desired section).
- At runtime, parameter pointers will be initially encoded as offsets into the metadata blob, requiring knowledge of the metadata layout at debug time.
S2 is the easiest to reason about logically (a `DLTensor` is a concept that users are likely to understand). However, it would require encoding the `DLTensor` struct layout into each codegen, which could become hard to maintain. It's also overkill, since `DLTensor` metadata is already stored in the JSON graph given to the `GraphRuntime` and is also sent over RPC.
S1 provides the benefit of linked parameters without much overhead.
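As a rough sketch of S1 (assuming two parameters named `p0` and `p1`; sizes and contents below are placeholders), the generated module would carry one named symbol per parameter, so size-analysis tools such as nm and objdump can report each parameter individually:

```c
#include <stdint.h>

/* Hypothetical per-parameter symbols under S1. Each holds the raw bytes that
 * back the corresponding DLTensor's data field; the exact linkage chosen by
 * the codegen may differ from what is shown here. */
static const uint8_t __tvm_param__p0[4096] = {0x3f, 0x80, 0x00, 0x00 /* ... */};
static const uint8_t __tvm_param__p1[256] = {0x00 /* ... */};
```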
Schemes S1 and S2 don't specify how parameters are looked up at runtime. We now consider this problem. At run time, `GraphRuntime` knows the string `name` and integer `storage_id` of each parameter. Either of these can be used to identify the tensor to be loaded (in some cases, `GraphRuntime` reuses `storage_id` between tensors, but it does not do this for parameters). The linked-parameter load process can then be thought of as a function that accepts this identifier and returns a `DLTensor*` or `NDArray` (for the C or C++ runtime, respectively) whose `data` field points to the pre-loaded parameter array.
This function could be implemented in a few different ways:
F1. Each model runtime could accept a standard data structure mapping `storage_id` to `void* data`.
F2. Each model runtime could invoke a function in the TVM system runtime (i.e. CRT or C++ runtime) to do the same lookup as in F1.
F3. Each generated module could expose a standard function `__lookup_linked_param`.
F4. Each system runtime could load parameters given a standard data structure mapping model name and parameter string name to `void*`, and then invoke `SetParam` on the model runtime.
F4 is difficult to implement because the model-name and parameter-name lookup is more complex and more expensive, and the API to set parameters (e.g. `TVMSetModelParameters(Module* m, const char* model_name, void* param_mapping)`) is harder for the user to invoke. It's also difficult to make automatic, because the TVM runtime has limited knowledge of when a new model-specific TVM Module is instantiated.
F2 suffers from a similar complexity problem (needing to key on both `storage_id` and `model_name`).
F1 is simple, but the data structure is not as easy to generate as it might seem: `storage_id` is not contiguous over the set of parameters, so the best implementation is a list of pairs, which is awkward to work with and slow (sketched below). Additionally, user code would need to separately keep track of this list and provide it to the model runtime to load parameters.
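To illustrate the awkwardness, here is a minimal sketch of the kind of structure F1 implies; the type and symbol names are hypothetical, not existing TVM definitions:

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder parameter data (stands in for linked __tvm_param__ symbols). */
static const uint8_t param_p0[4] = {0};
static const uint8_t param_p1[4] = {0};

/* Hypothetical F1 mapping entry; not an existing TVM type. */
typedef struct {
  int32_t storage_id;  /* storage_id assigned by GraphRuntimeCodegen */
  const void* data;    /* address of the linked parameter data */
} LinkedParamEntry;

/* storage_ids are sparse over the parameters, so the "map" degenerates to a
 * list of pairs that user code must track and hand to the model runtime... */
static const LinkedParamEntry kLinkedParams[] = {
    {3, param_p0},
    {7, param_p1},
};

/* ...and every lookup is a linear scan. */
static const void* LookupLinkedParam(int32_t storage_id) {
  for (size_t i = 0; i < sizeof(kLinkedParams) / sizeof(kLinkedParams[0]); ++i) {
    if (kLinkedParams[i].storage_id == storage_id) return kLinkedParams[i].data;
  }
  return NULL;
}
```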
F3 is the best compromise—while no data-driven map exists, it offloads the lookup speed optimization onto the compiler via a switch statement. It also provides hardware-accelerated loaders a chance to execute any initialization code needed at parameter load time, such as waiting for backgrounded DMA transfers or decompression/decryption to complete. While this RFC doesn’t consider heterogeneous execution contexts, this choice doesn’t preclude their use at a later time.
In summary, the `llvm` and `c` codegens will generate an additional PackedFunc, `__lookup_linked_param`, in the generated `runtime::Module`. It accepts a unique integer id identifying the parameter and returns a `void*` which should populate the `DLTensor` `data` member for that parameter.
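For concreteness, here is a hedged sketch of how a caller could exercise that contract through the C runtime API (`tvm/runtime/c_runtime_api.h`); the wrapper name is made up, error handling is minimal, and the type codes used by the generated PackedFunc are an assumption:

```c
#include <stdint.h>
#include <tvm/runtime/c_runtime_api.h>

/* Hypothetical helper: resolve one linked parameter's data pointer by calling
 * the generated module's __lookup_linked_param PackedFunc. */
void* LookupLinkedParamData(TVMModuleHandle mod, int64_t param_id) {
  TVMFunctionHandle lookup = NULL;
  if (TVMModGetFunction(mod, "__lookup_linked_param", /* query_imports= */ 1, &lookup) != 0) {
    return NULL;  /* module was built without linked parameters */
  }

  TVMValue arg;
  TVMValue ret;
  int arg_tcode = kTVMArgInt;  /* the parameter id is passed as an integer */
  int ret_tcode = kTVMNullptr;
  arg.v_int64 = param_id;
  if (TVMFuncCall(lookup, &arg, &arg_tcode, 1, &ret, &ret_tcode) != 0) {
    return NULL;
  }
  /* A non-NULL handle is used to populate the parameter's DLTensor data field. */
  return (ret_tcode == kTVMOpaqueHandle) ? ret.v_handle : NULL;
}
```

With the C++ runtime, the equivalent lookup would go through `Module::GetFunction`, but the contract is the same: an integer identifier in, a raw data pointer out.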
C2. Passing parameters from Model-level to Target-level Codegen
Now that the job of the codegen is clear, the next challenge is passing parameters from model-level to target-level codegen. Because the target-level codegen needs to include a new Module function, and the C runtime cannot rely on dynamic lookup such as `dlsym`, parameters need to be included in the same module as the generated functions.
However, at present, TVM is not guaranteed to invoke a target-level codegen for every model. It's possible that trivial models (e.g. `p0 + p1`) may be fully realized at compile time, and an empty module will be returned. This can also happen when all functions are offloaded to accelerators.
Because of this, when linked parameters are enabled, `BuildRelay` emits an additional function: `__lookup_linked_param`. At present, this function contains no TIR code; the target-specific codegen is expected to provide an implementation. However, `BuildRelay` attaches the parameters for the given module to this function as the attribute `tir.linked_params`.
When the target-specific codegen sees this function and finds linked parameters attached, it translates those parameters' data into `static const` arrays and outputs the `__lookup_linked_param` implementation. This provides one global symbol per parameter, easing the task of analyzing binary bloat.
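A hedged sketch of what that `c` codegen output might look like follows. The parameter ids, sizes, and contents are placeholders, and the real function is emitted through the PackedFunc calling convention rather than the plain C signature shown here:

```c
#include <stddef.h>
#include <stdint.h>

/* One static const array per parameter, generated from the tir.linked_params
 * attribute; names follow the __tvm_param__ convention described earlier. */
static const uint8_t __tvm_param__p0[4096] = {0x3f, 0x80, 0x00, 0x00 /* ... */};
static const uint8_t __tvm_param__p1[256] = {0x00 /* ... */};

/* Illustrative core of __lookup_linked_param: a compiler-generated switch over
 * the parameter id, so no runtime lookup table is needed. */
static const void* __lookup_linked_param_body(int64_t param_id) {
  switch (param_id) {
    case 2:
      return __tvm_param__p0;
    case 5:
      return __tvm_param__p1;
    default:
      return NULL;  /* not a linked parameter */
  }
}
```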
This approach is somewhat hacky because, outside of the metadata module, TVM has no mechanism for including model-specific constant blobs. Since we prefer to avoid the metadata module due to the aforementioned linking concerns, we feel it's best to avoid defining another generic model-level blob packager until more examples appear.
C1. Enabling Linked Parameters
Linked parameters could be enabled a number of different ways:
W1. By marking each parameter with a special attribute. Each parameter with the attribute would be linked.
W2. With a target flag `--link-params`.
W3. With an additional parameter to `relay.build`.
W4. With a PassContext option.
W1 gives the finest-grained control, but it is complex because the generated parameters may differ from those passed to `relay.build` due to parameter simplification. It may be worth revisiting this approach when heterogeneous execution is considered.
W2 is the simplest, but it does mean that linked parameters require different autotuning schedules. It’s not clear whether this is a good or bad thing; for µTVM, parameter access time may differ when loading from flash vs RAM, so separating the autotuning schedules is actually desirable.
W3 is a fairly high-level API change for such a specific feature. It also means that, unlike W2, the parameter is not propagated to target-level codegens. Those codegens then need to rely on other means (e.g. checking for the presence of a `__lookup_linked_param` TIR function) to identify a linked-parameter situation.
W4 is a reasonable choice, but it would not invalidate autotuning schedules, and it is a bit odd since, at present, linked parameters are not implemented as a TIR pass. One could envision the implementation moving into a TIR pass, though, so this is up for debate.
Future Directions
This RFC doesn’t tackle a number of challenges with pre-linking parameters, such as:
- Specifying a section for parameters
- Pinning each parameter to a specific memory location
- Supporting heterogeneous execution scenarios (e.g. offloading some parameters to BYOC)
In the future, additional configuration may be needed per parameter (e.g. section specifications, specific address pinning, etc.). This could be done by expanding the `LinkedParamNode` class implemented in the PoC PR. It may be desirable to instead place this information as an IRModule-level attribute. In a world where some parameters are linked using an external BYOC codegen, those parameters could either be omitted or, better, marked as such using `LinkedParamNode`.