[C/C++ runtime] multi-model support

In scenarios where multiple models are used back to back, each with multiple inputs and outputs, the native libraries that TVM produces do not make it easy to connect them:

  • get_num_inputs() returns the count of all input tensors (parameters/weights included) instead of only the model’s real inputs
  • get_output(id) has no support for strings, and since output names are mangled, it is unclear which output a given id corresponds to (both points are illustrated in the sketch after this list)
  • as mentioned in the topic “multithreading and TVM runtime”, there seems to be an issue with sharing the module factory between threads. In a multi-model scenario, each model runs in its own thread, and caching the module factory doesn’t work, forcing each thread to recreate it, which incurs a performance hit
  • while a secondary goal, the names of operators in the graph can be very long, where a simple integer would suffice
  • also a secondary goal, parameters saved in a library are uncompressed. When saved separately and compressed with even a simple gzip, quite a lot of space can be reclaimed
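For concreteness, here is a minimal sketch of the behaviour described in the first two bullets, assuming a TVM version where relay.build() still returns (graph_json, lib, params) and using the graph runtime Python API; the tiny dense model is only a placeholder:

    import numpy as np
    import tvm
    from tvm import relay
    from tvm.contrib import graph_runtime

    # One real input "data" plus one weight "w" that is passed as a parameter.
    data = relay.var("data", shape=(1, 4))
    w = relay.var("w", shape=(4, 4))
    func = relay.Function([data, w], relay.nn.dense(data, w))
    params = {"w": tvm.nd.array(np.zeros((4, 4), dtype="float32"))}

    mod = tvm.IRModule.from_expr(func)
    with tvm.transform.PassContext(opt_level=3):
        graph_json, lib, params = relay.build(mod, target="llvm", params=params)

    module = graph_runtime.create(graph_json, lib, tvm.cpu(0))
    module.set_input(**params)

    # Counts the bound weight as well as the real input "data": there is no
    # way to ask only for the model's inputs, or only for its parameters.
    print(module.get_num_inputs())

    # Outputs can only be fetched by integer index; original names are gone.
    module.run(data=np.ones((1, 4), dtype="float32"))
    print(module.get_output(0))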

What we need

  • get_num_inputs() to return only the inputs of the model,
  • get_num_params() to return only parameters/weights,
  • preserve output node names so that get_output(name) works (a hypothetical sketch of these three items follows this list),
  • make sure two models, each running in its own thread, can cache their module factory at setup time and reuse PackedFuncs as fast as possible,
  • replace parameter names with integers,
  • provide an option to compress parameter tensors, especially when they are stored in the same library; even a default gzip or LZ4 saves a lot of space, and users could plug in more specialized methods.
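To make the wish list concrete, here is a purely hypothetical sketch of how the first three items could look from user code. The wrapper class and its methods are illustrations only, none of this exists in TVM today, and the name lists would ideally be embedded in the library rather than carried on the side:

    class NamedGraphModule:
        """Hypothetical wrapper around the existing graph runtime module."""

        def __init__(self, gmod, input_names, param_names, output_names):
            self._gmod = gmod
            self._input_names = input_names      # real inputs only
            self._param_names = param_names      # weights only
            self._output_names = output_names    # order matches get_output(i)

        def get_num_inputs(self):
            return len(self._input_names)        # proposed: real inputs only

        def get_num_params(self):
            return len(self._param_names)        # proposed: weights only

        def get_output(self, name):
            # Proposed: outputs addressable by their original framework name.
            return self._gmod.get_output(self._output_names.index(name))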

These would be extensions of the existing code, as (most of) this information is already available in the graph runtime, for example. I’m not sure whether there are impacts on the rest of the codebase.

What do you think?

I think we want to dissect these points a bit:

F0: multi-model support

Support for loading multiple models is discussed and resolved as part of the module-based runtime interface ([DISCUSS] Module based Model Runtime Interface). Although the current implementation by @FrozenGene might only handle the single-model case in the initial impl, it should not be hard to add additional support.
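For reference, a minimal sketch of that interface as it exists today, assuming an already exported library (the file name and input name are placeholders); extending the same factory pattern to several models packaged under different names in one library is the part still to be added:

    import numpy as np
    import tvm
    from tvm.contrib import graph_runtime

    lib = tvm.runtime.load_module("deploy.so")   # placeholder library name
    ctx = tvm.cpu(0)

    # The factory function creates a fresh executor with the weights already bound.
    gmod = graph_runtime.GraphModule(lib["default"](ctx))
    gmod.set_input("data", np.zeros((1, 3, 224, 224), dtype="float32"))
    gmod.run()
    out = gmod.get_output(0)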

F1: compress the binary blob

This should not be a hard feature to add (optionally) via a protocol flag in the serialized data. It would, however, introduce a runtime dependency on zlib that may not always be available on embedded devices. The main thing is to keep backward compatibility.
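Until such a flag exists, a hedged sketch of doing the compression on the side, outside the TVM serialization format (here params is the dict returned by the build step and module a graph runtime executor):

    import gzip
    import tvm
    from tvm import relay

    # Serialize the weights with TVM, then apply a plain gzip pass on top.
    blob = relay.save_param_dict(params)
    with open("params.bin.gz", "wb") as f:
        f.write(gzip.compress(bytes(blob)))

    # At load time, decompress before handing the bytes to the runtime.
    with open("params.bin.gz", "rb") as f:
        module.load_params(gzip.decompress(f.read()))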

F2: simplify the operator name

There are pros and cons to this; notably, the operator names are directly tied to the function names themselves and can be useful for debugging. So it may not be a good idea to simplify the names.

F3: multi-threading setups

This is something that is worth some more thought, since the best solution can differ depending on the caller (wanting to control threading vs leaving threading to the TVM runtime) and the platform (embedded vs server).

Right now the graph runtime is assumed to be used in a thread-local fashion, as the local memory is pre-allocated. There can be some opportunities in sharing the parameters (but not the activations) among the executors. The main question on F3 is not about the possibility of optimizations, but about how to do it.

Different thread-safety models can affect the user interface:

  • The stateful set/get/run interface is still the most natural one under minimal resource requirements, and we will likely want to keep it for embedded. This will mean, however, that the graph runtime itself is stateful (since set can affect run). For multi-threaded settings, the user caches an executor in TLS, as sketched after this list.
  • Alternatively, a predict API can be made fully stateless, but it would introduce an additional dependency on Tuple for multiple outputs and might optionally depend on dynamic allocation.
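A minimal sketch of the thread-local pattern from the first bullet, assuming the module-based interface; the library path, the factory name “default” and the input name are placeholders:

    import threading
    import tvm
    from tvm.contrib import graph_runtime

    _tls = threading.local()
    _lib = tvm.runtime.load_module("deploy.so")   # the library itself can be shared

    def get_executor():
        # Create the executor lazily, once per thread, and keep it in TLS.
        if not hasattr(_tls, "gmod"):
            _tls.gmod = graph_runtime.GraphModule(_lib["default"](tvm.cpu(0)))
        return _tls.gmod

    def infer(data):
        gmod = get_executor()
        gmod.set_input("data", data)
        gmod.run()
        return gmod.get_output(0)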

Thanks for splitting the proposal.

Replying about F1: yes, we can generate multiple libraries, but the issue is linking them together. Specifically, there is no way to differentiate inputs vs params/weights, and there is no way to know the names of the outputs as they have been mangled after simplification.

About F2: yes, it may be useful for debugging, but once released I don’t see any need to keep long names. At release, all one needs are the input (not param) and output names.

On the parameter splitting etc.: take a look at the module-based runtime proposal in F0 and see if it meets most of the need. The idea is that the factory module will create a separate runtime for each of the models, rather than a single executor for all models (so you don’t have to differentiate between the outputs of each model).

It is harder to name the outputs: while the inputs are parameters of the function (and thus can be named), the outputs are not; from the functional module, the output is a tuple (not named by default) in the same order as in the original function.

The names in F1 are related to the codegen phase (symbol names in the libs), so it might be possible to rename and remap them by adding an additional renaming pass. However, that might need some additional work, as it can affect both the Relay phase and lowering.

For F1, the current design simply adds multi-model support (in the previous PR I even implemented draft multi-model support to verify the current design), even on different ctxs. But the issue is the unique compiled names, as @tqchen described; we could evaluate and discuss whether we should do this. This could be started in a new thread.

A bit of an interruption to this discussion thanks to the awesome TVM conf last week!

On F0: named output tensors

I made some progress in getting outputs named.

  • the issue starts with the Relay IR: functions return one output; if there are multiple outputs, it’s a tuple
  • each returned tuple element is a tensor (DLTensor) and has no name or id
  • however, when a model is imported from TF, PyTorch, and so on, the outputs have names, and those names are discarded

So I modified the parsers to keep the right mapping from name to output tuple element, and each from_<framework>() front-end method now returns mod, params, output_names. If you find this useful, I can make a PR; let me know.

This is sufficient for connecting a model’s outputs to another model’s inputs, e.g. using a streaming framework (a sketch follows below).
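A hedged sketch of such a connection, with the name-to-index mapping kept outside the TVM libraries; gmod_a and gmod_b stand for graph runtime executors of the two models, and the tensor and input names are placeholders:

    # output_names_a comes from the modified from_<framework>() importer above.
    output_names_a = ["features", "scores"]

    def run_pipeline(image):
        gmod_a.set_input("image", image)
        gmod_a.run()
        # Pick model A's output by its original name instead of a magic index.
        features = gmod_a.get_output(output_names_a.index("features"))
        gmod_b.set_input("features", features)
        gmod_b.run()
        return gmod_b.get_output(0)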

But this keeps the metadata separate from the library generated by TVM. It would be “nicer” if it were embedded inside the library, e.g. in the DLTensor returned when we call get_output(n).

A possible solution: named tensors (DLTensor, NDArray)

On F0: real inputs

While the TVM workflow maintains a separation between function inputs and params, at compile time they are merged as inputs. We should keep get_inputs() returning only the function’s inputs, and add get_params() to the runtime.
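A hedged sketch of the compile-time workaround used today to tell them apart, before that information is lost (mod and params come from a front-end importer such as from_tensorflow()):

    # Every parameter of the Relay "main" function whose name is not in the
    # params dict is a real input; the rest are weights.
    input_names = [v.name_hint for v in mod["main"].params
                   if v.name_hint not in params]
    param_names = [v.name_hint for v in mod["main"].params
                   if v.name_hint in params]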

On F2

The current code limits the compiled function name to 80 characters and replaces it with a hash string otherwise.

I don’t see any need for either. An integer id is sufficient and much smaller. For debugging purposes, a map file could be generated with much better information about the transformations done to produce a compiled function.

Thanks.