Execution order of operators at Runtime in TVM

Dear All,

I am wondering how the execution order of operators is defined at runtime in TVM.
For example, in the code below, add1 and add2 are independent of each other, so how does the TVM runtime schedule them on hardware? (Surely this depends on the target HW, but assume we have a HW that is capable of executing operators in parallel.)

from tvm import relay

x1 = relay.var('x1', shape=(10, 2))
y1 = relay.var('y1', shape=(10, 2))
add1 = relay.op.add(x1, y1)

x2 = relay.var('x2', shape=(10, 2))
y2 = relay.var('y2', shape=(10, 2))
add2 = relay.op.add(x2, y2)

z = add1 * add2
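
For reference, here is roughly how I build and run this graph, as a minimal sketch assuming the standard relay.build / graph_runtime flow on an llvm CPU target:

import numpy as np
import tvm
from tvm.contrib import graph_runtime

# Wrap the expression above into a function and compile it.
func = relay.Function([x1, y1, x2, y2], z)
mod = tvm.IRModule.from_expr(func)
graph_json, lib, params = relay.build(mod, target="llvm")

# Create the graph runtime and execute the graph once.
rt = graph_runtime.create(graph_json, lib, tvm.cpu(0))
for name in ("x1", "y1", "x2", "y2"):
    rt.set_input(name, np.random.rand(10, 2).astype("float32"))
rt.run()
out = rt.get_output(0)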

I asked a similar question here, but I did not get the answer that I was looking for. Looking at the code for GraphRuntime here, it says the operators are executed one by one, but that does not make sense to me: how can operators be executed one by one when we have parallel hardware?

You can modify the code in the graph runtime to print the fused function name, like this:

  // Get compiled function from the module that contains both host and device
  // code.
  tvm::runtime::PackedFunc pf = module_.GetFunction(param.func_name, true);
  CHECK(pf != nullptr) << "no such function in module: " << param.func_name;

  auto fexec = [param, arg_ptr, pf]() {
    printf("%s\n",param.func_name.c_str());
    TVMRetValue rv;
    TVMArgs targs(arg_ptr->arg_values.data(),
                  arg_ptr->arg_tcodes.data(),
                  static_cast<int>(arg_ptr->arg_values.size()));
    pf.CallPacked(targs, &rv);
  };
  return {fexec, arg_ptr};

Each subgraph has been made into a fused function. So I guess the earlier passes bind the threads, and it is not suggested to explore this in the graph runtime. The graph runtime uses a static memory allocation strategy and will be replaced by the Relay VM.
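
As a side note, the same fused function names can be listed from Python without modifying the runtime. A small sketch, assuming the graph JSON layout produced by relay.build, where each "tvm_op" node records the fused function it calls in its func_name attribute:

import json

graph = json.loads(graph_json)  # graph_json as returned by relay.build
for node in graph["nodes"]:
    if node["op"] == "tvm_op":
        print(node["name"], "->", node["attrs"]["func_name"])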


@hht – thank you very much.

So does this mean we cannot enforce parallelism in GraphRuntime? If I understand correctly, it looks like GraphRuntime does not run add1 and add2 in parallel?

Basically, I am wondering if there is a mechanism to enforce parallelism in GraphRuntime from the higher-level Relay/TVM passes. For example, embedding scheduling (parallelism) information into the graph JSON file and having GraphRuntime enforce that schedule?

Is it the case that the runtime won’t run independent subgraphs in parallel? Or is it only certain runtimes that won’t do so?

I’m new to TVM, but a quick search shows some built-in infrastructure:

/work/git_repo/tvm/src/runtime$ grep -R -i parallel *
crt/crt_backend_api.c:int TVMBackendParallelLaunch(FTVMParallelLambda flambda, void* cdata, int num_task) {
crt/crt_backend_api.c:  TVMParallelGroupEnv env;
library_module.cc:  TVM_INIT_CONTEXT_FUNC(TVMBackendParallelLaunch);
library_module.cc:  TVM_INIT_CONTEXT_FUNC(TVMBackendParallelBarrier);
micro/standalone/utvm_runtime_api.h:} TVMParallelGroupEnv;
micro/standalone/utvm_runtime_api.h:typedef int (*FTVMParallelLambda)(int task_id, TVMParallelGroupEnv* penv, void* cdata);
micro/standalone/utvm_runtime_api.h:TVM_MICRO_RUNTIME_API_BACKEND_API int TVMBackendParallelLaunch(FTVMParallelLambda flambda,
micro/standalone/utvm_graph_runtime.cc:  TVM_INIT_CONTEXT_FUNC(TVMBackendParallelLaunch);
micro/standalone/utvm_graph_runtime.cc:// TVM_INIT_CONTEXT_FUNC(TVMBackendParallelBarrier);
micro/standalone/utvm_runtime_api.cc:int TVMBackendParallelLaunch(FTVMParallelLambda flambda, void* cdata, int num_task) {
micro/standalone/utvm_runtime_api.cc:  TVMParallelGroupEnv env;
thread_pool.cc:class ParallelLauncher {
thread_pool.cc:  void Init(FTVMParallelLambda flambda,
thread_pool.cc:  ~ParallelLauncher() {
thread_pool.cc:  static ParallelLauncher* ThreadLocal() {
thread_pool.cc:    return dmlc::ThreadLocalStore<ParallelLauncher>::Get();
thread_pool.cc:  // The parallel lambda
thread_pool.cc:  FTVMParallelLambda flambda;
thread_pool.cc:  TVMParallelGroupEnv env;
thread_pool.cc:    ParallelLauncher* launcher;
thread_pool.cc:  int Launch(FTVMParallelLambda flambda,
thread_pool.cc:    ParallelLauncher* launcher = ParallelLauncher::ThreadLocal();
thread_pool.cc:        << "Cannot launch parallel job inside worker, consider fuse then parallel";
thread_pool.cc:          << "Request parallel sync task larger than number of threads used "
thread_pool.cc:      TVMParallelGroupEnv* penv = &(tsk.launcher->env);
thread_pool.cc:    ParallelLauncher::ThreadLocal()->is_worker = true;
thread_pool.cc:      TVMParallelGroupEnv* penv = &(task.launcher->env);
thread_pool.cc:int TVMBackendParallelLaunch(
thread_pool.cc:    FTVMParallelLambda flambda,
thread_pool.cc:  #pragma omp parallel num_threads(num_workers)
thread_pool.cc:    TVMParallelGroupEnv env;
thread_pool.cc:int TVMBackendParallelBarrier(int task_id, TVMParallelGroupEnv* penv) {

I guess there is no way to enforce parallelism in GraphRuntime. Parallelism only exists inside the module. GraphRuntime is designed to realize heterogeneous execution.

Module GraphRuntimeCreate(const std::string& sym_json,
                          const tvm::runtime::Module& m,
                          const std::vector<TVMContext>& ctxs) {
  auto exec = make_object<GraphRuntime>();
  exec->Init(sym_json, m, ctxs);
  return Module(exec);
}

I guess the JSON is the compute graph, the module contains the packed functions used by the compute graph, and the contexts tell the runtime which device API to choose.
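
On the Python side these are the same three things passed to graph_runtime.create. A rough sketch of the heterogeneous case, assuming (as the ctxs vector above suggests) that a list of contexts is accepted and the per-node device index in the graph JSON selects among them:

import tvm
from tvm.contrib import graph_runtime

# One context per device; each node runs on the device that its
# device index in the graph JSON points to.
rt = graph_runtime.create(graph_json, lib, [tvm.cpu(0), tvm.gpu(0)])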

std::pair<std::function<void()>, std::shared_ptr<GraphRuntime::OpArgs> > GraphRuntime::CreateTVMOp(
    const TVMOpParam& param,
    const std::vector<DLTensor>& args,
    size_t num_inputs) {
  std::shared_ptr<GraphRuntime::OpArgs> arg_ptr = std::make_shared<GraphRuntime::OpArgs>();
  // setup address.
  arg_ptr->args = args;
  if (param.flatten_data) {
    arg_ptr->shape_data.resize(arg_ptr->args.size());
  }
  for (size_t i = 0; i < arg_ptr->args.size(); ++i) {
    TVMValue v;
    DLTensor* t = &arg_ptr->args[i];
    v.v_handle = t;
    arg_ptr->arg_values.push_back(v);
    arg_ptr->arg_tcodes.push_back(kTVMDLTensorHandle);
    if (param.flatten_data) {
      arg_ptr->shape_data[i] = std::accumulate(
          t->shape, t->shape + t->ndim, 1, std::multiplies<int64_t>());
      t->ndim = 1;
      t->shape = &(arg_ptr->shape_data[i]);
    }
  }

  if (param.func_name == "__nop") {
    return {[](){}, arg_ptr};
  } else if (param.func_name == "__copy") {
    // Perform cross device data copy.
    // Directly copy data from the input to the output.
    auto fexec = [arg_ptr]() {
      DLTensor* from = static_cast<DLTensor*>(arg_ptr->arg_values[0].v_handle);
      DLTensor* to = static_cast<DLTensor*>(arg_ptr->arg_values[1].v_handle);
      TVM_CCALL(TVMArrayCopyFromTo(from, to, nullptr));
    };
    return {fexec, arg_ptr};
  }

  // Get compiled function from the module that contains both host and device
  // code.
  tvm::runtime::PackedFunc pf = module_.GetFunction(param.func_name, true);
  CHECK(pf != nullptr) << "no such function in module: " << param.func_name;

  auto fexec = [arg_ptr, pf]() {
    TVMRetValue rv;
    TVMArgs targs(arg_ptr->arg_values.data(),
                  arg_ptr->arg_tcodes.data(),
                  static_cast<int>(arg_ptr->arg_values.size()));
    pf.CallPacked(targs, &rv);
  };
  return {fexec, arg_ptr};
}

There is no strategy to enforce parallelism across the op_execs_. The runtime just does what the graph JSON tells it to do.


I guess the parallelism actually happens inside the modules.

int TVMBackendParallelLaunch(
    FTVMParallelLambda flambda,
    void* cdata,
    int num_task) {
#if !TVM_THREADPOOL_USE_OPENMP
  int res = tvm::runtime::ThreadPool::ThreadLocal()->Launch(
      flambda, cdata, num_task, 1);
  return res;
#else
  int num_workers = tvm::runtime::threading::MaxConcurrency();
  if (num_task == 0) num_task = num_workers;
  omp_set_num_threads(num_workers);
  #pragma omp parallel num_threads(num_workers)
  {
    TVMParallelGroupEnv env;
    env.num_task = num_task;
    (*flambda)(omp_get_thread_num(), &env, cdata);
  }
  return 0;
#endif
}

This function launches the worker threads, and it is invoked from the code generated during the module building process. The graph runtime itself may have nothing to do with parallelism.
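
To make that concrete, here is a small sketch of where such a launch comes from: marking a loop as parallel in the schedule makes the generated module go through the thread pool at runtime, independently of the graph runtime. (A plain element-wise add, just for illustration.)

import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)
s[C].parallel(C.op.axis[0])  # intra-operator parallelism

# The lowered TIR contains a "parallel" loop; when built for llvm,
# that loop is what ends up calling TVMBackendParallelLaunch.
print(tvm.lower(s, [A, B, C], simple_mode=True))
fadd = tvm.build(s, [A, B, C], target="llvm")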


@hht – this is definitely interesting. In my example, add1 and add2 are Op types, so I’d expect them to be run in parallel on HW that is capable of running two adders (“+”) in parallel.

@hht - thank you again. Now it makes more sense. Could you please clarify what you mean by “Parallelism only exists in the module”? My understanding is that there is only one Module, and the module contains multiple graph nodes that can run in parallel.


Yes, you are right. Some key words are confusing.

I guess the module can be seen, in a way, as a hash map with the function name as the key and the function implementation as the value. The following code can be found in graph_runtime_codegen.cc.

  LoweredOutput Codegen(relay::Function func) {
    auto pf = GetPackedFunc("relay.backend.GraphPlanMemory");
    storage_device_map_ = (*pf)(func);
    // First we convert all the parameters into input nodes.
    for (auto param : func->params) {
      auto node_ptr = GraphInputNode::make_node_ptr(param->name_hint(), GraphAttrs());
      var_map_[param.get()] = AddNode(node_ptr, param);
    }
    heads_ = VisitExpr(func->body);
    std::ostringstream os;
    dmlc::JSONWriter writer(&os);
    GetJSON(&writer);
    LoweredOutput ret;
    ret.graph_json = os.str();
    ret.params = params_;

    for (auto& kv : lowered_funcs_) {
      if (ret.lowered_funcs.count(kv.first) == 0) {
        ret.lowered_funcs.Set(kv.first, IRModule::Empty());
      }
      auto& mod = ret.lowered_funcs[kv.first];
      mod->Update(kv.second);
      ret.lowered_funcs.Set(kv.first, mod);
    }
    ret.external_mods = compile_engine_->LowerExternalFunctions();
    return ret;
  }
struct LoweredOutput {
  std::string graph_json;
  Map<std::string, IRModule> lowered_funcs;
  Array<tvm::runtime::Module> external_mods;
  std::unordered_map<std::string, tvm::runtime::NDArray> params;
};

And the IRModule shown above may have something to do with parallelism. At this stage an IRModule is more like a collection of lowered functions than a runtime module.

struct GraphCodegen {
 public:
  GraphCodegen() {
    auto pf = GetPackedFunc("relay.build_module._GraphRuntimeCodegen");
    mod = (*pf)();
  }
...

In build_module.cc, the lowered_funcs are transformed into a runtime module.

class BuildModule(object):
    """Build an IR module to run on TVM graph runtime. This class is used
    to expose the `RelayBuildModule` APIs implemented in C++.
    """
    def __init__(self):
        self.mod = _build_module._BuildModule()
    ...

    def build(self, mod, target=None, target_host=None, params=None):
        target = _update_target(target)

        # Setup the params.
        if params:
            self._set_params(params)
        # Build the IR module
        self._build(mod, target, target_host)
        # Get artifacts
        graph_json = self.get_json()
        mod = self.get_module()
        params = self.get_params()

And when we invoke relay.build, we get graph_json, mod, and params. We then use graph_json, mod, and a ctx to create the graph runtime.
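
For example, the mod returned by relay.build can be queried by fused function name, which is what makes it behave like the name-to-function map mentioned above. (The name below is only hypothetical; the real names are the func_name attributes written into graph_json.)

f = mod.get_function("fused_add", query_imports=True)  # hypothetical name
print(f)  # a PackedFunc: the "value" side of the map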


It’s simple: we support only intra-operator parallelism, not inter-operator parallelism. We use threads for parallelizing the outermost loop of a convolution, for example.


Thank you @hht

This is very useful. I have two follow-up questions.

  1. What is the purpose of external_mods in the LoweredOutput structure?

  2. I am wondering if I can get more details about how codegen in TVM works, i.e. what the sequence of steps is. I know it starts from Relay, but I am not sure I understand the rest of the steps that lower the Relay IR to target HW object code.

Thank you @masahi, this is very helpful. But I am more than a little puzzled.

Let us say you have two HW units capable of running add1 and add2 from the original example. According to your answer, add1 and add2 CANNOT run in parallel? Could you provide some insight on this, and also a link to the source code that enforces this sequential execution?

Yes, two ops, even if they are independent, are run sequentially. The code that executes operators is GraphRuntime::Run, which simply calls each entry of op_execs_ one after another.

If you have custom HW and you are interested in inter-op parallelism, you should be looking at our external codegen mechanism, aka Bring Your Own Codegen (BYOC). See https://docs.tvm.ai/dev/relay_bring_your_own_codegen.html

Basically, the main runtime is still TVM, but you can cut out subgraphs and send them to your own codegen/runtime. There you can execute them any way you like. As long as you get the interface between TVM and your codegen/runtime right, TVM doesn’t need to know how you execute the subgraphs.
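
A rough sketch of that partitioning flow, using the DNNL external codegen as an example (pass names from relay.transform; this assumes TVM was built with the DNNL codegen enabled):

from tvm import relay
from tvm.relay import transform

# Tag ops the external codegen supports, merge adjacent regions, and
# cut them out as separate functions handled by the "dnnl" compiler.
mod = transform.AnnotateTarget("dnnl")(mod)
mod = transform.MergeCompilerRegions()(mod)
mod = transform.PartitionGraph()(mod)

# Partitioned functions end up as external modules; the rest still
# goes through the normal TVM build flow.
graph_json, lib, params = relay.build(mod, target="llvm")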


I guess external_mods holds the modules that need external codegen tools. The official documentation shows the codegen sequence: https://docs.tvm.ai/dev/codebase_walkthrough.html

  /*!
   * \brief Lower the external function using external codegen tools.
   * \return The runtime modules for each needed external codegen tool.
   */
  virtual tvm::Array<tvm::runtime::Module> LowerExternalFunctions() = 0;

It seems that the annotate_target tutorial has not been uploaded yet. I am wondering how one runtime communicates with another runtime. Something like a host runtime and a kernel runtime? Does the host runtime use JIT to generate the kernel’s code (just like VTA)? Or do they use other synchronization mechanisms? Maybe it is serial: just copy from one device to the other and switch runtimes.

def @main(%a: Tensor[(10, 10), float32], %b: Tensor[(10, 10), float32], %c: Tensor[(10, 10), float32], %d: Tensor[(10, 10), float32]) -> Tensor[(10, 10), float32] {
  %0 = add(%a, %b) /* ty=Tensor[(10, 10), float32] */;
  %1 = multiply(%c, %d) /* ty=Tensor[(10, 10), float32] */;
  %2 = device_copy(%1, meta[relay.attrs.DeviceCopyAttrs][0]) /* ty=Tensor[(10, 10), float32] */;
  subtract(%0, %2) /* ty=Tensor[(10, 10), float32] */
}
// meta data omitted. you can use show_meta_data=True to include meta data

See the DNNL example below. Since the TVM runtime is sequential, there is no synchronization of any kind. You just deal with pointers to tensors via DLTensor.

https://github.com/apache/incubator-tvm/tree/master/src/relay/backend/contrib/dnnl
