Hi,
I see that in the LLVM output code (obtained with lib.get_source() after building a model) there are calls to @__TVMBackendParallelLaunch. I assume this function is defined somewhere in the TVM runtime and that it handles some kind of parallelization/multithreading.
Is there any documentation about this? Does anyone know where I could start looking?
wweic · August 1, 2019, 12:41am · #2
The parallel launch API is defined in the TVM runtime:
int TVMBackendParallelLaunch(FTVMParallelLambda flambda,
                             void* cdata,
                             int num_task) {
  int res = tvm::runtime::ThreadPool::ThreadLocal()->Launch(
      flambda, cdata, num_task, 1);
  return res;
}
Here (codegen_cpu) is how calls to this API get emitted:
Thank you!
I have been looking at the code, and there is something I can’t find. From this snippet it seems
  CHECK_LE(num_task, num_workers_used_)
      << "Request parallel sync task larger than number of threads used "
      << " workers=" << num_workers_used_ << " request=" << num_task;
}
launcher->Init(flambda, cdata, num_task, need_sync != 0);
SpscTaskQueue::Task tsk;
tsk.launcher = launcher;
// if worker0 is taken by the master, queues_[0] is abandoned
for (int i = exclude_worker0_; i < num_task; ++i) {
  tsk.task_id = i;
  queues_[i]->Push(tsk);
}
// use the master thread to run task 0
if (exclude_worker0_) {
  TVMParallelGroupEnv* penv = &(tsk.launcher->env);
  if ((*tsk.launcher->flambda)(0, penv, cdata) == 0) {
    tsk.launcher->SignalJobFinish();
  } else {
    tsk.launcher->SignalJobError(tsk.task_id);
  }
}
that each worker receives the same task. Is this correct?
I have to assume, then, that somewhere else there is a mechanism to split the input data so that each worker performs the same operations on a different subset of it. Where would that be defined?
wweic · August 1, 2019, 3:58pm · #4
I think this section creates the parallel lambda, and it uses task_id
to grab its assigned portion of the data:
// Setup the closure function.
BasicBlock *lambda_entry = BasicBlock::Create(*ctx_, "entry", f);
builder_->SetInsertPoint(lambda_entry);
auto it = f->arg_begin();
llvm::Value* task_id = &(*it++);
llvm::Value* penv = &(*it++);
cdata = builder_->CreatePointerCast(&(*it++), cdata->getType());
// setup new variable map, swap it with current var context.
std::unordered_map<const Variable*, llvm::Value*> new_vmap;
UnpackClosureData(cdata, vfields, &new_vmap);
// setup parallel env
ParallelEnv par_env;
par_env.task_id = Var("task_id", Int(32));
par_env.num_task = Var("num_task", Int(32));
new_vmap[par_env.task_id.get()] = task_id;
new_vmap[par_env.num_task.get()] = builder_->CreateLoad(
    builder_->CreateInBoundsGEP(
        penv, {ConstInt32(0), ConstInt32(1)}));
par_env.penv = penv;
std::swap(function_, f);
Is it possible to disable the parallelization? Is it maybe included in one of the “optimization levels”?
masahi · December 10, 2019, 12:03pm · #6
You can set TVM_NUM_THREADS to 1.
Setting TVM_NUM_THREADS still causes the code generator to create the call to @__TVMBackendParallelLaunch, which is what I want to avoid.
(I know I shouldn’t, but I am trying to get rid of the TVM runtime, so I need the LLVM IR to be as clean as possible.)
I’m looking for something similar. Did you ever find a solution to this?