Does Relax VM support multithread targeting on x86 CPU?

Ghost381937 · October 13, 2024, 1:17pm

When I profile with htop or other profiling tools (such as VTune), the parallelism is close to 1, unlike the Relay VM, where the parallelism is close to the number of threads. Here is my code:

import tvm
from tvm import relax
import torch
import numpy as np
from torch import fx
from tvm.relax.frontend.torch import from_fx
from torchvision.models.resnet import ResNet18_Weights, resnet18
import os
import tempfile
num_threads = 16
os.environ["TVM_NUM_THREADS"] = str(num_threads)
device = tvm.cpu(0)
target = tvm.target.Target('llvm')

torch_model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Give the input shape and data type
input_info = [((16, 3, 224, 224), "float32")]

# Convert the model to IRModule
with torch.no_grad():
    torch_fx_model = fx.symbolic_trace(torch_model)
    mod = from_fx(torch_fx_model, input_info)


ex = relax.build(mod, target=target)
vm = relax.VirtualMachine(ex, device=device)

# The device is the CPU, so the buffers are named accordingly
input_data = tvm.nd.array(np.random.rand(16, 3, 224, 224).astype("float32"), device)
output = vm["main"](input_data).numpy()
print(output.shape)

Hzfengsy · October 13, 2024, 8:58am

For e2e build and optimization, please refer to "End-to-End Optimize Model" in the TVM 0.19.dev0 documentation.

Ghost381937 · October 13, 2024, 2:37pm

Sure, but is there any scheduler for CPU multithreading?

Ghost381937 · October 14, 2024, 1:41am

I also noticed that the generated LLVM code from the Relay build version contains __TVMBackendParallelLaunch, but the Relax build version does not. I believe that __TVMBackendParallelLaunch is likely one of the key APIs for multithreaded execution.
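One rough way to make this comparison repeatable is to dump each build's LLVM IR to text and count the call sites that mention the parallel-launch symbol. The helper below is a hypothetical sketch, not a TVM API, and it works on any IR string; how the IR text is obtained from each build is left out here. It matches on the substring TVMBackendParallelLaunch so that it does not depend on how many leading underscores the symbol is printed with.

```python
def count_parallel_launches(llvm_ir: str, symbol: str = "TVMBackendParallelLaunch") -> int:
    """Count call instructions in LLVM IR text that reference `symbol`."""
    count = 0
    for line in llvm_ir.splitlines():
        stripped = line.strip()
        # Skip declarations and IR comments; only actual call sites count.
        if stripped.startswith(("declare", ";")):
            continue
        if "call " in stripped and symbol in stripped:
            count += 1
    return count


# Tiny hand-written IR fragment, for illustration only (not real TVM output).
sample_ir = """
declare i32 @TVMBackendParallelLaunch(ptr, ptr, i32)
define i32 @conv2d_compute(ptr %args) {
  %r = call i32 @TVMBackendParallelLaunch(ptr @conv2d_lambda, ptr %args, i32 0)
  ret i32 %r
}
"""
print(count_parallel_launches(sample_ir))
```

A count of zero for a build would be consistent with the observation above that no loop in that build was lowered to a parallel launch.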

Ghost381937 · October 14, 2024, 4:14am

I got the point, thank you.

Ghost381937 · October 20, 2024, 6:09am

How about tuning a dynamic-shape model? I tried it with MetaSchedule, but it does not seem to support loops with dynamic extents. The following code hit a segmentation fault: the variable extent is a null pointer.

https://github.com/apache/tvm/blob/031508394802a96090ada8314e9ef698a359a42d/src/tir/schedule/analysis/analysis.cc#L1583


Array<tir::StmtSRef> loops = tir::GetLoops(block_sref);
int64_t cum_space_len = 1, cum_reduce_len = 1;
/*
 * Return (-1, -1) if
 *   1. there is some loop with type other than kDataPar and kCommReduce;
 *   2. there is some loop which is dynamic.
 */
for (const tir::StmtSRef& loop_sref : loops) {
  tir::IterVarType type = GetLoopIterType(loop_sref);
  if (type == tir::kDataPar) {
    const int64_t* extent = GetLoopIntExtent(loop_sref);
    if (extent && *extent != -1) {
      cum_space_len *= *extent;
    } else {
      return std::make_pair(-1, -1);
    }
  } else if (type == tir::kCommReduce) {
    const int64_t* extent = GetLoopIntExtent(loop_sref);
    if (extent && *extent != -1) {
      cum_reduce_len *= *extent;
    } else {

Ghost381937 · October 21, 2024, 6:01am

The dynamic-shape tuning problem above is solved, but another error is now raised by the code below: InternalError: Check failed: (!rv_names->count(output)) is false: ValueError: The random variable has been produced once: _

https://github.com/apache/tvm/blob/031508394802a96090ada8314e9ef698a359a42d/src/tir/schedule/trace.cc#L227


  }
}


Array<String> TranslateAddOutputRVs(
    const Array<ObjectRef>& outputs,
    std::unordered_map<ObjectRef, String, ObjectPtrHash, ObjectPtrEqual>* rv_names) {
  Array<String> results;
  results.reserve(outputs.size());
  for (const ObjectRef& output : outputs) {
    int i = rv_names->size();
    ICHECK(!rv_names->count(output))
        << "ValueError: The random variable has been produced once: " << rv_names->at(output);
    String result{ObjectPtr<StringObj>{nullptr}};
    if (!output.defined()) {
      result = "_";
    } else if (output->IsInstance<BlockRVNode>()) {
      result = "b" + std::to_string(i);
    } else if (output->IsInstance<LoopRVNode>()) {
      result = "l" + std::to_string(i);
    } else if (output->IsInstance<VarNode>()) {
      result = "v" + std::to_string(i);

Hzfengsy · October 25, 2024, 2:57am

The current auto-tuning mechanism only supports static shapes.

jujuede · October 30, 2024, 10:53pm

How can I limit the number of CPU cores used by the tuned model in end-to-end compilation?

So far I have tried setting num-cores in the target like this:

target = tvm.target.Target(
      f"llvm -mtriple={tvm.target.codegen.llvm_get_system_triple()} -mcpu={tvm.target.codegen.llvm_get_system_cpu()} -num-cores={parallelism}")

and limiting max_jobs_per_core to 1 in the auto-tuning rules:

ms.schedule_rule.ParallelizeVectorizeUnroll(1, 16, None, True)
ms.mutator.MutateParallel(1)

But the compiled model running in Relax VM seems to use only one core. Did I miss anything?
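For comparison, the first post in this thread caps the runtime thread pool with the TVM_NUM_THREADS environment variable rather than through the target string. A minimal sketch of that approach follows. Here set_tvm_num_threads is a hypothetical helper, not a TVM API, and the sketch assumes the variable is set before the TVM runtime starts its worker threads. Whether this alone changes the single-core behaviour described above is not confirmed in this thread.

```python
import os


def set_tvm_num_threads(requested: int) -> int:
    """Clamp `requested` to the visible CPU count and export TVM_NUM_THREADS."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    chosen = min(requested, available)
    # Assumption: this runs before the TVM runtime creates its thread pool.
    os.environ["TVM_NUM_THREADS"] = str(chosen)
    return chosen


parallelism = set_tvm_num_threads(4)
print(parallelism, os.environ["TVM_NUM_THREADS"])
```

The returned value can then be reused wherever the same limit is needed, for example in the -num-cores field of the target string shown above.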