How to invoke a generated kernel in C++?

Hello, I’m curious how to invoke a TVM-generated kernel (from TOPI, or generated by the auto-scheduler) from C++. (With the Tiramisu compiler, codegen produces a .o file to link against, so I can call the Tiramisu-generated kernel through a C++ API.) For example, I use the following code to find an optimal implementation of Conv2D(512, 512, 7x7, 3x3):

import os
import numpy as np
import tvm
from tvm import te, auto_scheduler, topi
from tvm.topi.testing import conv2d_nchw_python

@auto_scheduler.register_workload
def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
    data = te.placeholder((N, CI, H, W), name="data")
    kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
    bias = te.placeholder((1, CO, 1, 1), name="bias")
    conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
    out = topi.nn.relu(conv + bias)
    return [data, kernel, bias, out]

target = tvm.target.Target("llvm")

N, H, W, CO, CI, KH, KW, stride, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
task = auto_scheduler.SearchTask(
    func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, stride, padding), target=target
)
print(task.compute_dag)

log_file = "conv2d.json"
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=10,
    runner=measure_ctx.runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)

task.tune(tune_option)
sch, args = task.apply_best(log_file)
del measure_ctx

print(tvm.lower(sch, args, simple_mode=True))

Then I want to use the generated kernel in C++ like this:

conv2d_layer(input.buffer(), weight.buffer(), bias.buffer(), out.buffer())

Is this possible?

Yes, it is possible; see for example Deploy TVM Module using C++ API — tvm 0.8.dev0 documentation and tvm_deploy_gpu_sample.cpp · GitHub
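In short: build the tuned schedule into a runtime module on the Python side, export it as a shared library, and then load that library with the C++ runtime as shown in those examples. A minimal sketch of the export step (the symbol name "conv2d" is just an assumption here; it is the name you would later pass to GetFunction in C++):

# Build the tuned schedule and export it as a shared library (sketch).
func = tvm.build(sch, args, target, name="conv2d")
func.export_library("conv2d.so")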

OK, I will check this example. By the way, if I set(USE_OMP gnu) when I build TVM, does that mean I can use omp_set_num_threads(x) in the deploy code to limit the number of CPU cores used by the TVM-generated kernel?

No, for that you need to use the environment variable TVM_NUM_THREADS or OMP_NUM_THREADS.
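For example (a sketch; the binary name is a placeholder, and the variable must be set before the deployed binary starts, since the runtime reads it when its thread pool is created):

>$ export TVM_NUM_THREADS=8   # assumed per-process core budget
>$ ./your_deploy_binary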

Now I have run into a weird problem: I set TVM_NUM_THREADS to 40, and in the wrapper code I use a CPU mask with sched_setaffinity to try to limit my CPU usage, as follows:

#include <dlpack/dlpack.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity
#include <sys/time.h>
#include <cstdlib>     // atoi
#include <iostream>
#include <assert.h>

using namespace std;

float elapsed(struct timeval a, struct timeval b) {
    return 1000000.0 * (b.tv_sec - a.tv_sec) + 1.0 * (b.tv_usec - a.tv_usec);
}

int main(int argc, char** argv) {

    // Restrict this process to the first core_num logical CPUs.
    cpu_set_t mask;
    CPU_ZERO(&mask);
    int core_num = atoi(argv[1]);
    for (int i = 0; i < core_num; i++) CPU_SET(i, &mask);
    sched_setaffinity(0, sizeof(cpu_set_t), &mask);

    int warmup = 20;
    int iter = 1000;
    tvm::runtime::Module mod_dylib = tvm::runtime::Module::LoadFromFile("conv2d.so");
    tvm::runtime::PackedFunc f = mod_dylib.GetFunction("conv2d");
    assert((f != nullptr) && "function pointer not retrieved!");

    DLTensor* data;
    DLTensor* weight;
    DLTensor* bias;
    DLTensor* output;
    int ndim = 4;
    int dtype_code = kDLFloat;
    int dtype_bits = 32;
    int dtype_lanes = 1;
    int device_type = kDLCPU;
    int device_id = 0;
    int64_t shape_data[4] = {1, 512, 7, 7};
    int64_t shape_weight[4] = {512, 512, 3, 3};
    int64_t shape_bias[4] = {1, 512, 1, 1};
    int64_t shape_output[4] = {1, 512, 7, 7};
    TVMArrayAlloc(shape_data, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &data);
    TVMArrayAlloc(shape_weight, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &weight);
    TVMArrayAlloc(shape_bias, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &bias);
    TVMArrayAlloc(shape_output, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &output);

    struct timeval st, ed;
    // Warm up, then report the average latency over `iter` calls in microseconds.
    for (int i = 0; i < warmup; i++) f(data, weight, bias, output);
    gettimeofday(&st, NULL);
    for (int i = 0; i < iter; i++) {
        f(data, weight, bias, output);
    }
    gettimeofday(&ed, NULL);
    cout << elapsed(st, ed) / (1.0 * iter) << endl;
    
    TVMArrayFree(data);
    TVMArrayFree(weight);
    TVMArrayFree(bias);
    TVMArrayFree(output);
    return 0;
}

When I run it, the execution time changes abruptly once the CPU mask size equals TVM_NUM_THREADS. Here is the result:

>$ ./main 8
41429.9
>$ ./main 16
20690.9
>$ ./main 32
11686.2
>$ ./main 40  # here the CPU mask size equals TVM_NUM_THREADS
934.12

I’m curious why this happens. My assumption is that when the affinity mask size equals TVM_NUM_THREADS, every thread can be bound to its own logical CPU, whereas with a smaller mask the TVM threads have to context-switch; but can that really cause such a big overhead?

  • The reason I want to limit CPU core usage is that I want to launch the kernel via MPI.

I profiled with Linux perf, and I have the following findings:

  • This is for export TVM_NUM_THREADS=20
 Performance counter stats for './main':

         12,131.55 msec task-clock                #   22.565 CPUs utilized          
             1,516      context-switches          #    0.125 K/sec                  
                47      cpu-migrations            #    0.004 K/sec                  
            10,760      page-faults               #    0.887 K/sec                  
    32,609,316,421      cycles                    #    2.688 GHz                      (48.57%)
    50,750,525,424      instructions              #    1.56  insn per cycle           (60.59%)
     3,723,837,604      branches                  #  306.955 M/sec                    (60.97%)
         9,001,449      branch-misses             #    0.24% of all branches          (62.12%)
    15,037,970,259      L1-dcache-loads           # 1239.575 M/sec                    (63.23%)
        10,033,182      L1-dcache-load-misses     #    0.07% of all L1-dcache hits    (64.11%)
           915,100      LLC-loads                 #    0.075 M/sec                    (51.24%)
           403,573      LLC-load-misses           #   44.10% of all LL-cache hits     (49.76%)

       0.537622851 seconds time elapsed

       9.256950000 seconds user
       2.880826000 seconds sys
  • This is for export TVM_NUM_THREADS=40 with affinity set to CPUs 0-19
 Performance counter stats for './main 20':

         41,430.67 msec task-clock                #   20.238 CPUs utilized          
             9,508      context-switches          #    0.229 K/sec                  
             1,467      cpu-migrations            #    0.035 K/sec                  
            11,276      page-faults               #    0.272 K/sec                  
   116,523,683,462      cycles                    #    2.812 GHz                      (49.52%)
    63,758,475,858      instructions              #    0.55  insn per cycle           (61.89%)
     7,571,995,781      branches                  #  182.763 M/sec                    (62.02%)
        10,406,752      branch-misses             #    0.14% of all branches          (62.25%)
    17,267,303,342      L1-dcache-loads           #  416.776 M/sec                    (62.71%)
        17,789,100      L1-dcache-load-misses     #    0.10% of all L1-dcache hits    (63.03%)
         2,038,100      LLC-loads                 #    0.049 M/sec                    (50.41%)
           696,546      LLC-load-misses           #   34.18% of all LL-cache hits     (50.07%)

       2.047178665 seconds time elapsed

      37.869758000 seconds user
       3.577391000 seconds sys

With sched_setaffinity there are roughly 6x more context switches and about 30x more CPU migrations than when limiting the threads with the environment variable.

TVM uses a custom thread pool that configures CPU affinity itself, so I don’t recommend messing with affinity from the application side. With TVM_NUM_THREADS=40 but a smaller affinity mask, the 40 worker threads are packed onto fewer cores and end up contending with each other, which matches the lower IPC and the extra context switches and migrations in your perf numbers.