How to invoke a generated kernel in C++?

Hello, I’m curious how to invoke a TVM-generated kernel (from TOPI, or generated by the auto-scheduler) from C++. (With the Tiramisu compiler, codegen produces a .o file to link against, so I can call the Tiramisu-generated kernel through a C++ API.) For example, I use the following code to find an optimal implementation of Conv2D(512, 512, 7x7, 3x3):

import os
import numpy as np
import tvm
from tvm import te, auto_scheduler, topi
from tvm.topi.testing import conv2d_nchw_python

@auto_scheduler.register_workload
def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
    data = te.placeholder((N, CI, H, W), name="data")
    kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
    bias = te.placeholder((1, CO, 1, 1), name="bias")
    conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
    out = topi.nn.relu(conv + bias)
    return [data, kernel, bias, out]

target = tvm.target.Target("llvm")

N, H, W, CO, CI, KH, KW, stride, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
task = auto_scheduler.SearchTask(
    func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, stride, padding), target=target
)
print(task.compute_dag)

log_file = "conv2d.json"
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=10,
    runner=measure_ctx.runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)

task.tune(tune_option)
sch, args = task.apply_best(log_file)
del measure_ctx

print(tvm.lower(sch, args, simple_mode=True))

Then I want to use the generated kernel in C++ like this:

conv2d_layer(input.buffer(), weight.buffer(), bias.buffer(), out.buffer())

Is this possible?

Yes, it is possible; see for example Deploy TVM Module using C++ API — tvm 0.8.dev0 documentation and tvm_deploy_gpu_sample.cpp · GitHub
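In short: build the tuned schedule into a runtime module on the Python side, export it as a shared library, and then load that library with the C++ runtime as shown in those examples. A minimal sketch of the export step (the symbol name "conv2d" is just an assumption here; it is the name you would later pass to GetFunction in C++):

# Build the tuned schedule and export it as a shared library (sketch).
func = tvm.build(sch, args, target, name="conv2d")
func.export_library("conv2d.so")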

OK, I will check this example. By the way, if I set(USE_OMP gnu) when I build TVM, does that mean I can use omp_set_num_threads(x) in the deploy code to limit the number of CPU cores used by the TVM-generated kernel?

No, for that you need to use the environment variable TVM_NUM_THREADS or OMP_NUM_THREADS.
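For example (a sketch; the binary name is a placeholder, and the variable must be set before the deployed binary starts, since the runtime reads it when its thread pool is created):

>$ export TVM_NUM_THREADS=8   # assumed per-process core budget
>$ ./your_deploy_binary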

Now I have run into a weird problem: I set TVM_NUM_THREADS to 40, and in the wrapper code I use a CPU mask with sched_setaffinity to try to limit my CPU usage, as follows:

#include <dlpack/dlpack.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity
#include <sys/time.h>
#include <cstdlib>     // atoi
#include <iostream>
#include <assert.h>

using namespace std;

float elapsed(struct timeval a, struct timeval b) {
    return 1000000.0 * (b.tv_sec - a.tv_sec) + 1.0 * (b.tv_usec - a.tv_usec);
}

int main(int argc, char** argv) {

    // Restrict this process to the first core_num logical CPUs.
    cpu_set_t mask;
    CPU_ZERO(&mask);
    int core_num = atoi(argv[1]);
    for (int i = 0; i < core_num; i++) CPU_SET(i, &mask);
    sched_setaffinity(0, sizeof(cpu_set_t), &mask);

    int warmup = 20;
    int iter = 1000;
    tvm::runtime::Module mod_dylib = tvm::runtime::Module::LoadFromFile("conv2d.so");
    tvm::runtime::PackedFunc f = mod_dylib.GetFunction("conv2d");
    assert((f != nullptr) && "function pointer not retrieved!");

    DLTensor* data;
    DLTensor* weight;
    DLTensor* bias;
    DLTensor* output;
    int ndim = 4;
    int dtype_code = kDLFloat;
    int dtype_bits = 32;
    int dtype_lanes = 1;
    int device_type = kDLCPU;
    int device_id = 0;
    int64_t shape_data[4] = {1, 512, 7, 7};
    int64_t shape_weight[4] = {512, 512, 3, 3};
    int64_t shape_bias[4] = {1, 512, 1, 1};
    int64_t shape_output[4] = {1, 512, 7, 7};
    TVMArrayAlloc(shape_data, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &data);
    TVMArrayAlloc(shape_weight, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &weight);
    TVMArrayAlloc(shape_bias, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &bias);
    TVMArrayAlloc(shape_output, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &output);

    struct timeval st, ed;
    // Warm up, then report the average latency over `iter` calls in microseconds.
    for (int i = 0; i < warmup; i++) f(data, weight, bias, output);
    gettimeofday(&st, NULL);
    for (int i = 0; i < iter; i++) {
        f(data, weight, bias, output);
    }
    gettimeofday(&ed, NULL);
    cout << elapsed(st, ed) / (1.0 * iter) << endl;
    
    TVMArrayFree(data);
    TVMArrayFree(weight);
    TVMArrayFree(bias);
    TVMArrayFree(output);
    return 0;
}

When I run it, the execution time changes abruptly once the CPU mask size equals TVM_NUM_THREADS. Here is the result:

>$ ./main 8
41429.9
>$ ./main 16
20690.9
>$ ./main 32
11686.2
>$ ./main 40  # here the CPU mask size equals TVM_NUM_THREADS
934.12

I’m curious why this happens. My assumption is that when the affinity mask size equals TVM_NUM_THREADS, every thread can be bound to its own logical CPU, whereas with a smaller mask the TVM threads have to context-switch; but can that really cause such a big overhead?

  • The reason I want to limit CPU core usage is that I want to launch the kernel via MPI.

I profiled with Linux perf, and I have the following findings:

  • This is for export TVM_NUM_THREADS=20
 Performance counter stats for './main':

         12,131.55 msec task-clock                #   22.565 CPUs utilized          
             1,516      context-switches          #    0.125 K/sec                  
                47      cpu-migrations            #    0.004 K/sec                  
            10,760      page-faults               #    0.887 K/sec                  
    32,609,316,421      cycles                    #    2.688 GHz                      (48.57%)
    50,750,525,424      instructions              #    1.56  insn per cycle           (60.59%)
     3,723,837,604      branches                  #  306.955 M/sec                    (60.97%)
         9,001,449      branch-misses             #    0.24% of all branches          (62.12%)
    15,037,970,259      L1-dcache-loads           # 1239.575 M/sec                    (63.23%)
        10,033,182      L1-dcache-load-misses     #    0.07% of all L1-dcache hits    (64.11%)
           915,100      LLC-loads                 #    0.075 M/sec                    (51.24%)
           403,573      LLC-load-misses           #   44.10% of all LL-cache hits     (49.76%)

       0.537622851 seconds time elapsed

       9.256950000 seconds user
       2.880826000 seconds sys
  • This is for export TVM_NUM_THREADS=40 with affinity set to CPUs 0-19
 Performance counter stats for './main 20':

         41,430.67 msec task-clock                #   20.238 CPUs utilized          
             9,508      context-switches          #    0.229 K/sec                  
             1,467      cpu-migrations            #    0.035 K/sec                  
            11,276      page-faults               #    0.272 K/sec                  
   116,523,683,462      cycles                    #    2.812 GHz                      (49.52%)
    63,758,475,858      instructions              #    0.55  insn per cycle           (61.89%)
     7,571,995,781      branches                  #  182.763 M/sec                    (62.02%)
        10,406,752      branch-misses             #    0.14% of all branches          (62.25%)
    17,267,303,342      L1-dcache-loads           #  416.776 M/sec                    (62.71%)
        17,789,100      L1-dcache-load-misses     #    0.10% of all L1-dcache hits    (63.03%)
         2,038,100      LLC-loads                 #    0.049 M/sec                    (50.41%)
           696,546      LLC-load-misses           #   34.18% of all LL-cache hits     (50.07%)

       2.047178665 seconds time elapsed

      37.869758000 seconds user
       3.577391000 seconds sys

With sched_setaffinity there are roughly 6x more context switches and about 30x more CPU migrations than when limiting the threads with the environment variable.

TVM uses a custom thread pool that configures CPU affinity itself, so I don’t recommend messing with affinity from the application side. With TVM_NUM_THREADS=40 but a smaller affinity mask, the 40 worker threads are packed onto fewer cores and end up contending with each other, which matches the lower IPC and the extra context switches and migrations in your perf numbers.