Compiling model with target="llvm" not faster

Hi, I’m new to TVM. Today I used TVM to speed up inference for a model in ONNX format, but the resulting model runs inference about 5x slower than the original model:

original ONNX model: 0.7795 s
model built with TVM: 3.4420 s
model after tuning: 2.2411 s

Some code:

import time

import onnx
import tvm
from tvm import autotvm, relay
from tvm.autotvm.tuner import XGBTuner
from tvm.contrib import graph_executor

TARGET = "llvm"

# Import the ONNX model into Relay
st_onnx = time.time()
onnx_model = onnx.load(onnx_path)
input_name = "input"
shape_dict = {input_name: image_infos.shape}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
print("Time to load onnx to tvm: {:0.4f}".format(time.time() - st_onnx))

# Compile and run with the graph executor
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=TARGET, params=params)

model_module = graph_executor.GraphModule(lib["default"](tvm.device(str(TARGET), 0)))
model_module.set_input("input", image_infos)
model_module.run()
tvm_output = model_module.get_output(0)

# Tuning
def tune(mod, params, X_ex):
    number = 10
    repeat = 1
    min_repeat_ms = 0  # since we're tuning on a CPU, can be set to 0
    timeout = 10  # in seconds

    # create a TVM runner
    runner = autotvm.LocalRunner(
        number=number,
        repeat=repeat,
        timeout=timeout,
        min_repeat_ms=min_repeat_ms,
    )

    tuning_option = {
        "tuner": "xgb",
        "trials": 10,
        "early_stopping": 100,
        "measure_option": autotvm.measure_option(
            builder=autotvm.LocalBuilder(build_func="default"), runner=runner
        ),
        "tuning_records": "fied_extraction-autotuning.json",
    }

    tasks = autotvm.task.extract_from_program(mod["main"], target=TARGET, params=params)

    for i, task in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
        tuner_obj = XGBTuner(task, loss_type="rank")
        tuner_obj.tune(
            n_trial=min(tuning_option["trials"], len(task.config_space)),
            early_stopping=tuning_option["early_stopping"],
            measure_option=tuning_option["measure_option"],
            callbacks=[
                autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
                autotvm.callback.log_to_file(tuning_option["tuning_records"]),
            ],
        )

    return
tune(mod, params, image_infos)
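
(For reference: a minimal sketch of how the compiled module's run time alone can be measured with the graph executor's time_evaluator, reusing model_module and TARGET from the snippet above.)

# Time only module execution (compilation excluded), averaged over several runs.
dev = tvm.device(str(TARGET), 0)
timer = model_module.module.time_evaluator("run", dev, number=10, repeat=3)
print("TVM mean inference time: {:0.4f} s".format(timer().mean))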

I don’t know why this happens.
Is there something I’m missing?
CPU info:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              12
On-line CPU(s) list: 0-11
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               165
Model name:          Intel(R) Core(TM) i5-10500 CPU @ 3.10GHz
Stepping:            3
CPU MHz:             1114.296
CPU max MHz:         4500,0000
CPU min MHz:         800,0000
BogoMIPS:            6199.99
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            12288K
NUMA node0 CPU(s):   0-11

This is a Comet Lake CPU.
I’m using these instructions to install TVM:

#!/bin/bash
set -ex
# https://tvm.apache.org/docs/install/from_source.html#install-from-source
if [[ ! -d "/tmp/tvm" ]]; then
    git clone --recursive https://github.com/apache/tvm /tmp/tvm
fi
apt-get update && \
    apt-get install -y gcc libtinfo-dev zlib1g-dev \
        build-essential cmake libedit-dev libxml2-dev \
        llvm-6.0 \
        libgomp1  # S0#61786308
if [[ ! -d "/tmp/tvm/build" ]]; then
    mkdir /tmp/tvm/build
fi
cp /tmp/tvm/cmake/config.cmake /tmp/tvm/build
mv /tmp/tvm/build/config.cmake /tmp/tvm/build/~config.cmake && \
    cat /tmp/tvm/build/~config.cmake | \
        sed -E "s|set\(USE_GRAPH_RUNTIME OFF\)|set\(USE_GRAPH_RUNTIME ON\)|" | \
        sed -E "s|set\(USE_GRAPH_RUNTIME_DEBUG OFF\)|set\(USE_GRAPH_RUNTIME_DEBUG ON\)|" | \
        sed -E "s|set\(USE_LLVM OFF\)|set\(USE_LLVM /usr/bin/llvm-config-6.0\)|" > \
        /tmp/tvm/build/config.cmake
cd /tmp/tvm/build && cmake .. && make -j4
cd /tmp/tvm/python && /usr/local/envs/tvm/bin/python setup.py install --user && cd ..

Short answer:

  1. You need to use a different target and target_host.
  2. You need to use the Auto Scheduler to tune the network and recompile it with the best kernel variants selected.

Long answer: there are several aspects to getting the best performance.

  1. You need to use an appropriate target and target_host that additionally specify -mcpu=<architecture>. The architecture defines which vector instructions are used; the difference between plain “llvm” and “llvm -mcpu=core-avx2” can be several times.

The -mcpu=... value should be specified in both target and target_host, because the former affects TVM's algorithms during network compilation and the latter affects LLVM's code generation, as shown in the example below.

The list of available architectures can be found, for example, here: llvm-project/X86TargetParser.cpp at main · llvm/llvm-project · GitHub. You need to select the highest architecture your hardware actually supports, not the newest one available: if you select skylake-avx512, you will not be able to run this network on desktop platforms or on AMD CPUs.
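
For example (a minimal sketch, reusing mod and params from the snippet above; the exact -mcpu value has to match your CPU):

# Pass -mcpu in both target and target_host: the former drives TVM's schedule
# selection, the latter drives LLVM code generation.
target = "llvm -mcpu=core-avx2"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, target_host=target, params=params)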

  2. TVM has the concept of a schedule. The compute definition of an operation already describes a naive schedule, but it is not efficient. To get the benefit of a particular micro-architecture, its caches, etc., you need an individual schedule for the same operation on each kind of hardware. For example, on x86 it is important to split certain loops by a factor of 8 or 16 and change the memory layout; this enables LLVM to use vector instructions. There are many default schedules in TVM, but they cannot cover all layer parameters and all micro-architectures, e.g. different cache sizes or numbers of cores.

That is why, to get the best performance, you must use AutoTVM or the AutoScheduler. Tuning is a process of trying different schedule parameters for each complex operation in the network and saving the results to a log file. When tuning finishes, you need to recompile the network with this information so that the best schedule is selected for each kernel.
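
With the Auto Scheduler that recompilation step looks roughly like this (a sketch: log_file is whatever was passed to RecordToFile during tuning; with AutoTVM you would use autotvm.apply_history_best instead):

# Recompile with the tuning records so the best schedule found is used per kernel.
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)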

The difference between them: AutoTVM is based on hand-written schedule templates, while the AutoScheduler starts from the compute definition and can create a schedule from scratch.

In my experience the Auto Scheduler gives much better performance for floating-point networks on x86: you need fewer measurements during tuning, tuning starts producing useful results much faster (e.g. a couple of hours instead of a day), and it optimizes all layers evenly, so there is no need to wait for every layer to finish. According to the Auto Scheduler paper, it is better for other architectures as well.

As for AutoTVM's benefits: on x86, LLVM cannot yet generate efficient int8 instructions, and so far the only way to get the best performance for an int8 network on x86 is to use the predefined schedules that enable these intrinsics. Those schedules can only be tuned with AutoTVM.

Please autotune your model to get the best performance.

How do I autotune the model?

You can start from the AutoTVM or AutoScheduler tutorials.

Thanks for your answer.
I tried following it, but I ran into some issues:

  1. Changing the target to “llvm -mcpu=core-avx2” did make inference a little faster: from 3.4420 s to 2.162 s, but it is still not as fast as the original model at 0.7795 s.
  2. I followed the Auto Scheduler tutorial here: Auto-scheduling a Neural Network for x86 CPU — tvm 0.8.dev0 documentation, but the model is still not faster: it runs in 2.2587 s, about the same as the model compiled with TVM without tuning. Here is the tuning code:
import time

import onnx
from tvm import auto_scheduler, relay

log_file = "fied_extraction-autotuning.json"

st_onnx = time.time()
onnx_model = onnx.load(onnx_path)
input_name = "input"
shape_dict = {input_name: image_infos.shape}

mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
print("Time to load onnx to tvm: {:0.4f}".format(time.time() - st_onnx))

print("Extract tasks...")
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, TARGET)

for idx, task in enumerate(tasks):
    print("========== Task %d  (workload key: %s) ==========" % (idx, task.workload_key))
    print(task.compute_dag)

def run_tuning():
    print("Begin tuning...")
    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=200,  # change this to 20000 to achieve the best performance
        runner=auto_scheduler.LocalRunner(repeat=10, enable_cpu_cache_flush=True),
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
    tuner.tune(tune_option)

# Run tuning
run_tuning()
num_measure_trials=200

Do you have a model with only one layer? In general, 200 is too small a number: you need at least 200 × the number of layers to tune. You can also take a look at total_latency.tsv in the current directory; it reports the summed latency of all layers, and once that number stops decreasing you are done and there is no sense in tuning further. As long as you still see progress, it makes sense to continue. A nice thing about the auto scheduler, in contrast to AutoTVM, is that you can simply press Ctrl-C and not wait for tuning to finish; with AutoTVM you have to wait.
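
In code, that rule of thumb looks roughly like this (a sketch; tasks and log_file come from the auto_scheduler snippet above):

# Scale the tuning budget with the number of extracted tasks (layers).
trials_per_task = 200  # the minimum suggested above; 800 per task is used later in this thread
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=trials_per_task * len(tasks),
    runner=auto_scheduler.LocalRunner(repeat=10, enable_cpu_cache_flush=True),
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)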

I increased num_measure_trials to 20000. The number of layers in the model is 29, so the suggested number of trials is roughly 800 * 29 = 23200, and I chose 20000. After all that, I got about the same result as the ONNX model running on onnxruntime:

onnxruntime model: 0.4735 s
tvm model after tuning: 0.5078 s

Is there a way to make it run faster? Here is the tail of total_latency.tsv:

ElapsedTime(s)	17787	EstimatedLatency(ms)	471.479	Trials	10176
ElapsedTime(s)	17837	EstimatedLatency(ms)	471.479	Trials	10240
ElapsedTime(s)	18032	EstimatedLatency(ms)	471.479	Trials	10304
ElapsedTime(s)	18128	EstimatedLatency(ms)	471.479	Trials	10368
ElapsedTime(s)	18224	EstimatedLatency(ms)	471.442	Trials	10432
ElapsedTime(s)	18275	EstimatedLatency(ms)	470.814	Trials	10496
ElapsedTime(s)	18328	EstimatedLatency(ms)	469.353	Trials	10560
ElapsedTime(s)	18379	EstimatedLatency(ms)	468.845	Trials	10624
ElapsedTime(s)	18461	EstimatedLatency(ms)	468.845	Trials	10688
ElapsedTime(s)	18658	EstimatedLatency(ms)	468.845	Trials	10752
ElapsedTime(s)	18692	EstimatedLatency(ms)	467.702	Trials	10816
ElapsedTime(s)	18738	EstimatedLatency(ms)	467.702	Trials	10880
ElapsedTime(s)	18934	EstimatedLatency(ms)	467.702	Trials	10944
ElapsedTime(s)	18980	EstimatedLatency(ms)	467.702	Trials	11008
ElapsedTime(s)	19076	EstimatedLatency(ms)	467.702	Trials	11072
ElapsedTime(s)	19164	EstimatedLatency(ms)	467.702	Trials	11136
ElapsedTime(s)	19360	EstimatedLatency(ms)	467.702	Trials	11200
ElapsedTime(s)	19451	EstimatedLatency(ms)	467.702	Trials	11264
ElapsedTime(s)	19645	EstimatedLatency(ms)	467.702	Trials	11328
ElapsedTime(s)	19697	EstimatedLatency(ms)	467.702	Trials	11392
ElapsedTime(s)	19785	EstimatedLatency(ms)	467.702	Trials	11456
ElapsedTime(s)	19886	EstimatedLatency(ms)	467.702	Trials	11520
ElapsedTime(s)	19946	EstimatedLatency(ms)	467.702	Trials	11584
ElapsedTime(s)	20138	EstimatedLatency(ms)	467.702	Trials	11648
ElapsedTime(s)	20199	EstimatedLatency(ms)	467.702	Trials	11712
ElapsedTime(s)	20402	EstimatedLatency(ms)	467.702	Trials	11776
ElapsedTime(s)	20459	EstimatedLatency(ms)	466.907	Trials	11840
ElapsedTime(s)	20544	EstimatedLatency(ms)	466.907	Trials	11904
ElapsedTime(s)	20658	EstimatedLatency(ms)	466.907	Trials	11968
ElapsedTime(s)	20751	EstimatedLatency(ms)	466.907	Trials	12032
ElapsedTime(s)	20945	EstimatedLatency(ms)	466.907	Trials	12096
ElapsedTime(s)	20987	EstimatedLatency(ms)	465.989	Trials	12160
ElapsedTime(s)	21176	EstimatedLatency(ms)	465.989	Trials	12224
ElapsedTime(s)	21219	EstimatedLatency(ms)	465.989	Trials	12288
ElapsedTime(s)	21326	EstimatedLatency(ms)	465.989	Trials	12352
ElapsedTime(s)	21423	EstimatedLatency(ms)	465.989	Trials	12416
ElapsedTime(s)	21472	EstimatedLatency(ms)	465.989	Trials	12480
ElapsedTime(s)	21665	EstimatedLatency(ms)	465.989	Trials	12544
ElapsedTime(s)	21720	EstimatedLatency(ms)	465.989	Trials	12608
ElapsedTime(s)	21784	EstimatedLatency(ms)	465.747	Trials	12672
ElapsedTime(s)	21877	EstimatedLatency(ms)	465.747	Trials	12736
ElapsedTime(s)	22071	EstimatedLatency(ms)	465.747	Trials	12800
ElapsedTime(s)	22181	EstimatedLatency(ms)	465.747	Trials	12864
ElapsedTime(s)	22274	EstimatedLatency(ms)	465.488	Trials	12928
ElapsedTime(s)	22465	EstimatedLatency(ms)	464.841	Trials	12992
ElapsedTime(s)	22522	EstimatedLatency(ms)	464.773	Trials	13056
ElapsedTime(s)	22719	EstimatedLatency(ms)	464.773	Trials	13120
ElapsedTime(s)	22807	EstimatedLatency(ms)	464.773	Trials	13184
ElapsedTime(s)	23010	EstimatedLatency(ms)	464.773	Trials	13248
ElapsedTime(s)	23063	EstimatedLatency(ms)	464.773	Trials	13312
ElapsedTime(s)	23162	EstimatedLatency(ms)	464.773	Trials	13376
ElapsedTime(s)	23264	EstimatedLatency(ms)	464.773	Trials	13440
ElapsedTime(s)	23465	EstimatedLatency(ms)	464.773	Trials	13504
ElapsedTime(s)	23533	EstimatedLatency(ms)	464.773	Trials	13568
ElapsedTime(s)	23633	EstimatedLatency(ms)	464.773	Trials	13632
ElapsedTime(s)	23688	EstimatedLatency(ms)	464.773	Trials	13696
ElapsedTime(s)	23778	EstimatedLatency(ms)	464.773	Trials	13760
ElapsedTime(s)	23865	EstimatedLatency(ms)	464.773	Trials	13824
ElapsedTime(s)	24060	EstimatedLatency(ms)	464.773	Trials	13888
ElapsedTime(s)	24111	EstimatedLatency(ms)	464.773	Trials	13952
ElapsedTime(s)	24158	EstimatedLatency(ms)	464.773	Trials	14016
ElapsedTime(s)	24355	EstimatedLatency(ms)	464.773	Trials	14080
ElapsedTime(s)	24455	EstimatedLatency(ms)	464.773	Trials	14144
ElapsedTime(s)	24652	EstimatedLatency(ms)	464.773	Trials	14208
ElapsedTime(s)	24749	EstimatedLatency(ms)	464.773	Trials	14272
ElapsedTime(s)	24844	EstimatedLatency(ms)	464.773	Trials	14336
ElapsedTime(s)	24922	EstimatedLatency(ms)	464.773	Trials	14400
ElapsedTime(s)	25119	EstimatedLatency(ms)	464.773	Trials	14464
ElapsedTime(s)	25219	EstimatedLatency(ms)	464.773	Trials	14528
ElapsedTime(s)	25305	EstimatedLatency(ms)	464.773	Trials	14592
ElapsedTime(s)	25356	EstimatedLatency(ms)	464.773	Trials	14656
ElapsedTime(s)	25555	EstimatedLatency(ms)	464.773	Trials	14720
ElapsedTime(s)	25610	EstimatedLatency(ms)	464.109	Trials	14784
ElapsedTime(s)	25656	EstimatedLatency(ms)	463.423	Trials	14848
ElapsedTime(s)	25703	EstimatedLatency(ms)	463.383	Trials	14912
ElapsedTime(s)	25784	EstimatedLatency(ms)	463.383	Trials	14976
ElapsedTime(s)	25982	EstimatedLatency(ms)	463.383	Trials	15040
ElapsedTime(s)	26039	EstimatedLatency(ms)	463.383	Trials	15104
ElapsedTime(s)	26131	EstimatedLatency(ms)	463.383	Trials	15168
ElapsedTime(s)	26174	EstimatedLatency(ms)	462.095	Trials	15232
ElapsedTime(s)	26370	EstimatedLatency(ms)	461.838	Trials	15296
ElapsedTime(s)	26464	EstimatedLatency(ms)	461.647	Trials	15360
ElapsedTime(s)	26667	EstimatedLatency(ms)	461.647	Trials	15424
ElapsedTime(s)	26723	EstimatedLatency(ms)	460.820	Trials	15488
ElapsedTime(s)	26777	EstimatedLatency(ms)	460.820	Trials	15552
ElapsedTime(s)	26856	EstimatedLatency(ms)	460.820	Trials	15616
ElapsedTime(s)	26960	EstimatedLatency(ms)	460.820	Trials	15680
ElapsedTime(s)	27057	EstimatedLatency(ms)	460.820	Trials	15744
ElapsedTime(s)	27256	EstimatedLatency(ms)	460.820	Trials	15808
ElapsedTime(s)	27352	EstimatedLatency(ms)	460.820	Trials	15872
ElapsedTime(s)	27553	EstimatedLatency(ms)	460.820	Trials	15936
ElapsedTime(s)	27610	EstimatedLatency(ms)	460.739	Trials	16000
ElapsedTime(s)	27658	EstimatedLatency(ms)	460.739	Trials	16064
ElapsedTime(s)	27760	EstimatedLatency(ms)	460.739	Trials	16128
ElapsedTime(s)	27803	EstimatedLatency(ms)	460.502	Trials	16192
ElapsedTime(s)	28003	EstimatedLatency(ms)	460.123	Trials	16256
ElapsedTime(s)	28062	EstimatedLatency(ms)	459.634	Trials	16320
ElapsedTime(s)	28171	EstimatedLatency(ms)	459.381	Trials	16384
ElapsedTime(s)	28371	EstimatedLatency(ms)	457.684	Trials	16448
ElapsedTime(s)	28569	EstimatedLatency(ms)	457.684	Trials	16512
ElapsedTime(s)	28765	EstimatedLatency(ms)	457.684	Trials	16576
ElapsedTime(s)	28854	EstimatedLatency(ms)	457.684	Trials	16640
ElapsedTime(s)	29043	EstimatedLatency(ms)	457.570	Trials	16704
ElapsedTime(s)	29151	EstimatedLatency(ms)	457.570	Trials	16768
ElapsedTime(s)	29244	EstimatedLatency(ms)	457.570	Trials	16832
ElapsedTime(s)	29315	EstimatedLatency(ms)	457.570	Trials	16896
ElapsedTime(s)	29373	EstimatedLatency(ms)	457.570	Trials	16960
ElapsedTime(s)	29461	EstimatedLatency(ms)	457.570	Trials	17024
ElapsedTime(s)	29561	EstimatedLatency(ms)	457.570	Trials	17088
ElapsedTime(s)	29622	EstimatedLatency(ms)	456.816	Trials	17152
ElapsedTime(s)	29717	EstimatedLatency(ms)	456.667	Trials	17216
ElapsedTime(s)	29814	EstimatedLatency(ms)	456.667	Trials	17280
ElapsedTime(s)	29864	EstimatedLatency(ms)	456.667	Trials	17344
ElapsedTime(s)	29907	EstimatedLatency(ms)	455.467	Trials	17408
ElapsedTime(s)	29999	EstimatedLatency(ms)	455.406	Trials	17472
ElapsedTime(s)	30085	EstimatedLatency(ms)	455.406	Trials	17536
ElapsedTime(s)	30286	EstimatedLatency(ms)	455.406	Trials	17600
ElapsedTime(s)	30337	EstimatedLatency(ms)	455.406	Trials	17664
ElapsedTime(s)	30411	EstimatedLatency(ms)	455.406	Trials	17728
ElapsedTime(s)	30600	EstimatedLatency(ms)	455.406	Trials	17792
ElapsedTime(s)	30706	EstimatedLatency(ms)	455.406	Trials	17856
ElapsedTime(s)	30764	EstimatedLatency(ms)	455.406	Trials	17920
ElapsedTime(s)	30964	EstimatedLatency(ms)	455.406	Trials	17984
ElapsedTime(s)	31076	EstimatedLatency(ms)	454.799	Trials	18048
ElapsedTime(s)	31174	EstimatedLatency(ms)	454.799	Trials	18112
ElapsedTime(s)	31288	EstimatedLatency(ms)	454.799	Trials	18176
ElapsedTime(s)	31342	EstimatedLatency(ms)	454.799	Trials	18240
ElapsedTime(s)	31402	EstimatedLatency(ms)	454.502	Trials	18304
ElapsedTime(s)	31598	EstimatedLatency(ms)	454.502	Trials	18368
ElapsedTime(s)	31693	EstimatedLatency(ms)	454.435	Trials	18432
ElapsedTime(s)	31887	EstimatedLatency(ms)	454.435	Trials	18496
ElapsedTime(s)	31955	EstimatedLatency(ms)	454.435	Trials	18560
ElapsedTime(s)	32063	EstimatedLatency(ms)	454.435	Trials	18624
ElapsedTime(s)	32164	EstimatedLatency(ms)	454.435	Trials	18688
ElapsedTime(s)	32362	EstimatedLatency(ms)	453.925	Trials	18752
ElapsedTime(s)	32560	EstimatedLatency(ms)	453.769	Trials	18816
ElapsedTime(s)	32751	EstimatedLatency(ms)	453.270	Trials	18880
ElapsedTime(s)	32948	EstimatedLatency(ms)	453.270	Trials	18944
ElapsedTime(s)	33044	EstimatedLatency(ms)	453.270	Trials	19008
ElapsedTime(s)	33100	EstimatedLatency(ms)	453.270	Trials	19072
ElapsedTime(s)	33215	EstimatedLatency(ms)	453.270	Trials	19136
ElapsedTime(s)	33281	EstimatedLatency(ms)	453.086	Trials	19200
ElapsedTime(s)	33481	EstimatedLatency(ms)	453.086	Trials	19264
ElapsedTime(s)	33576	EstimatedLatency(ms)	453.086	Trials	19328
ElapsedTime(s)	33677	EstimatedLatency(ms)	452.989	Trials	19392
ElapsedTime(s)	33741	EstimatedLatency(ms)	452.989	Trials	19456
ElapsedTime(s)	33801	EstimatedLatency(ms)	452.989	Trials	19520
ElapsedTime(s)	33873	EstimatedLatency(ms)	452.989	Trials	19584
ElapsedTime(s)	33975	EstimatedLatency(ms)	452.989	Trials	19648
ElapsedTime(s)	34174	EstimatedLatency(ms)	452.989	Trials	19712
ElapsedTime(s)	34221	EstimatedLatency(ms)	452.103	Trials	19776
ElapsedTime(s)	34326	EstimatedLatency(ms)	452.103	Trials	19840
ElapsedTime(s)	34439	EstimatedLatency(ms)	452.103	Trials	19904
ElapsedTime(s)	34557	EstimatedLatency(ms)	452.103	Trials	19968
ElapsedTime(s)	34615	EstimatedLatency(ms)	452.103	Trials	20032

I could have pressed Ctrl-C, but I wanted the model to improve as much as possible.
It took me a day to wait for the auto-tuning process to finish, and the results didn’t improve much :frowning:

It might be that onnxruntime is already using the hardware resources in the most efficient way, so further improvement of the inference time is possible but might be hard, and TVM has reached the same, essentially optimal, result. It’s hard to say without looking at the model. Is it a publicly available model? Does it have mostly conv layers or matmul/dense?

Is it the same model as in the beginning? If so, there is progress compared to the earlier TVM results: from 3.4 s down to 0.5 s. As for the quoted part of the tsv file, I only see the range from about 10000 trials to 20000; tuning probably could have been stopped around that point or even earlier. At the same time, I am pretty sure that if we look at the first lines, the results improved significantly during tuning.

The pure PyTorch model runs inference in 0.7795 s; compared with that, the TVM model after tuning is faster by about 0.2 s.

What are the “first lines” you mentioned here?
I don’t quite understand that sentence.

In the quote I see a part of the tuning trace starting from ElapsedTime(s) 17787 EstimatedLatency(ms) 471.479 Trials 10176; that is what I referred to as the “first line”. If you look at the full file, the real first line should start at around 29*64 = 1856 trials, and the performance from that 1856th trial up to the 10176th should have improved significantly.

Yep — at the first line, trial 1856, the EstimatedLatency(ms) is 1063.960.