Does the auto-scheduler support INT8 tuning on an x86-64 CPU?

We use the auto-scheduler to tune our model on an x86-64 CPU. The model is saved in ONNX format and imported via Relay.

When the model is FP32, everything is fine.

But when the model is converted to INT8 by relay.quantize.qconfig, we receive the error "Cannot find tuned schedules for target=llvm -keys=cpu -link-params=0 -mcpu=tigerlake". It seems that the hash key of the task extracted by the auto-scheduler does not match that of the task being compiled.

We tried adding disabled_pass={"AutoSchedulerLayoutRewrite"} to the model compilation, as below:

> with auto_scheduler.ApplyHistoryBest(log_file):
>     with tvm.transform.PassContext(
>         opt_level=3,
>         config={"relay.backend.use_auto_scheduler": True},
>         disabled_pass={"AutoSchedulerLayoutRewrite"},
>     ):
>         lib = relay.build(mod, target=target, params=params)

This works and the error disappears, but the performance is poor.

The existing auto-scheduler doesn't support int8 optimization. For example, on Tiger Lake you cannot use VNNI with the auto-scheduler.

But the next iteration of our auto-scheduling system is being developed specifically with the exploitation of HW-specific intrinsics in mind. Last week we landed initial support for "auto scheduling with VNNI"; see https://github.com/apache/tvm/pull/11088 and the integration test for int8 BERT.


@masahi Hi, may I ask what VNNI is? Does my CPU support Ansor int8?

vendor_id	: GenuineIntel
cpu family	: 6
model		: 165
model name	: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
stepping	: 5
microcode	: 0xec
cpu MHz		: 2900.000
cache size	: 16384 KB
physical id	: 0
siblings	: 16
core id		: 5
cpu cores	: 8
apicid		: 11
initial apicid	: 11
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs itlb_multihit
bogomips	: 5799.77
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

VNNI is an int8 dot-product instruction available on some Intel CPUs. Your CPU doesn't support it.
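
On Linux, you can check for VNNI by looking for the `avx512_vnni` (or `avx_vnni`) flag in /proc/cpuinfo. A minimal sketch, assuming a Linux-style flags line; the helper name is mine, not a TVM API:

```python
# Minimal sketch: detect VNNI from /proc/cpuinfo flags on Linux.
# The flag names avx512_vnni / avx_vnni are the kernel's reported
# feature-flag names; the helper itself is hypothetical.
def has_vnni(cpuinfo_text: str) -> bool:
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    return bool(flags & {"avx512_vnni", "avx_vnni"})

# Usage on a live system:
# with open("/proc/cpuinfo") as f:
#     print(has_vnni(f.read()))
```

Running this against the i7-10700 flags quoted above would return False, matching the answer.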

Does AutoTVM support pre-quantized TFLite (int8) on microTVM?

That is a correct assertion. At the same time, executing a neural network in int8 mode on any Intel CPU can give up to a 2x speedup; VNNI gives up to 4x.
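
The 2x/4x figures follow from back-of-envelope lane counting. This is an illustration under simplifying assumptions (512-bit vectors, throughput-bound kernels), not a benchmark:

```python
# Back-of-envelope arithmetic behind the "up to 2x / up to 4x" claims.
# One 512-bit fp32 FMA performs 16 multiply-accumulates per instruction.
fp32_macs_per_instr = 512 // 32            # 16

# An AVX-512 VNNI vpdpbusd performs 64 int8 multiply-accumulates.
vnni_macs_per_instr = 512 // 8             # 64

# Without VNNI, the same 64 int8 MACs take roughly a two-instruction
# pmaddubsw + pmaddwd sequence, i.e. ~32 MACs per instruction.
int8_macs_per_instr = vnni_macs_per_instr // 2   # 32

print(int8_macs_per_instr / fp32_macs_per_instr)  # 2.0 -> "up to 2x"
print(vnni_macs_per_instr / fp32_macs_per_instr)  # 4.0 -> "up to 4x"
```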

@jinfagang you can try to see a perf boost even on a Core i7 using AutoTVM with the proper target, i.e. target = "llvm -mcpu=core-avx2" for the Core(TM) i7-10700. The x86 conv2d schedules are implemented with int8 intrinsics for SSE4.2/AVX2/AVX512/VNNI. Another note: int8 is not yet enabled on all platforms for the fully connected layer (aka matmul/dense).
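
A hedged sketch of how one might map /proc/cpuinfo flags to an -mcpu value for the target string. The helper and the flag-to-mcpu mapping are my assumptions for illustration, not part of TVM; check your LLVM build's `llc -mcpu=help` output for authoritative names:

```python
# Hypothetical helper: choose a TVM llvm target string from CPU flags.
# The mapping below is an assumption for illustration only.
def pick_llvm_target(flags: set) -> str:
    if "avx512_vnni" in flags:
        return "llvm -mcpu=cascadelake"       # AVX-512 with VNNI
    if "avx512f" in flags:
        return "llvm -mcpu=skylake-avx512"    # AVX-512, no VNNI
    if "avx2" in flags:
        return "llvm -mcpu=core-avx2"         # e.g. Core i7-10700
    return "llvm"                             # no SIMD-specific tuning

# The i7-10700 from this thread reports avx2 but no avx512 flags:
print(pick_llvm_target({"sse4_2", "avx", "avx2"}))  # llvm -mcpu=core-avx2
```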

@elvin-n Hi, I recently tuned a model with Ansor, but it runs very slowly, even slower than Eigen (the model is simply some matmul, layernorm, etc., very basic matrix calculations).

Could that be because I only set target = "llvm" when searching, without any specs?

BTW, how do I know which specs to specify after llvm? I am not very familiar with LLVM itself or embedded params like AVX, etc.

Unfortunately, Ansor is not able to generate efficient x86 int8 code. The int8 code described above can be generated only with AutoTVM (so far).

It is not enough. It should be at least "llvm -mcpu=core-avx2". The full list of -mcpu values can be taken from here, depending on the target architecture/ISA.

  • As I mentioned above, efficient int8 is enabled on SSE/AVX2/AVX512 only for conv2d; dense requires hardware with VNNI. I.e., if your topology is mostly conv2d, you should see a significant perf gain after AutoTVM and codegen with the proper target.

All my tests were on float32; I haven't used int8 at all.

I have a similar problem: can the auto-scheduler support INT8 tuning on a CUDA target? I have tried, but ran into some issues.

I guess you should set "target" with AVX-512; here is an example on ARM.
