I know we cannot explicitly manage the cache on a CPU, so why does Ansor still generate `cache_write` and `compute_at` operations when I use it to optimize a tensor kernel? My question has two parts (a minimal sketch of what I mean follows the list):
- The lowered IR includes the statement `C_local_1 = T.Buffer((512,), "float32x16", data=C_local, scope="local")`. What exactly happens under the hood here?
- If this buffer is useless on a CPU, could it simply be removed when the user specifies `llvm` as the target?
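For context, here is a minimal sketch of the kind of schedule I mean (hypothetical shapes and names, not my actual kernel): `cache_write` stages the accumulation into a buffer with `scope="local"`, and `compute_at` moves that stage inside the tile loops, which matches the pattern in my lowered IR.

```python
# Minimal sketch (made-up shapes/names, not my real kernel) of how
# cache_write + compute_at produce a scope="local" buffer in the lowered IR.
import tvm
from tvm import te

M, N, K = 512, 512, 512
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
C_local = s.cache_write(C, "local")   # stage the output into a "local" buffer
i, j = s[C].op.axis
io, ii = s[C].split(i, factor=32)
jo, ji = s[C].split(j, factor=32)
s[C].reorder(io, jo, ii, ji)
s[C_local].compute_at(s[C], jo)       # compute the staged buffer inside the tiles

print(tvm.lower(s, [A, B, C], simple_mode=True))  # shows a scope="local" buffer
```

Since a CPU has no separately addressable local memory, I assume this buffer just becomes a small temporary that LLVM may keep in registers; is that the intended effect, or is it dead weight?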
What is more, I'm a little confused about some of the operators in the lowered IR. For example, what do `%` and `//` mean in the statement below?
```
C_1[e_outer_outer_c_outer_outer_fused_a_outer_outer_fused_d_outer_outer_fused_e_outer_inner_fused // 4 * 131072 + e_inner * 32768 + c_inner * 1024 + e_outer_outer_c_outer_outer_fused_a_outer_outer_fused_d_outer_outer_fused_e_outer_inner_fused % 4 // 2 * 512 + a_outer_inner * 128 + a_inner * 32 + e_outer_outer_c_outer_outer_fused_a_outer_outer_fused_d_outer_outer_fused_e_outer_inner_fused % 2 * 16:e_outer_outer_c_outer_outer_fused_a_outer_outer_fused_d_outer_outer_fused_e_outer_inner_fused // 4 * 131072 + e_inner * 32768 + c_inner * 1024 + e_outer_outer_c_outer_outer_fused_a_outer_outer_fused_d_outer_outer_fused_e_outer_inner_fused % 4 // 2 * 512 + a_outer_inner * 128 + a_inner * 32 + e_outer_outer_c_outer_outer_fused_a_outer_outer_fused_d_outer_outer_fused_e_outer_inner_fused % 2 * 16 + 16] = C_local_1[e_inner * 128 + c_inner * 4 + a_inner]
```
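To make the question concrete, here is how I currently read the decomposition (a toy sketch with made-up extents, not TVM code; please correct me if this is wrong): `//` is floor division and `%` is floor modulo, and together they recover the original loop indices from a fused index, matching the `fused // 4`, `fused % 4 // 2`, and `fused % 2` pattern above.

```python
# Toy sketch (made-up extents, not TVM code): three loops of extents
# (E0, E1, E2) fused into a single index f, then decomposed back with // and %.
E0, E1, E2 = 8, 2, 2
for f in range(E0 * E1 * E2):
    i0 = f // (E1 * E2)        # outermost index: // strips the inner extents
    i1 = f % (E1 * E2) // E2   # middle index: % keeps the remainder, // scales it
    i2 = f % E2                # innermost index
    assert f == (i0 * E1 + i1) * E2 + i2  # fusion is invertible
```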
Also, if there is any documentation on the lowering process and on how the lowered IR is translated to LLVM IR, pointers would be very helpful.
I'd really appreciate any help, thanks very much!