I know we cannot explicitly manage the cache on a CPU, so why does Ansor still generate `cache_write` and `compute_at` operations when I use it to optimize a tensor kernel? My question has two parts (a minimal sketch of what I mean follows the list):
- The lowered IR includes the statement `C_local_1 = T.Buffer((512,), "float32x16", data=C_local, scope="local")`. What exactly happens under the hood here?
- If this buffer is useless on a CPU, could it simply be removed when the user specifies `llvm` as the target?
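For context, here is a minimal sketch of the kind of schedule I mean (hypothetical shapes and names, not my actual kernel): `cache_write` stages the accumulation into a buffer with `scope="local"`, and `compute_at` moves that stage inside the tile loops, which matches the pattern in my lowered IR.

```python
# Minimal sketch (made-up shapes/names, not my real kernel) of how
# cache_write + compute_at produce a scope="local" buffer in the lowered IR.
import tvm
from tvm import te

M, N, K = 512, 512, 512
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
C_local = s.cache_write(C, "local")   # stage the output into a "local" buffer
i, j = s[C].op.axis
io, ii = s[C].split(i, factor=32)
jo, ji = s[C].split(j, factor=32)
s[C].reorder(io, jo, ii, ji)
s[C_local].compute_at(s[C], jo)       # compute the staged buffer inside the tiles

print(tvm.lower(s, [A, B, C], simple_mode=True))  # shows a scope="local" buffer
```

Since a CPU has no separately addressable local memory, I assume this buffer just becomes a small temporary that LLVM may keep in registers; is that the intended effect, or is it dead weight?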
What is more, I'm a little confused about some of the operators in the lowered IR. For example, what do `%` and `//` mean in the statement below?
```
C_1[e_outer_outer_c_outer_outer_fused_a_outer_outer_fused_d_outer_outer_fused_e_outer_inner_fused // 4 * 131072 + e_inner * 32768 + c_inner * 1024 + e_outer_outer_c_outer_outer_fused_a_outer_outer_fused_d_outer_outer_fused_e_outer_inner_fused % 4 // 2 * 512 + a_outer_inner * 128 + a_inner * 32 + e_outer_outer_c_outer_outer_fused_a_outer_outer_fused_d_outer_outer_fused_e_outer_inner_fused % 2 * 16:e_outer_outer_c_outer_outer_fused_a_outer_outer_fused_d_outer_outer_fused_e_outer_inner_fused // 4 * 131072 + e_inner * 32768 + c_inner * 1024 + e_outer_outer_c_outer_outer_fused_a_outer_outer_fused_d_outer_outer_fused_e_outer_inner_fused % 4 // 2 * 512 + a_outer_inner * 128 + a_inner * 32 + e_outer_outer_c_outer_outer_fused_a_outer_outer_fused_d_outer_outer_fused_e_outer_inner_fused % 2 * 16 + 16] = C_local_1[e_inner * 128 + c_inner * 4 + a_inner]
```
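To make the question concrete, here is how I currently read the decomposition (a toy sketch with made-up extents, not TVM code; please correct me if this is wrong): `//` is floor division and `%` is floor modulo, and together they recover the original loop indices from a fused index, matching the `fused // 4`, `fused % 4 // 2`, and `fused % 2` pattern above.

```python
# Toy sketch (made-up extents, not TVM code): three loops of extents
# (E0, E1, E2) fused into a single index f, then decomposed back with // and %.
E0, E1, E2 = 8, 2, 2
for f in range(E0 * E1 * E2):
    i0 = f // (E1 * E2)        # outermost index: // strips the inner extents
    i1 = f % (E1 * E2) // E2   # middle index: % keeps the remainder, // scales it
    i2 = f % E2                # innermost index
    assert f == (i0 * E1 + i1) * E2 + i2  # fusion is invertible
```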
Also, if there is any documentation on the lowering process and on how the lowered IR is translated to LLVM IR, pointers would be very helpful.
I'd really appreciate any help, thanks very much!