Thanks for your reply, appreciate the help.
Is it possible to achieve this using TE?
I am trying to implement a matmul schedule as follows:
import tvm
from tvm import te

M = te.var("M", "int32")
K = te.var("K", "int32")
N = te.var("N", "int32")
# Algorithm
k = te.reduce_axis((0, K), "k")
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")
######### START OF SCHEDULE ###########
s = te.create_schedule(C.op)
##### System ####
##### Read (A,B) External -> Internal
AL = s.cache_read(A, "local.vtcm", [C])
BL = s.cache_read(B, "local.vtcm", [C])
##### Write (C) Internal -> External
CC = s.cache_write(C, "local.vtcm_Block_2")
##### For now, let's set tile_k to be 128
tile_k = 128
tile_m = 64
tile_n = 64
##### Tiling for Reading AL buf , BL buf , and for the kernel CC #####
m_a, n_a = s[AL].op.axis
ma_outer, na_outer, ma_inner, na_inner = s[AL].tile(m_a, n_a, tile_m, tile_k)
m_b, n_b = s[BL].op.axis
mb_outer, nb_outer, mb_inner, nb_inner = s[BL].tile(m_b, n_b, tile_k, tile_n)
mc, nc = s[CC].op.axis
mc_outer, nc_outer, mc_inner, nc_inner = s[CC].tile(mc, nc, tile_m, tile_n)
##### Kernel ####
# We want to load 16 units each iteration
no, ni = s[CC].split(nc_inner, factor=16)
(kaxis,) = s[CC].op.reduce_axis
ko, ki = s[CC].split(kaxis, factor=tile_k)
s[CC].reorder(ko, mc_inner, ki, no, ni)
s[CC].vectorize(ni)
##### Tiling for Writing C to external ######
m, n = s[C].op.axis
m_outer, n_outer, m_inner, n_inner = s[C].tile(m, n, tile_m, tile_n)
###### Move the read and write stages to the tiled compute iterators #######
s[AL].compute_at(s[CC], ko)
s[BL].compute_at(s[CC], ko)
s[CC].compute_at(s[C], n_outer)
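For reference, this is roughly how I lower the schedule to inspect the resulting TIR (a minimal sketch; the exact argument list is my assumption):

# Sketch: lower the schedule to TIR to inspect the generated allocations
print(tvm.lower(s, [A, B, C], simple_mode=True))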
The following IR is generated for AL and BL:
A_local_vtcm = T.allocate([8192], "float32", "local.vtcm")
B_local_vtcm = T.allocate([8192], "float32", "local.vtcm")
However, I need the allocation to have the shape [2][4096], i.e. a two-dimensional allocation rather than a flat one.
If this is not possible to achieve in TE, how is axis_separator supported by the DeviceAPI in the AllocDataSpace function?
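To make the question concrete, here is a minimal sketch of what I imagine the TE-side answer might look like, assuming Stage.transform_layout together with te.AXIS_SEPARATOR is the relevant hook for non-flat allocations (the index map itself is illustrative and unverified):

# Assumption: te.AXIS_SEPARATOR in a transform_layout index map splits the
# physical allocation at that point, so the buffer would be allocated as 2-D
# instead of flat. The exact mapping below is only illustrative.
s[AL].transform_layout(
    lambda m, k: [
        m // (tile_m // 2),   # intended first physical dimension
        te.AXIS_SEPARATOR,    # split the allocation into two dimensions here
        m % (tile_m // 2),    # remaining rows
        k,                    # columns
    ]
)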