Hello everyone,

I am trying to do some scheduling using `te.extern` capabilities mixed into my schedules.
The Setup
I modified the External Tensor Functions tutorial slightly in order to call my external function from inside of the loop of another computation stage.
```python
import tvm
from tvm import te
from tvm import tir

data_shape = (3, 4, 8)

# Register the python function
@tvm.register_func("my_py_func")
def my_py_func(data_in, data_mid):
    tvm.nd.array(data_in.asnumpy() + 1).copyto(data_mid)

def tutorial_te_sched():
    bias = te.var("bias", dtype="float32")  # leftover from the tutorial, unused here
    data_in = te.placeholder(data_shape)
    data_mid = te.extern(
        data_shape,
        [data_in],
        lambda ins, outs: tvm.tir.call_packed("my_py_func", ins[0], outs[0]),
    )
    data_out = te.compute(data_shape, lambda i, j, k: data_mid[i, j, k])
    s = te.create_schedule(data_out.op)
    # Move the extern stage into the outer loop of data_out
    s[data_mid].compute_at(s[data_out], s[data_out].op.axis[0])
    return s, [data_in, data_out]

s, place_holders = tutorial_te_sched()
lowered_mod = tvm.lower(s, place_holders)
print(lowered_mod)
```
The printed output looks as follows (the `#1`, `#2`, `#3` markers are mine, referenced in the questions below):

```
primfn(placeholder_1: handle, compute_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {compute: Buffer(compute_2: Pointer(float32), float32, [3, 4, 8], []),
             placeholder: Buffer(placeholder_2: Pointer(float32), float32, [3, 4, 8], [])}
  buffer_map = {placeholder_1: placeholder, compute_1: compute} {
  attr [extern: Pointer(float32)] "storage_scope" = "global";
  allocate(extern, float32, [96]); #1
  for (i: int32, 0, 3) {
    attr [0] "extern_scope" = 0; #2
    #3 next line
    @tir.tvm_call_packed("my_py_func", @tir.tvm_stack_make_array(placeholder_2, @tir.tvm_stack_make_shape(3, 4, 8, dtype=handle), 0, 3, 0f32, 0, dtype=handle), @tir.tvm_stack_make_array(extern, @tir.tvm_stack_make_shape(3, 4, 8, dtype=handle), 0, 3, 0f32, 0, dtype=handle), dtype=int32)
    for (j: int32, 0, 4) {
      for (k: int32, 0, 8) {
        compute_2[(((i*32) + (j*8)) + k)] = (float32*)extern[(((i*32) + (j*8)) + k)]
      }
    }
  }
}
```
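As an aside on reading the inner loops: for a row-major (3, 4, 8) buffer the strides are (4*8, 8, 1) = (32, 8, 1), which is where the `i*32 + j*8 + k` flat index comes from. A small pure-Python check (no TVM needed):

```python
import numpy as np

shape = (3, 4, 8)
flat = np.arange(np.prod(shape), dtype=np.float32)  # flat storage, like `extern`
arr = flat.reshape(shape)                           # row-major (C-order) view

# Element (i, j, k) lives at flat index i*32 + j*8 + k.
for i in range(3):
    for j in range(4):
        for k in range(8):
            assert arr[i, j, k] == flat[i * 32 + j * 8 + k]
```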
Questions

- The allocated size for `extern` (see `#1`) is calculated to be 96, which is 3x4x8. Why is the size not calculated to be 32 (4x8), since the stage is computed inside the `i` loop? My guess is that the shape given to `te.extern` blocks any optimization that would otherwise result from moving this stage into another one.
- What exactly is the `attr [0] "extern_scope" = 0;` statement (see `#2`)?
- How are `te.extern(..., tir.call_packed(...))` statements translated to the shown TIR statement (see `#3`)? In other words, what would be the way of building the same statement using the `tir.ir_builder`?
Thanks a lot for your help