Tiled tensors for external (python) functions and TIR == te.extern?

Hello everyone,

I am trying to write schedules that mix te.extern stages with regular compute stages.

The Setup

I modified the External Tensor Functions tutorial slightly so that my external function is called from inside the loop of another computation stage.

import tvm
from tvm import te
from tvm import tir

data_shape = (3,4,8)

# Register the Python function so it can be called as a packed function
@tvm.register_func("my_py_func")
def my_py_func(data_in, data_mid):
    # Write data_in + 1 into the output buffer data_mid
    tvm.nd.array(data_in.asnumpy() + 1).copyto(data_mid)

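def tutorial_te_sched():
    bias = te.var("bias", dtype="float32")
    data_in = te.placeholder(data_shape)
    # External stage: calls the registered Python function on the whole tensor
    data_mid = te.extern(
        data_shape,
        [data_in],
        lambda ins, outs: tvm.tir.call_packed("my_py_func", ins[0], outs[0]),
    )
    # Plain copy stage that consumes the external stage
    data_out = te.compute(data_shape, lambda i, j, k: data_mid[i, j, k])
    s = te.create_schedule(data_out.op)
    # Place the external stage inside the outermost loop (axis i) of data_out
    s[data_mid].compute_at(s[data_out], s[data_out].op.axis[0])
    return s, [data_in, data_out]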


s, place_holders = tutorial_te_sched()
lowered_mod = tvm.lower(s, place_holders)
print(lowered_mod)

The printed output looks as follows (the #1, #2 and #3 markers are my own annotations and correspond to the questions below):

primfn(placeholder_1: handle, compute_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {compute: Buffer(compute_2: Pointer(float32), float32, [3, 4, 8], []),
             placeholder: Buffer(placeholder_2: Pointer(float32), float32, [3, 4, 8], [])}
  buffer_map = {placeholder_1: placeholder, compute_1: compute} {
  attr [extern: Pointer(float32)] "storage_scope" = "global";
  allocate(extern, float32, [96]); #1 
  for (i: int32, 0, 3) {
    attr [0] "extern_scope" = 0; #2 
     #3 next line
    @tir.tvm_call_packed("my_py_func", @tir.tvm_stack_make_array(placeholder_2, @tir.tvm_stack_make_shape(3, 4, 8, dtype=handle), 0, 3, 0f32, 0, dtype=handle), @tir.tvm_stack_make_array(extern, @tir.tvm_stack_make_shape(3, 4, 8, dtype=handle), 0, 3, 0f32, 0, dtype=handle), dtype=int32) 
    for (j: int32, 0, 4) {
      for (k: int32, 0, 8) {
        compute_2[(((i*32) + (j*8)) + k)] = (float32*)extern[(((i*32) + (j*8)) + k)]
      }
    }
  }
}
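To make my reading of this output explicit, here is a minimal pure-Python sketch of the loop nest as I understand it. It does not use TVM at all: fake_my_py_func, run_lowered and call_count are hypothetical stand-ins I made up for illustration, and flat Python lists stand in for the buffers. Assuming I read the TIR correctly, the packed function is called once per iteration of i and is handed the full (3, 4, 8) tensors every time.

```python
from math import prod

SHAPE = (3, 4, 8)
N = prod(SHAPE)  # 96 elements, matching allocate(extern, float32, [96])

call_count = {"n": 0}  # hypothetical counter, only for this illustration


def fake_my_py_func(data_in, data_mid):
    """Stand-in for the packed call: writes data_in + 1 into the WHOLE buffer."""
    call_count["n"] += 1
    for idx in range(len(data_in)):
        data_mid[idx] = data_in[idx] + 1.0


def run_lowered(data_in):
    """Mimics the printed loop nest with flat lists instead of TVM buffers."""
    extern = [0.0] * N
    compute = [0.0] * N
    for i in range(SHAPE[0]):
        # The call sits inside the i loop and sees the full shape each time
        fake_my_py_func(data_in, extern)
        for j in range(SHAPE[1]):
            for k in range(SHAPE[2]):
                off = i * 32 + j * 8 + k  # same index expression as in the TIR
                compute[off] = extern[off]
    return compute


data = [float(x) for x in range(N)]
result = run_lowered(data)
print(call_count["n"])
```

If this model is right, the result is still correct, but the external function does the full amount of work three times.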

Questions

  1. The allocated size for extern is 96, which is 3x4x8. Why is it not 32 (4x8), given that the stage is computed at the outermost axis of data_out? My guess is that the fixed shape passed to te.extern blocks the buffer shrinking that would normally follow from computing this stage inside another one.

  2. What exactly does the attr [0] "extern_scope" = 0; statement mean?

  3. How are te.extern(..., tir.call_packed(...)) statements translated into the TIR statement shown above? In other words, how would I build the same statement using the tir.ir_builder?
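To make the numbers in question 1 concrete, here is a small sketch of the index arithmetic only (again plain Python, no TVM; flat_offset and offsets_read are hypothetical helper names). It lists which flat offsets of extern are read for a fixed i, which is why I expected a 32-element buffer to be enough.

```python
from math import prod

SHAPE = (3, 4, 8)


def flat_offset(i, j, k):
    """Row-major offset, same expression as in the printed TIR."""
    return i * 32 + j * 8 + k


def offsets_read(i):
    """Flat offsets of extern that the inner j/k loops read for a fixed i."""
    return [flat_offset(i, j, k)
            for j in range(SHAPE[1])
            for k in range(SHAPE[2])]


full_size = prod(SHAPE)          # what was allocated
per_iter = len(offsets_read(0))  # what one iteration of i actually reads
print(full_size, per_iter)       # prints: 96 32
```

For each i the reads cover one contiguous block of 32 elements starting at i*32, so the indexing would also have to be rebased if the buffer were shrunk.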

Thanks a lot for your help.

@eric-haibin-lin @ziheng any thoughts? I saw you were involved in PR6079, which seems to be the most recent change to te.extern.