Tiled tensors for external (Python) functions and TIR == te.extern?

cron · February 5, 2021, 5:01pm

Hello everyone,

I am trying to do some scheduling that mixes te.extern stages into my schedules.

The Setup

I modified the External Tensor Functions tutorial slightly in order to call my external function from inside of the loop of another computation stage.

import tvm
from tvm import te
from tvm import tir

data_shape = (3,4,8)

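# Register the Python function so it can be called from TIR by name
@tvm.register_func("my_py_func")
def my_py_func(data_in, data_mid):
    # Add 1 to the input and copy the result into the output buffer
    tvm.nd.array(data_in.asnumpy() + 1).copyto(data_mid)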

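def tutorial_te_sched():
    bias = te.var("bias", dtype="float32")  # not used below
    data_in = te.placeholder(data_shape)
    # Extern stage: calls the registered Python function on the whole tensor
    data_mid = te.extern(
        data_shape,
        [data_in],
        lambda ins, outs: tvm.tir.call_packed("my_py_func", ins[0], outs[0]),
    )
    # Plain copy stage that consumes the extern output
    data_out = te.compute(data_shape, lambda i, j, k: data_mid[i, j, k])
    s = te.create_schedule(data_out.op)
    # Attach the extern stage inside the outermost loop (i) of data_out
    s[data_mid].compute_at(s[data_out], s[data_out].op.axis[0])
    return s, [data_in, data_out]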


s, place_holders = tutorial_te_sched()
lowered_mod = tvm.lower(s,place_holders)
print(lowered_mod)

The printed output looks as follows (the #1, #2, and #3 markers are annotations I added for the questions below):

primfn(placeholder_1: handle, compute_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {compute: Buffer(compute_2: Pointer(float32), float32, [3, 4, 8], []),
             placeholder: Buffer(placeholder_2: Pointer(float32), float32, [3, 4, 8], [])}
  buffer_map = {placeholder_1: placeholder, compute_1: compute} {
  attr [extern: Pointer(float32)] "storage_scope" = "global";
  allocate(extern, float32, [96]); #1 
  for (i: int32, 0, 3) {
    attr [0] "extern_scope" = 0; #2 
     #3 next line
    @tir.tvm_call_packed("my_py_func", @tir.tvm_stack_make_array(placeholder_2, @tir.tvm_stack_make_shape(3, 4, 8, dtype=handle), 0, 3, 0f32, 0, dtype=handle), @tir.tvm_stack_make_array(extern, @tir.tvm_stack_make_shape(3, 4, 8, dtype=handle), 0, 3, 0f32, 0, dtype=handle), dtype=int32) 
    for (j: int32, 0, 4) {
      for (k: int32, 0, 8) {
        compute_2[(((i*32) + (j*8)) + k)] = (float32*)extern[(((i*32) + (j*8)) + k)]
      }
    }
  }
}

Questions

1. The allocated size for extern is 96, which is 3x4x8. Why is the size not 32 (4x8)? My guess is that the shape given to te.extern blocks the optimization that would normally follow from computing this stage inside another one.
2. What exactly does the attr [0] "extern_scope" = 0; statement mean?
3. How are te.extern(..., tir.call_packed(...)) statements translated to the TIR statement shown above? In other words, how would I build the same statement using tir.ir_builder?
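
To make the sizes in the first question concrete, here is a minimal plain-Python sketch (no TVM involved) that reproduces the index arithmetic from the printed output. The names flat_index and touched are made up for this illustration. It only restates the numbers visible above and does not explain why the full buffer is allocated.

```python
# Plain-Python illustration of the buffer sizes and offsets; no TVM required.
from math import prod

data_shape = (3, 4, 8)

# Size of the buffer allocated in the printed TIR: the full extern output.
full_size = prod(data_shape)        # 3 * 4 * 8
# Size one might expect if only one slice per outer iteration i were kept.
slice_size = prod(data_shape[1:])   # 4 * 8


def flat_index(i, j, k):
    """Row-major offset as written in the printed TIR: (i*32) + (j*8) + k."""
    return (i * 32) + (j * 8) + k


# Offsets read by the inner j/k loops for each outer iteration i
# (hypothetical helper structure, for illustration only).
touched = {
    i: [flat_index(i, j, k) for j in range(4) for k in range(8)]
    for i in range(3)
}

print(full_size, slice_size)
for i, offsets in touched.items():
    # First and last offset read during outer iteration i
    print(i, offsets[0], offsets[-1])
```

Each outer iteration reads its own 32-element range of extern, while the packed call in the printed output is still handed the full (3, 4, 8) shape on every iteration.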

Thanks a lot for your help

cron · February 11, 2021, 3:30pm

@eric-haibin-lin @ziheng any thoughts? I saw you were involved in PR6079, which seems to have been the last modification to te.extern.

huangteng · March 8, 2021, 1:13pm

Hi, I am also trying to apply schedule primitives to a tensor.ExternOp, but this does not seem to be supported according to this post from 3 years ago (Optimizing a loop body expressed with tvm.extern). I think what you did just "assigns" the te.extern outputs to another output; the axes inside the te.extern remain invisible, so they cannot be tuned.