Targeting 'c' in VTA's deployment example ==> TVMError: Unresolved call Op(tir.round)

Hello,

I wanted to get a better view of what the compiled output of a network in the VTA deployment example looks like.

More specifically, I wanted to get the C source after building the module; trying to read the LLVM IR output was giving me headaches.

To do that, I changed the ‘llvm’ target used with fsim to ‘c’.
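For reference, a minimal sketch of the change (the variable names are illustrative, not the tutorial's exact ones, and the commented lines need a TVM checkout to actually run):

```python
# Swap the fsim CPU target string from "llvm" to "c".
target = "c"  # was "llvm" in the tutorial

# With TVM available, roughly:
#   graph, lib, params = relay.build(relay_prog, target=target,
#                                    params=params, target_host=env.target_host)
#   print(lib.get_source())  # dump the generated C instead of LLVM IR
```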

The compilation throws the following error:

Exception has occurred: TVMError
Traceback (most recent call last):
  [bt] (8) /home/tvm/build/libtvm.so(tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CastNode const*, std::ostream&)+0x227) [0x7f653bb80e27]
  [bt] (7) /home/tvm/build/libtvm.so(tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)+0xad) [0x7f653bb7beed]
  [bt] (6) /home/tvm/build/libtvm.so(void tvm::codegen::CodeGenCHost::PrintTernaryCondExpr<tvm::tir::MaxNode>(tvm::tir::MaxNode const*, char const*, std::ostream&)+0x57) [0x7f653bb92527]
  [bt] (5) /home/tvm/build/libtvm.so(tvm::tir::ExprFunctor<void (tvm::PrimExpr const&, std::ostream&)>::VisitExpr(tvm::PrimExpr const&, std::ostream&)+0x7c) [0x7f653bb8a97c]
  [bt] (4) /home/tvm/build/libtvm.so(void tvm::codegen::CodeGenCHost::PrintTernaryCondExpr<tvm::tir::MinNode>(tvm::tir::MinNode const*, char const*, std::ostream&)+0x57) [0x7f653bb920b7]
  [bt] (3) /home/tvm/build/libtvm.so(tvm::tir::ExprFunctor<void (tvm::PrimExpr const&, std::ostream&)>::VisitExpr(tvm::PrimExpr const&, std::ostream&)+0x7c) [0x7f653bb8a97c]
  [bt] (2) /home/tvm/build/libtvm.so(tvm::codegen::CodeGenCHost::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)+0x70) [0x7f653bb8f420]
  [bt] (1) /home/tvm/build/libtvm.so(tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)+0x3a2) [0x7f653bb82262]
  [bt] (0) /home/tvm/build/libtvm.so(+0x11fc7b2) [0x7f653bb787b2]
  File "/home/tvm/src/target/source/codegen_c.cc", line 649
TVMError: Unresolved call Op(tir.round)
  File "/home/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 225, in __call__
    raise get_last_ffi_error()
  File "/home/tvm/python/tvm/relay/build_module.py", line 121, in build
    self._build(mod, target, target_host)
  File "/home/tvm/python/tvm/relay/build_module.py", line 255, in build
    graph_json, mod, params = bld_mod.build(mod, target, target_host, params)
  File "/home/tvm/vta/tutorials/frontend/deploy_classification.py", line 200, in <module>
    params=params, target_host=env.target_host)

As a side note, the conv2d optimization tutorial for VTA does work with the same change to the fsim target (i.e. ‘c’ instead of ‘llvm’), but the generated C file does not contain the VTA-specific include statements:

#include "tvm/runtime/c_runtime_api.h"
#include "tvm/runtime/c_backend_api.h"
void* __tvm_module_ctx = NULL;
static void* __tvm_set_device_packed = NULL;
#ifdef __cplusplus
extern "C"
#endif
// rest of the C code

cc @thierry any thoughts on the matter?

Hi @aca88, I haven’t tried changing the VTA codegen from the llvm target to c.

Perhaps a good way to understand the VTA codegen path, and how the runtime API (https://github.com/apache/incubator-tvm/blob/master/vta/runtime/runtime.h) is invoked, is to print out the lowered TVM IR.

The matrix multiply example (https://tvm.apache.org/docs/vta/tutorials/matrix_multiply.html) should show you how to obtain a detailed IR dump:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(int8), int8, [1, 16, 1, 16], []),
             B: Buffer(B_2: Pointer(int8), int8, [16, 16, 16, 16], []),
             A: Buffer(A_2: Pointer(int8), int8, [1, 16, 1, 16], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  attr [C_buf: Pointer(int32)] "storage_scope" = "local.acc_buffer";
  attr [A_buf: Pointer(int8)] "storage_scope" = "local.inp_buffer";
  attr [B_buf: Pointer(int8)] "storage_scope" = "local.wgt_buffer" {
    attr [IterVar(vta: int32, (nullptr), "ThreadIndex", "vta")] "coproc_scope" = 2 {
      attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_uop_scope" = "VTAPushGEMMOp" {
        @tir.call_extern("VTAUopLoopBegin", 16, 1, 0, 0, dtype=int32)
        @tir.vta.uop_push(0, 1, 0, 0, 0, 0, 0, 0, dtype=int32)
        @tir.call_extern("VTAUopLoopEnd", dtype=int32)
      }
      @tir.vta.coproc_dep_push(2, 1, dtype=int32)
    }
    for (ko: int32, 0, 16) {
      attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_scope" = 1 {
        @tir.vta.coproc_dep_pop(2, 1, dtype=int32)
        @tir.call_extern("VTALoadBuffer2D", @tir.tvm_thread_context(@tir.vta.command_handle(, dtype=handle), dtype=handle), A_2, ko, 1, 1, 1, 0, 0, 0, 0, 0, 2, dtype=int32)
        @tir.call_extern("VTALoadBuffer2D", @tir.tvm_thread_context(@tir.vta.command_handle(, dtype=handle), dtype=handle), B_2, ko, 1, 16, 16, 0, 0, 0, 0, 0, 1, dtype=int32)
        @tir.vta.coproc_dep_push(1, 2, dtype=int32)
      }
      attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_scope" = 2 {
        @tir.vta.coproc_dep_pop(1, 2, dtype=int32)
        attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_uop_scope" = "VTAPushGEMMOp" {
          @tir.call_extern("VTAUopLoopBegin", 16, 1, 0, 1, dtype=int32)
          @tir.vta.uop_push(0, 0, 0, 0, 0, 0, 0, 0, dtype=int32)
          @tir.call_extern("VTAUopLoopEnd", dtype=int32)
        }
        @tir.vta.coproc_dep_push(2, 1, dtype=int32)
      }
    }
    @tir.vta.coproc_dep_push(2, 3, dtype=int32)
    @tir.vta.coproc_dep_pop(2, 1, dtype=int32)
    attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_scope" = 3 {
      @tir.vta.coproc_dep_pop(2, 3, dtype=int32)
      @tir.call_extern("VTAStoreBuffer2D", @tir.tvm_thread_context(@tir.vta.command_handle(, dtype=handle), dtype=handle), 0, 4, C_2, 0, 16, 1, 16, dtype=int32)
    }
    @tir.vta.coproc_sync(, dtype=int32)
  }
}

Hey Thierry, thanks for your input.

I know I am doing something unconventional, but I wanted to see how the C code generator behaves for different scenarios.

  • I am not really sure why the C code generator seems to be such a “bad” option when compared to the llvm target. Maybe you can give me some insight?

Anyway, as I said before, both https://tvm.apache.org/docs/vta/tutorials/matrix_multiply.html and https://tvm.apache.org/docs/vta/tutorials/optimize/convolution_opt.html#sphx-glr-vta-tutorials-optimize-convolution-opt-py do work when compiled (i.e. vta.build(...)) with ‘c’ as the target. By “work” I mean that I can generate the C source representation of the schedule. What is definitely missing are the #include statements specific to VTA; without them, a compiler would complain about missing definitions of the VTA runtime functions. They are missing because the TIR->C code generator is agnostic to these extra VTA includes. In the DNNL example, the codegen inserts the required DNNL includes; something similar would be needed for VTA.
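Since the includes are the only blocker for the tutorials that do compile, one hedged workaround is to post-process the generated source before compiling it by hand. The header path below is an assumption based on the TVM source tree layout, so adjust it to your include paths:

```python
# Hypothetical post-processing step: prepend the VTA runtime header that
# the TIR->C codegen omits. The path is an assumption, not from the docs.
VTA_INCLUDES = [
    '#include <vta/runtime/runtime.h>',
]

def add_vta_includes(c_source: str) -> str:
    """Prepend the VTA-specific includes to a generated C source string."""
    return "\n".join(VTA_INCLUDES) + "\n" + c_source
```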

  • Does the llvm compilation process not require some guidance as to where the VTA runtime functions are defined?
    • Are they all in libtvm.so, and is that communicated somewhere in the llvm compilation process?
    • If I print the llvm source, how do I determine that “it will know” where these external functions are?

The error I get (posted above) happens when I try to compile the complete classification graph using the ‘c’ target. It says TVMError: Unresolved call Op(tir.round), so I think the round operator has not been implemented in the TIR->C code generator. I guess this round comes from a part of the graph that is not “offloaded” to the VTA, but I am not sure.
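One hint as to where the round comes from: the backtrace shows a Cast wrapping Max and Min nodes around the failing call, which matches the usual requantization pattern on the CPU side of a quantized graph. A stdlib-Python illustration of that pattern (the function and scale are made up for illustration, not taken from the tutorial):

```python
def requantize(x: float, scale: float) -> int:
    # float -> int8 cast: round to nearest, then clamp to the int8 range.
    # This mirrors the cast(max(min(round(...)))) shape in the backtrace.
    q = round(x / scale)
    return max(-128, min(127, int(q)))
```

If that reading is right, the 'c' target would need a lowering for tir.round (e.g. to the C library's round()) the way the llvm target already has one.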