Targeting 'c' in VTA's deployment example ==> TVMError: Unresolved call Op(tir.round)

Hello,

I wanted to get a better view of what the compiled output of a network in the VTA deployment example looks like.

More specifically, I wanted to get the C source after building the module; trying to read the LLVM IR output was giving me headaches.

To do that, I changed the ‘llvm’ target used with fsim to ‘c’.
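For reference, a minimal sketch of the change (the variable names are illustrative, not the tutorial's exact ones, and the commented lines need a TVM checkout to actually run):

```python
# Swap the fsim CPU target string from "llvm" to "c".
target = "c"  # was "llvm" in the tutorial

# With TVM available, roughly:
#   graph, lib, params = relay.build(relay_prog, target=target,
#                                    params=params, target_host=env.target_host)
#   print(lib.get_source())  # dump the generated C instead of LLVM IR
```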

The compilation throws the following error:

Exception has occurred: TVMError
Traceback (most recent call last):
  [bt] (8) /home/tvm/build/libtvm.so(tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CastNode const*, std::ostream&)+0x227) [0x7f653bb80e27]
  [bt] (7) /home/tvm/build/libtvm.so(tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)+0xad) [0x7f653bb7beed]
  [bt] (6) /home/tvm/build/libtvm.so(void tvm::codegen::CodeGenCHost::PrintTernaryCondExpr<tvm::tir::MaxNode>(tvm::tir::MaxNode const*, char const*, std::ostream&)+0x57) [0x7f653bb92527]
  [bt] (5) /home/tvm/build/libtvm.so(tvm::tir::ExprFunctor<void (tvm::PrimExpr const&, std::ostream&)>::VisitExpr(tvm::PrimExpr const&, std::ostream&)+0x7c) [0x7f653bb8a97c]
  [bt] (4) /home/tvm/build/libtvm.so(void tvm::codegen::CodeGenCHost::PrintTernaryCondExpr<tvm::tir::MinNode>(tvm::tir::MinNode const*, char const*, std::ostream&)+0x57) [0x7f653bb920b7]
  [bt] (3) /home/tvm/build/libtvm.so(tvm::tir::ExprFunctor<void (tvm::PrimExpr const&, std::ostream&)>::VisitExpr(tvm::PrimExpr const&, std::ostream&)+0x7c) [0x7f653bb8a97c]
  [bt] (2) /home/tvm/build/libtvm.so(tvm::codegen::CodeGenCHost::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)+0x70) [0x7f653bb8f420]
  [bt] (1) /home/tvm/build/libtvm.so(tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)+0x3a2) [0x7f653bb82262]
  [bt] (0) /home/tvm/build/libtvm.so(+0x11fc7b2) [0x7f653bb787b2]
  File "/home/tvm/src/target/source/codegen_c.cc", line 649
TVMError: Unresolved call Op(tir.round)
  File "/home/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 225, in __call__
    raise get_last_ffi_error()
  File "/home/tvm/python/tvm/relay/build_module.py", line 121, in build
    self._build(mod, target, target_host)
  File "/home/tvm/python/tvm/relay/build_module.py", line 255, in build
    graph_json, mod, params = bld_mod.build(mod, target, target_host, params)
  File "/home/tvm/vta/tutorials/frontend/deploy_classification.py", line 200, in <module>
    params=params, target_host=env.target_host)

As a side note, the conv2d optimization tutorial for VTA does work with the same change to the fsim target (i.e. ‘c’ instead of ‘llvm’), but the generated C file does not contain the VTA-specific include statements:

#include "tvm/runtime/c_runtime_api.h"
#include "tvm/runtime/c_backend_api.h"
void* __tvm_module_ctx = NULL;
static void* __tvm_set_device_packed = NULL;
#ifdef __cplusplus
extern "C"
#endif
// rest of the C code

cc @thierry any thoughts on the matter?

Hi @aca88, I haven’t tried changing the VTA codegen from the llvm target to c.

Perhaps a good way to understand the VTA codegen path, and how the runtime API (https://github.com/apache/incubator-tvm/blob/master/vta/runtime/runtime.h) is invoked, is to print out the lowered TVM IR.

The matrix multiply example (https://tvm.apache.org/docs/vta/tutorials/matrix_multiply.html) should show you how to obtain a detailed IR dump:

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(int8), int8, [1, 16, 1, 16], []),
             B: Buffer(B_2: Pointer(int8), int8, [16, 16, 16, 16], []),
             A: Buffer(A_2: Pointer(int8), int8, [1, 16, 1, 16], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  attr [C_buf: Pointer(int32)] "storage_scope" = "local.acc_buffer";
  attr [A_buf: Pointer(int8)] "storage_scope" = "local.inp_buffer";
  attr [B_buf: Pointer(int8)] "storage_scope" = "local.wgt_buffer" {
    attr [IterVar(vta: int32, (nullptr), "ThreadIndex", "vta")] "coproc_scope" = 2 {
      attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_uop_scope" = "VTAPushGEMMOp" {
        @tir.call_extern("VTAUopLoopBegin", 16, 1, 0, 0, dtype=int32)
        @tir.vta.uop_push(0, 1, 0, 0, 0, 0, 0, 0, dtype=int32)
        @tir.call_extern("VTAUopLoopEnd", dtype=int32)
      }
      @tir.vta.coproc_dep_push(2, 1, dtype=int32)
    }
    for (ko: int32, 0, 16) {
      attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_scope" = 1 {
        @tir.vta.coproc_dep_pop(2, 1, dtype=int32)
        @tir.call_extern("VTALoadBuffer2D", @tir.tvm_thread_context(@tir.vta.command_handle(, dtype=handle), dtype=handle), A_2, ko, 1, 1, 1, 0, 0, 0, 0, 0, 2, dtype=int32)
        @tir.call_extern("VTALoadBuffer2D", @tir.tvm_thread_context(@tir.vta.command_handle(, dtype=handle), dtype=handle), B_2, ko, 1, 16, 16, 0, 0, 0, 0, 0, 1, dtype=int32)
        @tir.vta.coproc_dep_push(1, 2, dtype=int32)
      }
      attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_scope" = 2 {
        @tir.vta.coproc_dep_pop(1, 2, dtype=int32)
        attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_uop_scope" = "VTAPushGEMMOp" {
          @tir.call_extern("VTAUopLoopBegin", 16, 1, 0, 1, dtype=int32)
          @tir.vta.uop_push(0, 0, 0, 0, 0, 0, 0, 0, dtype=int32)
          @tir.call_extern("VTAUopLoopEnd", dtype=int32)
        }
        @tir.vta.coproc_dep_push(2, 1, dtype=int32)
      }
    }
    @tir.vta.coproc_dep_push(2, 3, dtype=int32)
    @tir.vta.coproc_dep_pop(2, 1, dtype=int32)
    attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_scope" = 3 {
      @tir.vta.coproc_dep_pop(2, 3, dtype=int32)
      @tir.call_extern("VTAStoreBuffer2D", @tir.tvm_thread_context(@tir.vta.command_handle(, dtype=handle), dtype=handle), 0, 4, C_2, 0, 16, 1, 16, dtype=int32)
    }
    @tir.vta.coproc_sync(, dtype=int32)
  }
}

Hey Thierry, thanks for your input.

I know I am doing something unconventional, but I wanted to see how the C code generator behaves for different scenarios.

  • I am not really sure why the C code generator seems to be such a “bad” option when compared to the llvm target. Maybe you can give me some insight?

Anyway, as I said before, both https://tvm.apache.org/docs/vta/tutorials/matrix_multiply.html and https://tvm.apache.org/docs/vta/tutorials/optimize/convolution_opt.html#sphx-glr-vta-tutorials-optimize-convolution-opt-py do work when compiled (i.e. vta.build(...)) with ‘c’ as the target. By “work” I mean that I can generate the C source representation of the schedule. What is definitely missing are the #include statements specific to VTA; without them, a compiler would complain about missing definitions of the VTA runtime functions. They are missing because the TIR->C code generator is agnostic to these extra VTA includes. In the DNNL example, the codegen inserts the required DNNL includes; something similar would be needed for VTA.
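Since the includes are the only blocker for the tutorials that do compile, one hedged workaround is to post-process the generated source before compiling it by hand. The header path below is an assumption based on the TVM source tree layout, so adjust it to your include paths:

```python
# Hypothetical post-processing step: prepend the VTA runtime header that
# the TIR->C codegen omits. The path is an assumption, not from the docs.
VTA_INCLUDES = [
    '#include <vta/runtime/runtime.h>',
]

def add_vta_includes(c_source: str) -> str:
    """Prepend the VTA-specific includes to a generated C source string."""
    return "\n".join(VTA_INCLUDES) + "\n" + c_source
```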

  • Does the llvm compilation process not require some guidance as to where the VTA runtime functions are defined?
    • Are they all in libtvm.so, and is that communicated somewhere in the llvm compilation process?
    • If I print the llvm source, how do I determine that “it will know” where these external functions are?

The error I get (posted above) happens when I try to compile the complete classification graph using the ‘c’ target. It says TVMError: Unresolved call Op(tir.round), so I think the round operator has not been implemented in the TIR->C code generator. I guess this round comes from a part of the graph that is not “offloaded” to the VTA, but I am not sure.
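One hint as to where the round comes from: the backtrace shows a Cast wrapping Max and Min nodes around the failing call, which matches the usual requantization pattern on the CPU side of a quantized graph. A stdlib-Python illustration of that pattern (the function and scale are made up for illustration, not taken from the tutorial):

```python
def requantize(x: float, scale: float) -> int:
    # float -> int8 cast: round to nearest, then clamp to the int8 range.
    # This mirrors the cast(max(min(round(...)))) shape in the backtrace.
    q = round(x / scale)
    return max(-128, min(127, int(q)))
```

If that reading is right, the 'c' target would need a lowering for tir.round (e.g. to the C library's round()) the way the llvm target already has one.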