Hello,
Is it possible to extract a Model Library Format archive (see "Model Library Format — tvm 0.15.dev0 documentation") from TVM-Unity? For regular DNNs this was simple with microTVM:
    # relay_mod, params, config and EXECUTOR come from the usual microTVM setup
    import tvm
    from tvm.relay.backend import Runtime
    from tvm.micro import export_model_library_format

    RUNTIME = Runtime("crt")
    TARGET = tvm.target.Target("c -device=cpu")
    with tvm.transform.PassContext(opt_level=3, config=config):
        built_model = tvm.relay.build(
            relay_mod, target=TARGET, params=params, runtime=RUNTIME, executor=EXECUTOR
        )
    export_model_library_format(built_model, TAR_PATH)
I would then create and compile a standalone executable from the generated C code, which could produce an output for a specific input.
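By "standalone executable" I mean that I unpack the exported archive and compile the generated C sources together with a small main of my own. The unpacking step is just the following (the output directory name is arbitrary):

    # Unpack the Model Library Format archive produced above; the generated C
    # sources land under codegen/host/src/ and the model description is in
    # metadata.json.
    import tarfile

    with tarfile.open(TAR_PATH) as tf:
        tf.extractall("mlf_out")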
I have been experimenting with Llama 2 (from HF, not prebuilt) and q4f16_1 quantization, but so far I have only gotten errors. I was able to compile and run MLC-Chat with the CUDA target. Then I tried:
    python local_build.py --model /path/to/llama2 --target c --quantization q4f16_1
local_build.py calls into local_core.py; both are replicas of files from the mlc-llm repo. I get this error:
> tvm._ffi.base.TVMError: Traceback (most recent call last):
> 150: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)>::AssignTypedLambda<tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}>(tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
> 149: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
> 148: tvm::codegen::Build(tvm::IRModule, tvm::Target)
> ...........
> 0: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
> File "/.../tvm-unity/src/target/source/codegen_c.cc", line 673
> TVMError: Unresolved call Op(tir.call_llvm_pure_intrin)
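In case it matters, this is roughly how I would go looking for where that intrinsic comes from: scan the lowered TIR for tir.call_llvm_pure_intrin calls before the final build step. This is only a sketch on my side; `mod` stands for whatever IRModule local_core.py hands to the build, it is not something from the mlc-llm code base.

    # Sketch: walk every PrimFunc in the module and collect calls to the
    # builtin tir.call_llvm_pure_intrin, which the pure-C backend apparently
    # cannot emit.
    import tvm
    from tvm import tir

    def find_llvm_intrinsics(mod: tvm.IRModule):
        hits = []

        def visit(node):
            if isinstance(node, tir.Call) and isinstance(node.op, tvm.ir.Op):
                if node.op.name == "tir.call_llvm_pure_intrin":
                    hits.append(node)

        for _, func in mod.functions.items():
            if isinstance(func, tir.PrimFunc):
                tir.stmt_functor.post_order_visit(func.body, visit)
        return hits

Every hit would be a call that the C source backend has no lowering for, which matches the failure in codegen_c.cc above.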
Then I tried:
    python local_build.py --model /path/to/llama2 --target llvm --quantization q4f16_1
This worked and generated a "…-cpu.so" library. However, that library doesn't work in MLC-Chat and throws an error:
> tvm._ffi.base.TVMError: Traceback (most recent call last):
> 9: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#5}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
> at /workspace/mlc-llm/cpp/llm_chat.cc:1487
> 8: mlc::llm::LLMChat::PrefillStep(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool, mlc::llm::PlaceInPrompt, tvm::runtime::String)
> at /workspace/mlc-llm/cpp/llm_chat.cc:842
> 7: mlc::llm::LLMChat::ForwardTokens(std::vector<int, std::allocator<int> >, long)
> at /workspace/mlc-llm/cpp/llm_chat.cc:1198
> 6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
> 5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
> 4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
> 3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
> 2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
> 1: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
> 0: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
> File "/.../tvm-unity/src/runtime/library_module.cc", line 78
> TVMError: Assert fail: T.tvm_struct_get(p_lv, 0, 10, "int32") == 1, Argument fused_fused_decode1_take.p_lv.device_type has an unsatisfied constraint: 1 == T.tvm_struct_get(p_lv, 0, 10, "int32")
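If I read the assert correctly, the generated function expects its p_lv argument on device_type 1 (CPU), so MLC-Chat seems to hand the CPU-built library a tensor on another device. As a sanity check outside of MLC-Chat I can at least load the library with the Relax VM on CPU; this is only a sketch with a made-up path, and I don't know the exact calling convention of the generated functions:

    # Load the llvm-built "-cpu.so" directly and create a Relax VM on the CPU
    # device, bypassing MLC-Chat. Input preparation, KV-cache setup and the
    # function names to call are omitted.
    import tvm
    from tvm import relax

    lib = tvm.runtime.load_module("/path/to/llama2-q4f16_1-cpu.so")  # hypothetical path
    vm = relax.VirtualMachine(lib, tvm.cpu())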
I nevertheless tried export_model_library_format() by adding it to local_core.py, but got an error:
> File "/.../python3.8/site-packages/tvm/micro/model_library_format.py", line 663, in export_model_library_format > raise NotImplementedError( > NotImplementedError: Don't know how to export module of type <class 'tvm.relax.vm_build.Executable'>
This error is understandable, of course.
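For completeness, the call I added in local_core.py was essentially the following (a sketch; `ex` stands for the relax Executable returned by the build there):

    # Attempted export of the relax build result; this is what raises the
    # NotImplementedError quoted above.
    from tvm import micro

    micro.export_model_library_format(ex, "llama2_q4f16_1.tar")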
The target=c / target=llvm path is not covered in any of the examples, and I've read that the main focus of MLC-LLM is on GPUs; I only found this possibility while going through the GitHub code. Is there a workaround to make this work? Or, at least, a way to extract a Model Library Format archive for a non-dynamic input (without Relax), but for a quantized language model?
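To make that last question concrete: the fallback I have in mind would mirror the microTVM flow from the top of this post, just with a quantized language model behind it. Everything below is an assumption on my part, and the open question is precisely where a static-shape, q4f16_1-quantized relay_mod and params would come from:

    # Hypothetical fallback: a static-shape Relay build with the C runtime and
    # AOT executor, exported as Model Library Format. relay_mod and params are
    # placeholders for a quantized Llama 2 import I don't know how to obtain.
    import tvm
    from tvm.relay.backend import Executor, Runtime
    from tvm.micro import export_model_library_format

    runtime = Runtime("crt")
    executor = Executor("aot")
    target = tvm.target.Target("c -device=cpu")
    with tvm.transform.PassContext(opt_level=3):
        built = tvm.relay.build(
            relay_mod, target=target, params=params, runtime=runtime, executor=executor
        )
    export_model_library_format(built, "llama2_static.tar")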
P.S. I'm aware of llama.cpp, which could probably give me what I need, but I wanted to try this in TVM first.