Hello,
Is it possible to extract a Model Library Format archive (see "Model Library Format — tvm 0.15.dev0 documentation") from TVM-Unity? For regular DNNs this was simple with microTVM:
    # relay_mod, params, config and EXECUTOR come from the usual microTVM setup
    import tvm
    from tvm.relay.backend import Runtime
    from tvm.micro import export_model_library_format

    RUNTIME = Runtime("crt")
    TARGET = tvm.target.Target("c -device=cpu")
    with tvm.transform.PassContext(opt_level=3, config=config):
        built_model = tvm.relay.build(
            relay_mod, target=TARGET, params=params, runtime=RUNTIME, executor=EXECUTOR
        )
    export_model_library_format(built_model, TAR_PATH)
I would then create and compile a standalone executable from the generated C code, which could produce an output for a specific input.
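By "standalone executable" I mean that I unpack the exported archive and compile the generated C sources together with a small main of my own. The unpacking step is just the following (the output directory name is arbitrary):

    # Unpack the Model Library Format archive produced above; the generated C
    # sources land under codegen/host/src/ and the model description is in
    # metadata.json.
    import tarfile

    with tarfile.open(TAR_PATH) as tf:
        tf.extractall("mlf_out")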
I have been experimenting with Llama 2 (from HF, not prebuilt) and q4f16_1 quantization, but so far I have only gotten errors. I was able to compile and run MLC-Chat with the CUDA target. Then I tried:
    python local_build.py --model /path/to/llama2 --target c --quantization q4f16_1
local_build.py calls into local_core.py; both are replicas of files from the mlc-llm repo. I get this error:
> tvm._ffi.base.TVMError: Traceback (most recent call last):
> 150: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)>::AssignTypedLambda<tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}>(tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
> 149: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
> 148: tvm::codegen::Build(tvm::IRModule, tvm::Target)
> ...........
> 0: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
> File "/.../tvm-unity/src/target/source/codegen_c.cc", line 673
> TVMError: Unresolved call Op(tir.call_llvm_pure_intrin)
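In case it matters, this is roughly how I would go looking for where that intrinsic comes from: scan the lowered TIR for tir.call_llvm_pure_intrin calls before the final build step. This is only a sketch on my side; `mod` stands for whatever IRModule local_core.py hands to the build, it is not something from the mlc-llm code base.

    # Sketch: walk every PrimFunc in the module and collect calls to the
    # builtin tir.call_llvm_pure_intrin, which the pure-C backend apparently
    # cannot emit.
    import tvm
    from tvm import tir

    def find_llvm_intrinsics(mod: tvm.IRModule):
        hits = []

        def visit(node):
            if isinstance(node, tir.Call) and isinstance(node.op, tvm.ir.Op):
                if node.op.name == "tir.call_llvm_pure_intrin":
                    hits.append(node)

        for _, func in mod.functions.items():
            if isinstance(func, tir.PrimFunc):
                tir.stmt_functor.post_order_visit(func.body, visit)
        return hits

Every hit would be a call that the C source backend has no lowering for, which matches the failure in codegen_c.cc above.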
Then I tried:
    python local_build.py --model /path/to/llama2 --target llvm --quantization q4f16_1
This worked and generated a "…-cpu.so" library. However, that library doesn't work in MLC-Chat and throws an error:
> tvm._ffi.base.TVMError: Traceback (most recent call last):
> 9: mlc::llm::LLMChatModule::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#5}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
> at /workspace/mlc-llm/cpp/llm_chat.cc:1487
> 8: mlc::llm::LLMChat::PrefillStep(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool, mlc::llm::PlaceInPrompt, tvm::runtime::String)
> at /workspace/mlc-llm/cpp/llm_chat.cc:842
> 7: mlc::llm::LLMChat::ForwardTokens(std::vector<int, std::allocator<int> >, long)
> at /workspace/mlc-llm/cpp/llm_chat.cc:1198
> 6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
> 5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
> 4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
> 3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
> 2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
> 1: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
> 0: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
> File "/.../tvm-unity/src/runtime/library_module.cc", line 78
> TVMError: Assert fail: T.tvm_struct_get(p_lv, 0, 10, "int32") == 1, Argument fused_fused_decode1_take.p_lv.device_type has an unsatisfied constraint: 1 == T.tvm_struct_get(p_lv, 0, 10, "int32")
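If I read the assert correctly, the generated function expects its p_lv argument on device_type 1 (CPU), so MLC-Chat seems to hand the CPU-built library a tensor on another device. As a sanity check outside of MLC-Chat I can at least load the library with the Relax VM on CPU; this is only a sketch with a made-up path, and I don't know the exact calling convention of the generated functions:

    # Load the llvm-built "-cpu.so" directly and create a Relax VM on the CPU
    # device, bypassing MLC-Chat. Input preparation, KV-cache setup and the
    # function names to call are omitted.
    import tvm
    from tvm import relax

    lib = tvm.runtime.load_module("/path/to/llama2-q4f16_1-cpu.so")  # hypothetical path
    vm = relax.VirtualMachine(lib, tvm.cpu())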
I nevertheless tried export_model_library_format() by adding it to local_core.py, but got an error:
> File "/.../python3.8/site-packages/tvm/micro/model_library_format.py", line 663, in export_model_library_format > raise NotImplementedError( > NotImplementedError: Don't know how to export module of type <class 'tvm.relax.vm_build.Executable'>
This error is understandable, of course.
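For completeness, the call I added in local_core.py was essentially the following (a sketch; `ex` stands for the relax Executable returned by the build there):

    # Attempted export of the relax build result; this is what raises the
    # NotImplementedError quoted above.
    from tvm import micro

    micro.export_model_library_format(ex, "llama2_q4f16_1.tar")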
The target=c / target=llvm path is not covered in any of the examples, and I've read that the main focus of MLC-LLM is on GPUs; I only found this possibility while going through the GitHub code. Is there a workaround to make this work? Or, at least, a way to extract a Model Library Format archive for a non-dynamic input (without Relax), but for a quantized language model?
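To make that last question concrete: the fallback I have in mind would mirror the microTVM flow from the top of this post, just with a quantized language model behind it. Everything below is an assumption on my part, and the open question is precisely where a static-shape, q4f16_1-quantized relay_mod and params would come from:

    # Hypothetical fallback: a static-shape Relay build with the C runtime and
    # AOT executor, exported as Model Library Format. relay_mod and params are
    # placeholders for a quantized Llama 2 import I don't know how to obtain.
    import tvm
    from tvm.relay.backend import Executor, Runtime
    from tvm.micro import export_model_library_format

    runtime = Runtime("crt")
    executor = Executor("aot")
    target = tvm.target.Target("c -device=cpu")
    with tvm.transform.PassContext(opt_level=3):
        built = tvm.relay.build(
            relay_mod, target=target, params=params, runtime=runtime, executor=executor
        )
    export_model_library_format(built, "llama2_static.tar")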
P.S. I'm aware of llama.cpp, which could probably give me what I need, but I wanted to try this in TVM first.