MLC-LLM VM Profiler

Hi there,

I am trying to get Llama-2 running through mlc_chat_cli to work with the VirtualMachineProfiler so I can get per-op profiling. Judging from the reply here, this should be possible.

So far, I have changed the following:

auto fload_exec = executable->GetFunction("vm_profiler_load_executable");
// [...]
PackedFunc profile_func_ = vm_->GetFunction("profile");
// [...]
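
For reference, here is a fuller sketch of that setup. The vm_initialization call, the AllocatorType value, and member names like device_ follow what I see in llm_chat.cc, but the exact signature may differ between TVM versions, so treat this as an approximation:

// Load the executable into a profiler-enabled relax VM.
auto fload_exec = executable->GetFunction("vm_profiler_load_executable");
ICHECK(fload_exec.defined()) << "executable does not expose the profiler loader";
vm_ = fload_exec();
// Initialize the VM for the target device (arguments may vary by TVM version).
vm_->GetFunction("vm_initialization")(static_cast<int>(device_.device_type), device_.device_id,
                                      static_cast<int>(relax_vm::AllocatorType::kPooled));
// The "profile" PackedFunc takes the entry-function name plus its arguments
// and returns a tvm::runtime::profiling::Report.
PackedFunc profile_func_ = vm_->GetFunction("profile");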

Then, in order to profile an operation, from what I understand (from here), I need to call the profile function with the name of the function I want to profile as the first argument, followed by the original arguments. However, so far I have had no real success.

I have tested against the decode function as well as softmax_with_temperature, but I am getting issues with the passed arguments.

The code I am trying looks like this:

NDArray Softmax(NDArray input, float temperature) {
// [...]
    tvm::runtime::profiling::Report report =
        profile_func_("softmax_with_temperature", input, temperature_arr);
    std::cout << report->AsTable() << "\n";

The error I am getting is:

mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1890: InternalError: Check failed: type_code_ == kTVMObjectHandle (11 vs. 8) : expected Object but got str

My questions:

  1. Is there a way to get a per-op report for the whole execution graph? If so, how?
  2. If not, do I need to do the profiling per PackedFunc exposed through the executable?
  3. In that case, am I doing something wrong with the passed arguments?

Thanks in advance.

To get per-op profiling info, you can initialize your VM like this:

vm = rx.VirtualMachine(exe, dev, profile=True)

Then you can do the profiling:

report = vm.module["profile"]("FuncToProfile", input_args)

or, equivalently:

report = vm.profile("FuncToProfile", input_args)

Thanks @JackWw for the reply. However, your answer describes the Python API.

I am using the C++ API straight from the mlc_chat_cli app (from here).
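
As far as I can tell, the C++ counterpart of your snippet is what I already have; hypothetically (the "decode" name and the input_tokens argument are just placeholders of mine):

// Fetch the profiling entry point from the profiler-enabled VM module.
PackedFunc profile = vm_->GetFunction("profile");
// Counterpart of vm.module["profile"]("FuncToProfile", input_args) in Python.
tvm::runtime::profiling::Report report = profile("decode", input_tokens);
std::cout << report->AsTable() << "\n";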

Try passing the PackedFunc softmax_func_, which is a member of LLMChat, instead of the string “softmax_with_temperature”.

If you want to get a per-op report, try the script here.

@JackWw Could you please post the per-op report script again? The link is not there anymore. Thanks!

evaluate.py was deleted, but you can find it by checking the git history.

I have managed to make this work with the vm_profiler, using the profile function as a wrapper over the individual operations (e.g. prefill, softmax, etc.) and passing the function names as strings.
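
For reference, a minimal sketch of what my wrapper looks like (names follow llm_chat.cc; wrapping the name in tvm::runtime::String is my own choice, since String is a TVM Object and so passes the kTVMObjectHandle check from the error above):

NDArray Softmax(NDArray input, NDArray temperature_arr) {
  // Profiling runs the named function and returns a Report; it does not
  // return the function's outputs, so the real call happens separately.
  tvm::runtime::profiling::Report report =
      profile_func_(tvm::runtime::String("softmax_with_temperature"), input, temperature_arr);
  std::cout << report->AsTable() << "\n";
  return softmax_func_(input, temperature_arr);
}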

While this works fine on M1 (Metal), Android seems to be missing events. Is this something you are aware of?

@specter Could you post the profiling code that works for you? I have a similar issue now.

I tried to add a profiler into llm_chat.cc like below:

NDArray Softmax(NDArray input, NDArray temperature_arr) {
  NDArray ret;
  tvm::runtime::profiling::Report report =
      ft_.profile_func_(ft_.softmax_func_, input, temperature_arr);
  std::cout << "Softmax function " << std::endl;
  try {
    ret = ft_.softmax_func_(input, temperature_arr);
  } catch (const dmlc::Error& e) {
    // This branch is for compatibility:
    // The old softmax function takes temperature arr with shape (),
    // and the new softmax func takes temperature arr with shape (1,).
    // Remove this branch after updating all prebuilt model libraries.
    temperature_arr = temperature_arr.CreateView({}, temperature_arr->dtype);
    ret = ft_.softmax_func_(input, temperature_arr);
  }
  std::cout << report->AsTable() << std::endl;
  return ret;
}

I don't see it get called when I run the mlc_chat_cli application. I used OpenCL/GPU. Could you tell me how you did it?

Thanks, Yi