I am trying to get llama-2 running over mlc_chat_cli with VirtualMachineProfiler to get per-op profiling.
Judging from the reply here, this should be possible.
Then, in order to profile an operation, from what I understand (from here), I need to call the profile function with the name of the function I want to profile as the first argument and the rest of the arguments unchanged. So far, however, I have had no real success.
I have tested this against the decode function as well as softmax_with_temperature, but I am getting errors with respect to the passed arguments.
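For reference, here is a minimal sketch of what I am attempting. It assumes the executable is loaded through the profiler variant of the relax VM ("vm_profiler_load_executable"), which exposes a "profile" packed function; the function name "decode" and the input_tokens argument are placeholders for whatever the model library actually expects.

#include <iostream>
#include <tvm/runtime/module.h>
#include <tvm/runtime/ndarray.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/profiling.h>

// Sketch only: assumes `vm` was loaded via "vm_profiler_load_executable"
// so that it exposes a "profile" packed function.
tvm::runtime::profiling::Report ProfileDecode(tvm::runtime::Module vm,
                                              tvm::runtime::NDArray input_tokens) {
  tvm::runtime::PackedFunc profile = vm.GetFunction("profile");
  // First argument: the name of the VM function to profile.
  // Remaining arguments: forwarded to that function unchanged.
  tvm::runtime::profiling::Report report = profile("decode", input_tokens);
  std::cout << report->AsTable() << std::endl;
  return report;
}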
I have managed to make this work with the vm_profiler, using the profile function as a wrapper over the individual operations (e.g. prefill, softmax, etc.) and passing the function names as strings.
While this works fine on M1 (Metal), Android seems to be missing events. Is this something you are aware of?
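Concretely, my wrapper looks roughly like the sketch below (same profiler-enabled VM module as in the previous snippet). ProfileOp is just a hypothetical helper name, and the arguments in the usage comment are placeholders.

#include <string>
#include <utility>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/profiling.h>

// Hypothetical helper: forwards an operation name (as a string) and that
// operation's usual arguments to the VM's "profile" packed function.
template <typename... Args>
tvm::runtime::profiling::Report ProfileOp(tvm::runtime::Module vm,
                                          const std::string& name,
                                          Args&&... args) {
  tvm::runtime::PackedFunc profile = vm.GetFunction("profile");
  return profile(name, std::forward<Args>(args)...);
}

// Usage (placeholder arguments):
//   auto prefill_report = ProfileOp(vm, "prefill", tokens, kv_cache, params);
//   auto softmax_report = ProfileOp(vm, "softmax_with_temperature", logits, temp);
//   std::cout << prefill_report->AsTable() << std::endl;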
I tried to add the profiler into llm_chat.cc as shown below:
NDArray Softmax(NDArray input, NDArray temperature_arr) {
  NDArray ret;
  // Run the profiled version of the softmax function and keep the report.
  tvm::runtime::profiling::Report report =
      ft_.profile_func_(ft_.softmax_func_, input, temperature_arr);
  std::cout << "Softmax function" << std::endl;
  try {
    ret = ft_.softmax_func_(input, temperature_arr);
  } catch (const dmlc::Error& e) {
    // This branch is for compatibility:
    // the old softmax function takes a temperature array with shape (),
    // and the new softmax function takes a temperature array with shape (1,).
    // Remove this branch after updating all prebuilt model libraries.
    temperature_arr = temperature_arr.CreateView({}, temperature_arr->dtype);
    ret = ft_.softmax_func_(input, temperature_arr);
  }
  std::cout << report->AsTable() << std::endl;
  return ret;
}
I don’t see it being called when I run the mlc_chat_cli application. I am using OpenCL on the GPU. Could you tell me how you did it?