I have been trying to measure the latencies of the nodes in the compute graph of an LLM. I am following the Optimizing LLM example, and I am using the VM instrument to get time evaluations of the functional operators at each stage of deployment:
```python
if self.time_eval and name not in self.time_eval_results:
    res = self.mod.time_evaluator(
        name,
        self.device,
        number=20,
        repeat=3,
        min_repeat_ms=100,
        # cache_flush_bytes=256 * 10**6
    )(*new_args)
    self.time_eval_results[name] = (res.mean, 1)
    self.node_arguments[name] = ref_args_np
    print(f"Time-eval result {name} on {self.device}:\n {res}")
```
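Once the instrument has populated `self.time_eval_results`, the per-operator means can be summed and ranked to see which operators dominate. A minimal, self-contained sketch (the dict shape `{name: (mean_seconds, count)}` mirrors the code above; the function name is mine):

```python
def summarize(time_eval_results):
    """Sort operators by mean latency and report totals in milliseconds."""
    rows = sorted(
        ((name, mean * 1e3) for name, (mean, _count) in time_eval_results.items()),
        key=lambda r: r[1],
        reverse=True,
    )
    total = sum(ms for _name, ms in rows)
    for name, ms in rows:
        print(f"{name:30s} {ms:8.3f} ms  ({100 * ms / total:5.1f}%)")
    # Note: this is a *serialized* sum, so it overestimates any stage
    # where operators actually run in parallel on the device.
    print(f"{'total (serialized sum)':30s} {total:8.3f} ms")
    return total
```

The caveat in the comment is exactly the problem I describe below for prefill.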
However, for the prefill stage, which involves a lot of parallel execution, I cannot get an accurate picture of the overall latency just by looking at the latencies of the individual functional operators:
```python
start_time = time.time()
logits, kv_caches = self._prefill(embedding, input_len)
prefill_time = (time.time() - start_time) * 1000
```
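As a side note on the wall-clock measurement above: `time.perf_counter()` is the monotonic clock intended for measuring intervals, and if `_prefill` launches asynchronous device work, the device should be synchronized before reading the stop time, otherwise the measurement only covers kernel launch. A small helper sketch (the `sync` callable is an assumption, e.g. a device synchronize such as TVM's `device.sync()`):

```python
import time


def timed_ms(fn, *args, sync=None):
    """Call fn(*args) and return (result, elapsed milliseconds).

    sync: optional callable invoked before reading the stop time so that
    asynchronous GPU work launched by fn is included in the measurement.
    """
    start = time.perf_counter()
    result = fn(*args)
    if sync is not None:
        sync()  # e.g. self.device.sync() for a TVM device (assumption)
    return result, (time.perf_counter() - start) * 1000.0
```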
In this case a kernel-level view would be very useful, but I am struggling to obtain one. Is there a way to see the kernel-level implementation from the IR, which could then be used to predict the function latency?
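For what I have found so far: in TVM, the compiled runtime module typically carries the generated device code as imported modules, and calling `get_source()` on those returns the kernel source (e.g. CUDA C) that was actually compiled. This is a hedged sketch based on the generic TVM runtime `Module` API, not on the exact example code, and it degrades gracefully when TVM is not installed:

```python
# Sketch (assumption): walk the imported device modules of a compiled
# TVM runtime module and collect their generated kernel source.
try:
    import tvm  # noqa: F401
except ImportError:
    tvm = None


def dump_kernel_source(rt_mod):
    """Return the concatenated device kernel source of a compiled runtime module."""
    sources = []
    for imported in rt_mod.imported_modules:
        # get_source() returns the generated code for device modules
        # (e.g. CUDA C for a GPU target).
        sources.append(imported.get_source())
    return "\n".join(sources)


if tvm is None:
    print("TVM not available; this sketch requires a built TVM runtime module.")
```

The lowered TIR before codegen can also be printed (e.g. via the IRModule's `show()`), which may be a better basis for a latency model than the raw device source, since it still carries the loop structure and buffer shapes.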