Parallel Execution in LLMs

I have been trying to measure the latencies of the individual nodes in the compute graph of an LLM. I am following the Optimizing LLM example and using the VM instrument hook to get time evaluations of the functional operators at each stage of deployment.

    # Benchmark each packed function the first time the instrument hook sees it.
    if self.time_eval and name not in self.time_eval_results:
        res = self.mod.time_evaluator(
            name,
            self.device,
            number=20,
            repeat=3,
            min_repeat_ms=100,
            # cache_flush_bytes=256 * 10**6
        )(*new_args)
        self.time_eval_results[name] = (res.mean, 1)
        self.node_arguments[name] = ref_args_np
        print(f"Time-eval result {name} on {self.device}:\n {res}")

However, for the prefill stage, which involves a lot of parallel execution, I cannot get an accurate picture of the overall latency just by looking at the latencies of the individual functional operators.

    start_time = time.time()
    logits, kv_caches = self._prefill(embedding, input_len)
    prefill_time = (time.time() - start_time) * 1000
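
One variant that might make the wall-clock number more comparable to the per-operator results is to synchronize the device around the call (a sketch, assuming `self.device` is the `tvm.runtime.Device` used in the time_evaluator snippet above):

    import time

    # Sketch: bracket the prefill call with device syncs so the elapsed wall-clock
    # time covers all asynchronously launched kernels (assumes self.device is a
    # tvm.runtime.Device, as above).
    self.device.sync()                    # drain any pending work first
    start_time = time.perf_counter()
    logits, kv_caches = self._prefill(embedding, input_len)
    self.device.sync()                    # wait for all prefill kernels to finish
    prefill_time = (time.perf_counter() - start_time) * 1000  # ms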

In this case, looking at the kernel-level implementation would be very useful, but I am struggling to get at it. Is there a way to see the kernel-level implementation from the IR, so that it can be used to predict the latency of the function?
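
To make the question concrete, the kind of inspection I am after looks roughly like this. It is only a sketch of what I expect to be possible; I am assuming `mod` is the IRModule after the lowering passes and `ex` is the Executable from `relax.build()`, with the generated CUDA attached as an imported runtime module:

    import tvm

    # `mod`: the IRModule after scheduling/lowering; `ex`: the Executable from relax.build().
    # Both names are assumptions about how the build pipeline is set up.

    # Per-kernel TIR: each GPU kernel is a tir.PrimFunc whose loop structure and
    # thread bindings could feed a latency prediction.
    for gvar, func in mod.functions.items():
        if isinstance(func, tvm.tir.PrimFunc):
            print(gvar.name_hint)
            print(func.script())

    # Generated device code: the CUDA source sits in an imported runtime module.
    for imported in ex.mod.imported_modules:
        if imported.type_key == "cuda":
            print(imported.get_source())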