[Question] Why are BERT-base GPU float32 latencies the same at batch sizes 1, 16, and 64?

import time

start = time.time()
m.run()  # on a GPU this only launches the kernels asynchronously
end = time.time()

I measure the time of run(), but the latencies are all the same. Does the module already start running when set_input is called?

Batch 1 costs 0.002 seconds, batch 16 costs 0.00213 seconds, and batch 64 costs 0.00245 seconds. That seems weird.

The statistics were measured on a T4.

You should also include the time to fetch the output, to make sure the device is synced, as we don't have a device barrier API on the Python side.
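
For example, a minimal sketch of a timing helper that includes the output fetch. This assumes m is a compiled GraphModule whose inputs were already set via set_input; timed_run and out_index are made-up names for illustration, and on older TVM versions the host copy is .asnumpy() instead of .numpy():

import time

def timed_run(m, out_index=0):
    """Time one inference, including the device-to-host output copy."""
    start = time.time()
    m.run()
    # get_output(...).numpy() copies the result back to the host and
    # blocks until the GPU has actually finished, so end - start covers
    # the whole inference, not just the asynchronous kernel launches.
    result = m.get_output(out_index).numpy()
    end = time.time()
    return result, end - start

With the copy included in the measurement, the reported latency should grow with the batch size as expected.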