[microTVM] End-to-End Inference Timing with AoTExecutor

Hi all,

Short question: is there an example of how to measure the inference time of workloads with the microTVM AoT Executor? The old blog post benchmark seems to be deprecated with respect to the latest microTVM developments.

When checking the generated code, there seem to be timing functions available, but the existing module.benchmark() is not supported.

PS: Using the GraphDebugExecutor is not an option.

@areusch this is probably your ballpark? :wink:

You should take a look at the helper functions in tvm.micro.testing.evaluation - they’re built for this exact use case.
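
For reference, a rough sketch of what the helper-based flow looks like (the board name, model, and sample below are placeholders, not something specific to your setup):

# mod, params = ...  (your Relay module and params)
# sample = ...
from tvm.runtime.executor.aot_executor import AotModule
from tvm.micro.testing import create_aot_session, get_target, predict_labels_aot

target = get_target("zephyr", "nucleo_l4r5zi")  # placeholder platform/board
with create_aot_session("zephyr", "nucleo_l4r5zi", target, mod, params) as session:
    aot_executor = AotModule(session.create_aot_executor())
    # predict_labels_aot yields (predicted_label, runtime_in_seconds) per sample
    label, runtime = next(predict_labels_aot(session, aot_executor, [sample]))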

If you don’t want to use the helper functions, you could also use time_evaluator with AOT (which is what the helper functions do underneath):

# project = tvm.micro.generate_project...
# sample = ...

# Build and flash the project, and establish a session
project.build()
project.flash()
session = tvm.micro.Session(project.transport())
aot_executor = tvm.runtime.executor.aot_executor.AotModule(session.create_aot_executor())

# Load the sample and check runtime
aot_executor.get_input(0).copyfrom(sample)
result = aot_executor.module.time_evaluator("run", session.device)()
output = aot_executor.get_output(0).numpy()
runtime = result.mean # runtime is given as a float in seconds
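
If you want the timing averaged over several runs, time_evaluator also takes the usual number and repeat arguments; keep them small on a microcontroller so one batch of runs still finishes within the session's read timeout:

# Average over 3 repeats of 10 runs each (small values, so each batch fits in the timeout)
result = aot_executor.module.time_evaluator("run", session.device, number=10, repeat=3)()
print(result.mean)     # mean runtime in seconds
print(result.results)  # per-repeat timings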

Thanks for the reply @guberti. I tried both the helper functions and time_evaluator, and both result in a timeout:

[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote: microTVM Zephyr runtime - running
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote: DumpMetadata:
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote:    mod_name=tvmgen_default
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote:    version=1
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote:    num_inputs=2
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote:    num_outputs=1
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote:    num_workspace_pools=0
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote:    num_constant_pools=0
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote:    input[0]: data
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote:    input[1]: weight1
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote:    output[0]: output0
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote: input allocate[0]: data
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote: input allocate[1]: weight1
[14:52:14] /tvmpath/tvm/src/runtime/micro/micro_session.cc:368: remote: output allocate[0]: output0

Traceback (most recent call last):
  File "micro_autotune.py", line 283, in <module>
    timing = aot_executor.module.time_evaluator("run", session.device)()
  File "/tvmpath/tvm/python/tvm/runtime/module.py", line 357, in evaluator
    blob = feval(*args)
  File "/tvmpath/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: MicroSessionTimeoutError: failed to read reply message after timeout 5s

This issue appears even for trivial networks (single layer, like in the micro_autotune.py example).

I also started to play around with testing.micro.zephyr.test_utils to try to evaluate a model using the aot_standalone_demo. The init_wait works and receives the expected response from the board (wakeup\n). However, retrieving the results also fails with a timeout.

There is clearly some inconsistency in how things are set up. On my STM32h743zi some functionality works:

  • AutoTuning works
  • AoTExecutor without the time_evaluator() works. Correct results are also transmitted back to the host.

However

  • DebugGraphExecutor fails for anything beyond a trivial network (I suspect issues with pre-quantized TFLite models)
  • AoTExecutor.time_evaluator() fails with timeout.

Whenever latency is reported back to the host I run into timeout issues (even for trivial programs), except during AutoTuning, which seems to work as intended.

What would be the best places to start digging into this communication @guberti @areusch ?

To clarify, are you using microTVM with a Zephyr board (e.g. the NUCLEO H743ZI)? If so, which one? It’s possible you’re running out of stack space on your board.

Here’s a more thorough example showing how a project_options parameter can be passed to create_aot_session to manually set the stack size. I’ve verified this works on the NUCLEO-L4R5ZI - let me know what output you see.

import numpy as np
from tvm.runtime.executor.aot_executor import AotModule
from tvm.testing.utils import fetch_model_from_url
from tvm.micro.testing import create_aot_session, get_target, predict_labels_aot

PLATFORM = "zephyr"
# Change "nucleo_l4r5zi" to whatever your dev board is!
BOARD = "nucleo_l4r5zi"

URL = "https://github.com/tensorflow/tflite-micro/raw/main/tensorflow/lite/micro/examples/micro_speech/micro_speech.tflite"
SHA256 = "09e5e2a9dfb2d8ed78802bf18ce297bff54281a66ca18e0c23d69ca14f822a83"
mod, params = fetch_model_from_url(URL, "tflite", SHA256)


target = get_target(PLATFORM, BOARD)
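# Increase the main stack size of the generated Zephyr project; too small a
# stack is a likely cause of the timeouts discussed above.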
options = {"config_main_stack_size": 4096}

with create_aot_session(
    PLATFORM, BOARD, target, mod, params, project_options=options
) as session:
    aot_executor = AotModule(session.create_aot_executor())
    sample = np.random.randint(low=-127, high=128, size=(1, 1960), dtype=np.int8)
    predictions = predict_labels_aot(session, aot_executor, [sample], runs_per_sample=10)

    _label, runtime = next(predictions)
    assert runtime < 1
    print(runtime)

Note the runtime we get will be pretty slow, as we aren’t doing any autotuning. If we liked, we could fix this by passing a tune_logs keyword argument to create_aot_session.
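
For completeness, a hedged sketch of that tuning path (assuming tune_model from tvm.micro.testing returns logs in a form create_aot_session accepts; the trial count below is arbitrary):

from tvm.micro.testing import tune_model

# On-device autotuning first (this can take a while)
tune_logs = tune_model(PLATFORM, BOARD, target, mod, params, num_trials=100)

# Reuse the tuning logs so the tuned schedules are applied when building the project
with create_aot_session(
    PLATFORM, BOARD, target, mod, params, tune_logs=tune_logs, project_options=options
) as session:
    ...  # same measurement code as above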


Now it works, thanks!

However, what bugs me a bit is the setup: the code you posted in your first example would always fail with a timeout (even though I had allocated enough memory). Only with the wrapper functions from your second example did I get it to work. How to work with these interfaces is not well documented, in my opinion.

Glad I could help!

You’re right about time_evaluator being poorly documented - it irks me how this function takes literally nine parameters, and how it requires you to guess the name of the internal function that does the thing you want (in this case, "run").

That said, the code I posted in the first example should work, so long as the Zephyr stack size is set correctly. If you’re willing to share a code snippet that reproduces the problem, I would love to take a look.