I’m not as familiar with the Zephyr bindings, but microTVM also supports (via Arduino) the Teensy 4.0 and 4.1, which have the same IMXRT1060 chip as your mimxrt1060_evk, and I’ve used TVM with both in the past.
To help reproduce the problem, I wrote up a short script on Colab that builds mobilenet_v1_0.25_128_quant.tflite for the IMXRT1060, using microTVM’s Arduino bindings and the AOT executor. Using the AOT executor instead of the graph executor allows us to detect the inevitable memory overflow at compile time, instead of at runtime.
First, it should not be necessary to set 'tir.usmp.enable': True just to get a board working. Your issue is where your variables are being stored. To briefly summarize the datasheet, the IMXRT1060 has 1 MB of OCRAM (on-chip RAM). Half of it is 512 KB of tightly coupled memory, split between ITCM and DTCM (the split can be set in 32 KB increments), and the other half is 512 KB of regular, general-purpose OCRAM. Along with 128 KB of boot ROM (which isn’t relevant here), this is all the memory on the chip.
Importantly, this means the IMXRT1060 has no flash memory onboard (instead, flash is located on an external chip and connected via one of the memory interfaces). With this memory layout, the IMXRT1060’s toolchain can often make interesting (bad) choices about where to store data, and will try really hard to put all variables, including static ones, into TCRAM (tightly-coupled RAM, the first 512 KB block mentioned above). This is what’s causing your memory issue: the compiler is trying to store all variables inside DTCM, which can be at most 512 KB (and is in practice less, since part of that block goes to ITCM).
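To make that budget concrete, here is a minimal C sketch of the arithmetic, assuming the 512 KB and 32 KB figures quoted above. The function names are made up for illustration and are not part of TVM or the Teensy core.

```c
#include <assert.h>
#include <stdint.h>

/* Figures taken from the summary above; treat them as assumptions. */
#define FLEXRAM_TOTAL_KB 512u /* tightly coupled block shared by ITCM and DTCM */
#define FLEXRAM_BANK_KB 32u   /* granularity of the ITCM/DTCM split */

/* DTCM left over (in KB) once itcm_banks banks of 32 KB are given to ITCM. */
static uint32_t dtcm_kb_available(uint32_t itcm_banks) {
    assert(itcm_banks <= FLEXRAM_TOTAL_KB / FLEXRAM_BANK_KB);
    return FLEXRAM_TOTAL_KB - itcm_banks * FLEXRAM_BANK_KB;
}

/* Nonzero if data_bytes of variables would fit in the remaining DTCM. */
static int fits_in_dtcm(uint32_t data_bytes, uint32_t itcm_banks) {
    return data_bytes <= dtcm_kb_available(itcm_banks) * 1024u;
}
```

For example, if the code needs four banks of ITCM, only 384 KB of DTCM remains, so a model whose weights and workspace together exceed that cannot link if everything is placed there.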
The solution? Store the variables in places that make sense. Static variables (like the model weights in default_lib2.c) should go on the external flash chip, which can be done by specifying PROGMEM. Our AOT memory array should go in regular OCRAM, which can be done by specifying DMAMEM, e.g.
    DMAMEM uint8_t g_aot_memory[WORKSPACE_SIZE]
        __attribute__((aligned(TVM_RUNTIME_ALLOC_ALIGNMENT_BYTES)));
You can easily verify that these changes help: the Colab example above reports the amount by which the DTCM region is exceeded. Making either one of these changes will reduce that overflow, and making both will eliminate it.
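Putting the two placements side by side, a minimal sketch might look like the following. The array names, sizes, and the alignment value are hypothetical stand-ins (not what TVM generates), and the empty fallback macros exist only so the snippet also compiles off-target. On a Teensy 4.x, PROGMEM and DMAMEM come from the Arduino core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The Teensy core defines these; the empty fallbacks are only for host builds. */
#ifndef PROGMEM
#define PROGMEM
#endif
#ifndef DMAMEM
#define DMAMEM
#endif

#define EXAMPLE_WORKSPACE_SIZE 4096u /* hypothetical workspace size */
#define EXAMPLE_ALIGNMENT 16u        /* hypothetical stand-in for TVM_RUNTIME_ALLOC_ALIGNMENT_BYTES */

/* Read-only data (like model weights): PROGMEM keeps it on the external flash. */
static const int8_t example_weights[8] PROGMEM = {1, -2, 3, -4, 5, -6, 7, -8};

/* Scratch memory (like the AOT workspace): DMAMEM places it in regular OCRAM. */
DMAMEM uint8_t example_workspace[EXAMPLE_WORKSPACE_SIZE]
    __attribute__((aligned(EXAMPLE_ALIGNMENT)));

/* Sums the weights, reading them through ordinary pointer access. */
static int32_t example_weight_sum(void) {
    int32_t sum = 0;
    for (size_t i = 0; i < sizeof(example_weights); ++i) {
        sum += example_weights[i];
    }
    return sum;
}

/* Checks that the workspace honours the requested alignment. */
static void example_check_workspace(void) {
    assert(((uintptr_t)example_workspace % EXAMPLE_ALIGNMENT) == 0);
}
```

Neither array counts against DTCM here, which is the whole point: only variables without one of these qualifiers should still end up in the tightly coupled block.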
This is not a good long-term solution, however, as it requires you to mess with the compiler’s output. @alanmacd is doing work on memory pools that will let us make these decisions automatically, but it will not be ready for a little while.