Hi,
I’m trying to deploy a ViT-based PyTorch model onto the CYSBSYSKIT-DEV-01, which is a Cortex-M4 MCU. I can already flash the program successfully with the template model by following this tutorial. However, when I try to use my own model, I run into what looks like a memory problem. Intuitively, my guess is that the issue comes from the ViT model itself, since the way multi-head attention is computed can consume a lot of memory (the query, key, and value tensors…).
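To make that guess a bit more concrete, here is my rough back-of-the-envelope estimate of the attention intermediates (these numbers are my own assumption about where the memory would go, using the sequence length and embedding dimension from the summary below):

# Rough size of the per-layer self-attention intermediates, assuming float32
# activations, sequence length 257 and embedding dim 8 (from the summary below).
seq_len, emb_dim, bytes_per_elem = 257, 8, 4

qkv_bytes = 3 * seq_len * emb_dim * bytes_per_elem   # query/key/value activations
scores_bytes = seq_len * seq_len * bytes_per_elem    # the 257 x 257 attention score matrix

print(f"Q/K/V: {qkv_bytes / 1024:.1f} KiB")                # ~24 KiB
print(f"attention scores: {scores_bytes / 1024:.1f} KiB")  # ~258 KiB

Even so, I would expect the attention intermediates to stay in the hundreds of KiB per layer, which makes the workspace size I report below all the more confusing to me.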
This is the model:
=========================================================================
Layer (type:depth-idx)                        Output Shape        Param #
=========================================================================
ResNetTransformer                             [1, 2]              --
├─Sequential: 1-1                             [1, 257, 8]         --
│    └─Conv2d: 2-1                            [1, 16, 128, 128]   144
│    └─BatchNorm2d: 2-2                       [1, 16, 128, 128]   32
│    └─ReLU: 2-3                              [1, 16, 128, 128]   --
│    └─MaxPool2d: 2-4                         [1, 16, 128, 64]    --
│    └─ResidualStack: 2-5                     [1, 16, 128, 64]    --
│    │    └─Sequential: 3-1                   [1, 16, 128, 64]    4,672
│    └─ResidualStack: 2-6                     [1, 8, 64, 64]      --
│    │    └─Sequential: 3-2                   [1, 8, 64, 64]      1,760
│    └─PatchEmbedding: 2-7                    [1, 257, 8]         2,064
│    │    └─Sequential: 3-3                   [1, 256, 8]         1,032
│    └─TransformerEncoder: 2-8                [1, 257, 8]         --
│    │    └─TransformerEncoderBlock: 3-4      [1, 257, 8]         600
├─ClassificationHead: 1-2                     [1, 2]              --
│    └─Reduce: 2-9                            [1, 8]              --
│    └─LayerNorm: 2-10                        [1, 8]              16
│    └─Linear: 2-11                           [1, 2]              18
├─Softmax: 1-3                                [1, 2]              --
=========================================================================
Total params: 10,338
Trainable params: 10,338
Non-trainable params: 0
Total mult-adds (M): 47.45
=========================================================================
Input size (MB): 0.07
Forward/backward pass size (MB): 9.60
Params size (MB): 0.03
Estimated Total Size (MB): 9.70
=========================================================================
This is how I transform the model:
# Imports used by this snippet (added here for completeness).
import pathlib

import tvm
from tvm import relay
from tvm.micro import export_model_library_format
from tvm.relay.backend import Runtime
from tvm.relay.op.contrib import cmsisnn

# `scripted_model`, `shape_list`, and `name` are defined earlier in my script (omitted here).
mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)

# We can use TVM native schedules or rely on the CMSIS-NN kernels using the
# TVM Bring-Your-Own-Codegen (BYOC) capability.
USE_CMSIS_NN = False

# USMP (Unified Static Memory Planning) performs memory planning of all tensors
# holistically to achieve the best memory utilization.
DISABLE_USMP = False

# Use the C runtime (crt).
RUNTIME = Runtime("crt")

# We define the target by passing the board name to `tvm.target.target.micro`.
# If the board is not in the list of supported boards, the target can be defined manually:
TARGET = tvm.target.Target("c -keys=arm_cpu,cpu -mcpu=cortex-m4")
# TARGET = tvm.target.target.micro("stm32l4r5zi")

# Use the AOT executor rather than the graph or vm executors.
# Use the unpacked API and C calling style.
EXECUTOR = tvm.relay.backend.Executor(
    "aot", {"unpacked-api": True, "interface-api": "c", "workspace-byte-alignment": 8}
)

# Now, we set the compilation configurations and compile the model for the target:
config = {"tir.disable_vectorize": True}
if USE_CMSIS_NN:
    config["relay.ext.cmsisnn.options"] = {"mcpu": TARGET.mcpu}
if DISABLE_USMP:
    config["tir.usmp.enable"] = False

relay_mod = mod
with tvm.transform.PassContext(opt_level=3, config=config):
    if USE_CMSIS_NN:
        # When we are using CMSIS-NN, TVM searches for patterns in the
        # relay graph that it can offload to the CMSIS-NN kernels.
        relay_mod = cmsisnn.partition_for_cmsisnn(relay_mod, params, mcpu=TARGET.mcpu)
    lowered = tvm.relay.build(
        relay_mod, target=TARGET, params=params, runtime=RUNTIME, executor=EXECUTOR
    )

parameter_size = len(tvm.runtime.save_param_dict(lowered.get_params()))
print(f"Model parameter size: {parameter_size}")

# Pick a directory where the generated files will be saved.
BUILD_DIR = pathlib.Path("output")
BUILD_DIR.mkdir(exist_ok=True)

# Export the model as a Model Library Format tar file:
TAR_PATH = pathlib.Path(BUILD_DIR) / f"{name}.tar"
print(TAR_PATH)
export_model_library_format(lowered, TAR_PATH)
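For reference, this is how I am reading the workspace requirement. I am assuming the exported Model Library Format tar is the right place to look: it contains a metadata.json whose memory section is where I see the huge global_workspace figure (the exact key layout may differ between TVM versions):

import json
import tarfile

# Unpack the exported Model Library Format archive and read its metadata.
EXTRACT_DIR = BUILD_DIR / "mlf"
with tarfile.open(TAR_PATH) as tar:
    tar.extractall(EXTRACT_DIR)

with open(EXTRACT_DIR / "metadata.json") as f:
    metadata = json.load(f)

# Print the memory-related section if present, otherwise dump the whole file.
print(json.dumps(metadata.get("memory", metadata), indent=2))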
My PyTorch model is only about 41 KB (judging from the C codegen in default_lib0.c, which roughly matches the params size reported by the torch summary), but the global_workspace seems to require 3.9 GB. My questions are the following:
- As far as I understand from other posts, the AOT executor lets us allocate memory at “compile time”, and it seems that this memory is only freed after inference finishes. Am I understanding this correctly?
- Is there a way to reduce the size of global_workspace? Would it be reasonable to modify the codegen to allow memory reuse during inference (if my first guess is right…)? If anyone has done this kind of implementation or run into a similar problem, I would love to discuss it. (I have put a sketch of what I was planning to try next below.)
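In case it is relevant, this is the kind of tweak I was planning to try next. It is based on my (possibly wrong) assumption that keeping USMP enabled and switching its planning algorithm affects the planned global_workspace size, so please correct me if these are not the right knobs:

# My assumption: with USMP enabled, a different planning algorithm might change
# the reported global_workspace size. I have not verified that this helps.
config = {
    "tir.disable_vectorize": True,
    "tir.usmp.enable": True,
    "tir.usmp.algorithm": "hill_climb",  # alternatives: "greedy_by_size", "greedy_by_conflicts"
}

with tvm.transform.PassContext(opt_level=3, config=config):
    lowered = tvm.relay.build(
        relay_mod, target=TARGET, params=params, runtime=RUNTIME, executor=EXECUTOR
    )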
I look forward to your reply! Thanks!