[bug][runtime] A runtime bug when running two instances of the same NN model

samwyi · September 24, 2021, 9:34pm

An app can run two instances of the same NN model. In that case, only one copy of the DLL is loaded into memory, but two ModuleNodes are created, one for each instance. In CreateModuleFromLibrary(), the runtime writes to the DLL's data at the symbol runtime::symbol::tvm_module_ctx so that it points to the ModuleNode. Because there is only one copy of the DLL, the pointer at runtime::symbol::tvm_module_ctx is overwritten when the second model is loaded. Both models end up using the ModuleNode created by the later-loaded model.
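
The overwrite described above can be illustrated with a small, self-contained Python model. This is only a sketch of the mechanism, not TVM code: `SharedLibrary`, `ModuleNode`, `module_ctx`, and `create_module_from_library` are hypothetical stand-ins for the DLL image, the runtime's ModuleNode, the runtime::symbol::tvm_module_ctx slot, and CreateModuleFromLibrary().

```python
class ModuleNode:
    """Hypothetical stand-in for a per-instance runtime module with cached state."""

    def __init__(self, name):
        self.name = name
        self.pipeline_cache = {}  # models per-module cached pipelines


class SharedLibrary:
    """Hypothetical stand-in for a DLL image.

    The OS maps a given library into a process once, so its global
    context slot exists only once no matter how often it is "loaded".
    """

    _loaded = {}  # path -> already-mapped image

    def __init__(self, path):
        self.path = path
        self.module_ctx = None  # models the tvm_module_ctx pointer slot

    @classmethod
    def load(cls, path):
        # Loading the same path again returns the image that is already mapped.
        if path not in cls._loaded:
            cls._loaded[path] = cls(path)
        return cls._loaded[path]


def create_module_from_library(path, name):
    """Models the load step: create a new node and point the library at it."""
    lib = SharedLibrary.load(path)
    node = ModuleNode(name)
    lib.module_ctx = node  # overwrites whatever an earlier load stored here
    return lib, node


lib_a, node_a = create_module_from_library("model.dll", "instance A")
lib_b, node_b = create_module_from_library("model.dll", "instance B")

print(lib_a is lib_b)          # True: one image shared by both instances
print(lib_a.module_ctx.name)   # instance B: the first pointer was overwritten
```

In this model, any code in the library that looks up its context now reaches instance B's node, even when it is invoked on behalf of instance A.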

For the Vulkan runtime, this causes a Vulkan error because the Vulkan pipeline for each shader is cached inside the ModuleNode, so the two models end up sharing the same pipeline. While one model's shader is running on the GPU, the other model tries to update that shader's input.

I ran into this problem with the Vulkan runtime. I'm not sure whether other runtimes have a similar problem.

masahi · September 25, 2021, 6:49am

I don't think running two instances of the same model is safe. set_input, run, etc. are not thread safe. Is creating the same model multiple times acceptable? That's what I did when I wanted to run multiple inferences over different data in parallel.

samwyi · September 27, 2021, 4:33pm

Yes, renaming the second model's DLL to a different name avoids this problem. The memory footprint is larger, but this seems to be an easy workaround for now.
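
A minimal sketch of this workaround in Python, assuming that a copy of the library under a different file name is loaded as a separate image (which is what the renaming result above suggests). The helper name `copy_library_per_instance` and the `_0`, `_1` naming scheme are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_library_per_instance(lib_path, count, out_dir=None):
    """Copy a compiled model library once per instance, each under its own name.

    Returns the list of copied paths. Each copy costs extra disk space and,
    once loaded, extra memory.
    """
    src = Path(lib_path)
    if out_dir is None:
        out = Path(tempfile.mkdtemp(prefix="model_copies_"))
    else:
        out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    copies = []
    for i in range(count):
        # e.g. model.dll -> model_0.dll, model_1.dll
        dst = out / f"{src.stem}_{i}{src.suffix}"
        shutil.copy2(src, dst)
        copies.append(dst)
    return copies


# Usage (requires TVM and a real compiled library, so it is left commented out):
#   import tvm
#   paths = copy_library_per_instance("model.dll", 2)
#   modules = [tvm.runtime.load_module(str(p)) for p in paths]
```

Each loaded copy should then carry its own context pointer, at the cost of the larger memory footprint mentioned above.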