Currently, the VM `PooledAllocator` releases its pooled memory only when the underlying device fails to allocate more memory (see `pooled_allocator.h` at 553778885388a9eff4d611e1022baecd75c69088 in apache/tvm). This causes a program crash when doing repeated inference with dynamic batch sizes. See [Bug] PyTorch MaskRCNN GPU OOM error, apache/tvm issue #8233, for a minimal repro.
It seems there are two issues with it:
- `AllocDataSpace` can be called outside of `PooledAllocator`, by `NDArray::Empty(...)` (see `ndarray.cc` at 4d9bc9b4a3e9e8d3420efe60a52964fcd4c29c8d in apache/tvm). That call is not protected by try/catch, so if almost all memory is held by `PooledAllocator` and `NDArray::Empty` is called, the program crashes with the following error:
```
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [19:12:54] /home/masa/projects/dev/tvm/src/runtime/vulkan/vulkan_stream.cc:123:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-13: Unknown Vulkan error code
Stack trace:
0: tvm::runtime::vulkan::VulkanStream::Synchronize()
1: _ZN3tvm7runtime6vulkan15VulkanDeviceAPI13FreeDataSpac
2: tvm::runtime::NDArray::Internal::DefaultDeleter(tvm::runtime::Object*)
3: tvm::runtime::NDArray::CopyTo(DLDevice const&) const
4: tvm::runtime::vm::CopyTo(tvm::runtime::ObjectRef, DLDevice const&)
5: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::vm::VirtualMachine::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::$_6>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
6: TVMFuncCall
```
- Even if I fix the above problem by making sure that all allocations go through `PooledAllocator`, my program still crashes due to excessive host-memory allocation (I haven't looked into why so much host memory is allocated when I'm running on a GPU target). Also, if I use the CPU target, the program is simply killed after reaching the memory limit, before the try/catch can catch the allocation failure.
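To make the first failure mode concrete, here is a minimal, self-contained sketch of the pooling pattern under discussion. `FakeDevice`, the size-keyed pool, and the exact class shapes are illustrative stand-ins, not TVM's actual implementation; the point is that an allocation going through the pool recovers via `ReleaseAll()` plus a retry, while a direct device allocation (which is effectively what `NDArray::Empty` does) throws with nothing to catch it:

```cpp
#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <vector>

// Illustrative stand-in for a device API with a fixed memory budget.
struct FakeDevice {
  size_t capacity;
  size_t used = 0;
  void* Alloc(size_t nbytes) {
    if (used + nbytes > capacity) throw std::runtime_error("device OOM");
    used += nbytes;
    return ::operator new(nbytes);
  }
  void Free(void* ptr, size_t nbytes) {
    used -= nbytes;
    ::operator delete(ptr);
  }
};

// Sketch of the pooled-allocator pattern: freed buffers go back into a
// size-keyed pool, and ReleaseAll() is only called when the device itself
// reports an allocation failure.
class PooledAllocator {
 public:
  explicit PooledAllocator(FakeDevice* dev) : dev_(dev) {}

  void* Alloc(size_t nbytes) {
    auto it = pool_.find(nbytes);
    if (it != pool_.end() && !it->second.empty()) {
      void* ptr = it->second.back();
      it->second.pop_back();
      return ptr;
    }
    try {
      return dev_->Alloc(nbytes);
    } catch (const std::runtime_error&) {
      ReleaseAll();                // last resort: hand pooled memory back
      return dev_->Alloc(nbytes);  // retry once; rethrows on a true OOM
    }
  }

  // Freeing only returns the buffer to the pool; the device still sees
  // the memory as used until ReleaseAll() runs.
  void Free(void* ptr, size_t nbytes) { pool_[nbytes].push_back(ptr); }

  void ReleaseAll() {
    for (auto& kv : pool_)
      for (void* ptr : kv.second) dev_->Free(ptr, kv.first);
    pool_.clear();
  }

 private:
  FakeDevice* dev_;
  std::unordered_map<size_t, std::vector<void*>> pool_;
};
```

With a 1024-byte device, allocating and freeing two 512-byte buffers leaves the whole budget parked in the pool. A subsequent `alloc.Alloc(1024)` still succeeds, because the catch branch releases the pool and retries, but an equivalent `dev.Alloc(1024)` issued directly, bypassing the pool, would terminate the program just as in the stack trace above.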
So I think we need a better way to decide when to call `ReleaseAll()` early if necessary. Should we add a device API to query the maximum available memory and call `ReleaseAll()` when we reach, say, 90%? This doesn't work if other memory-hungry processes are in use…
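The 90% heuristic could be a small policy check on top of such a query. Note that `MemInfo` and the device-side memory query it represents are hypothetical; TVM's `DeviceAPI` exposes nothing like this today:

```cpp
#include <cstddef>

// Hypothetical result of a device-API memory query (assumed, not a real
// TVM API): total device memory and how much is currently free.
struct MemInfo {
  size_t total_bytes;
  size_t free_bytes;
};

// Returns true once device usage crosses `fraction` of the total, signalling
// that the pool should call ReleaseAll() before the next allocation instead
// of waiting for the device to fail.
inline bool ShouldReleaseAll(const MemInfo& info, double fraction = 0.9) {
  size_t used = info.total_bytes - info.free_bytes;
  return static_cast<double>(used) >=
         fraction * static_cast<double>(info.total_bytes);
}
```

One caveat this makes visible: `free_bytes` would reflect allocations by every process on the device, so a memory-hungry neighbour can push the check over the threshold (or, worse, consume the headroom right after we decide not to release), which is exactly the limitation noted above.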