Currently, the VM `PooledAllocator` releases its memory only when the underlying device fails to allocate more: tvm/pooled_allocator.h at 553778885388a9eff4d611e1022baecd75c69088 · apache/tvm · GitHub. This causes a program crash when doing repeated inference with dynamic batch sizes. See [Bug] PyTorch MaskRCNN GPU OOM error · Issue #8233 · apache/tvm · GitHub for a minimal repro.
It seems there are two issues with it:
- `AllocDataSpace` can be called outside of `PooledAllocator`, by `NDArray::Empty(...)`: tvm/ndarray.cc at 4d9bc9b4a3e9e8d3420efe60a52964fcd4c29c8d · apache/tvm · GitHub. That call is not protected by try/catch, so if almost all memory is held by `PooledAllocator` when `NDArray::Empty` is called, the program crashes with the following error:
```
terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what(): [19:12:54] /home/masa/projects/dev/tvm/src/runtime/vulkan/vulkan_stream.cc:123:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-13: Unknown Vulkan error code
Stack trace:
  0: tvm::runtime::vulkan::VulkanStream::Synchronize()
  1: _ZN3tvm7runtime6vulkan15VulkanDeviceAPI13FreeDataSpac
  2: tvm::runtime::NDArray::Internal::DefaultDeleter(tvm::runtime::Object*)
  3: tvm::runtime::NDArray::CopyTo(DLDevice const&) const
  4: tvm::runtime::vm::CopyTo(tvm::runtime::ObjectRef, DLDevice const&)
  5: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::vm::VirtualMachine::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::$_6>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  6: TVMFuncCall
```
- Even if I fix the above problem by making sure that all allocations go through `PooledAllocator`, my program still crashes due to too much host-memory allocation (I haven't looked into why so much host memory is allocated when I'm running on a GPU target). Also, if I use the CPU target, the program is simply killed after reaching the memory limit, before try/catch gets a chance to catch the allocation failure.
So I think we need a better way to decide when to call `ReleaseAll()` early if necessary. Should we add a device API to query the maximum available memory, and call `ReleaseAll()` when usage reaches, say, 90%? Even that wouldn't be reliable if other memory-hungry processes are in use…