@masahi Yeah, I also ran into this issue a few months ago. If there’s an OOM, the exception just escapes… So I added another try/catch block and tried to fix it by calling ReleaseAll on OOM. The exception behavior is very weird and I was not able to debug it (the exception simply vanished and I could not catch it in GDB).
I am not sure whether calling ReleaseAll in advance would help. What about keeping a global memory state per device (though that would be a big change)? Or simply unifying all memory allocation into a “PoolAllocator” (just like what TensorFlow did), which would also let users control the memory limit. Or, at the very least, the memory pool should not hold on to a very large memory chunk (e.g., 1 GB).
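
To make the options above a bit more concrete, here is a rough Python sketch of what I have in mind. It is not the actual allocator code: `FakeDevice`, `DeviceOOM`, `max_cached_chunk`, and this `PoolAllocator` class are all hypothetical names, and the device is simulated by a byte counter whose `capacity` plays the role of the user-controlled memory limit. It only illustrates two ideas: release everything cached and retry once on OOM, and never cache a chunk above a size threshold.

```python
class DeviceOOM(MemoryError):
    """Raised by the fake backend when the device has no room left."""


class FakeDevice:
    """Hypothetical stand-in for a real device allocator; tracks bytes only."""

    def __init__(self, capacity):
        self.capacity = capacity  # acts as the per-device memory limit
        self.in_use = 0

    def alloc(self, nbytes):
        if self.in_use + nbytes > self.capacity:
            raise DeviceOOM(f"cannot allocate {nbytes} bytes")
        self.in_use += nbytes
        return nbytes  # the "buffer" is just its size in this sketch

    def free(self, buf):
        self.in_use -= buf


class PoolAllocator:
    """Hypothetical pool: one instance per device holds all cached memory."""

    def __init__(self, device, max_cached_chunk=1 << 30):
        self.device = device
        self.max_cached_chunk = max_cached_chunk  # e.g. 1 GB
        self.free_lists = {}  # size in bytes -> list of cached buffers

    def alloc(self, nbytes):
        cached = self.free_lists.get(nbytes)
        if cached:
            return cached.pop()  # reuse a cached buffer of the same size
        try:
            return self.device.alloc(nbytes)
        except DeviceOOM:
            # Give cached memory back to the device and retry once.
            # A second failure propagates to the caller as a real OOM.
            self.release_all()
            return self.device.alloc(nbytes)

    def free(self, buf):
        if buf > self.max_cached_chunk:
            self.device.free(buf)  # too large to keep around in the pool
        else:
            self.free_lists.setdefault(buf, []).append(buf)

    def release_all(self):
        for bufs in self.free_lists.values():
            for buf in bufs:
                self.device.free(buf)
        self.free_lists.clear()
```

With a 100-byte fake device, allocating and freeing 60 bytes leaves them cached, so a following 50-byte request hits OOM on the first try and only succeeds after `release_all`. In the real runtime the retry would of course have to catch whatever exception the device API actually throws, which is exactly the part I could not get to work reliably.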