I’m deploying OpenCL inference on Apple hardware (x86_64) and am running into a crash at runtime, after my program has successfully completed some inference requests.
Sometimes this crash presents as a SIGSEGV and other times as a SIGBUS. Regardless of the signal, it always happens on the thread serving the dispatch queue com.Metal.CompletionQueueDispatch.
The odd thing is that my inference targets OpenCL, and my libtvm and libtvm_runtime dylibs are built with USE_METAL=OFF. It seems to me that there should be no Metal interaction whatsoever.
Here is an lldb session during one of these crashes:
(lldb) target create "./my-program"
Current executable set to '/Users/schell/code/My-Program/let-the-right-one-in/my-program' (x86_64).
(lldb) run --database data.db --address 127.0.0.1:33005
Process 68637 launched: '/Users/schell/code/My-Program/let-the-right-one-in/my-program' (x86_64)
My-Program version 1.4.2 starting...
2021-07-27 15:48:42.009787+1200 my-program[68637:2034957] +[MTLIOAccelDevice registerDevices]: Zero Metal services found
Process 68637 stopped
* thread #28, queue = 'com.Metal.CompletionQueueDispatch', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
frame #0: 0x000000011712672a AppleMetalOpenGLRenderer`invocation function for block in GLDQueueRec::handleDstBuffer(GLDBufferRec*, GLDBufferImageRegionRec const*) + 26
AppleMetalOpenGLRenderer`invocation function for block in GLDQueueRec::handleDstBuffer(GLDBufferRec*, GLDBufferImageRegionRec const*):
-> 0x11712672a <+26>: addq (%rax), %r14
0x11712672d <+29>: movq 0x5fd54(%rip), %rsi ; "contents"
0x117126734 <+36>: callq *0x5b926(%rip) ; (void *)0x00007fff202a0780: objc_msgSend
0x11712673a <+42>: addq 0x38(%rbx), %rax
Target 0: (my-program) stopped.
(lldb) bt
* thread #28, queue = 'com.Metal.CompletionQueueDispatch', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
* frame #0: 0x000000011712672a AppleMetalOpenGLRenderer`invocation function for block in GLDQueueRec::handleDstBuffer(GLDBufferRec*, GLDBufferImageRegionRec const*) + 26
frame #1: 0x00007fff285e4472 Metal`MTLDispatchListApply + 34
frame #2: 0x00007fff285e498d Metal`-[_MTLCommandBuffer didCompleteWithStartTime:endTime:error:] + 577
frame #3: 0x00007fff38794f33 IOGPU`-[IOGPUMetalCommandBuffer didCompleteWithStartTime:endTime:error:] + 188
frame #4: 0x00007fff285e4622 Metal`-[_MTLCommandQueue commandBufferDidComplete:startTime:completionTime:error:] + 161
frame #5: 0x00007fff3879badc IOGPU`__IOGPUNotificationQueueSetDispatchQueue_block_invoke + 164
frame #6: 0x00007fff20258886 libdispatch.dylib`_dispatch_client_callout4 + 9
frame #7: 0x00007fff2026faa0 libdispatch.dylib`_dispatch_mach_msg_invoke + 444
frame #8: 0x00007fff2025e473 libdispatch.dylib`_dispatch_lane_serial_drain + 263
frame #9: 0x00007fff202705e2 libdispatch.dylib`_dispatch_mach_invoke + 484
frame #10: 0x00007fff2025e473 libdispatch.dylib`_dispatch_lane_serial_drain + 263
frame #11: 0x00007fff2025f0c0 libdispatch.dylib`_dispatch_lane_invoke + 417
frame #12: 0x00007fff2025e473 libdispatch.dylib`_dispatch_lane_serial_drain + 263
frame #13: 0x00007fff2025f08d libdispatch.dylib`_dispatch_lane_invoke + 366
frame #14: 0x00007fff20268bed libdispatch.dylib`_dispatch_workloop_worker_thread + 811
frame #15: 0x00007fff203ff4c0 libsystem_pthread.dylib`_pthread_wqthread + 314
frame #16: 0x00007fff203fe493 libsystem_pthread.dylib`start_wqthread + 15
I think it’s notable that this gets printed at startup:
2021-07-27 15:48:42.009787+1200 my-program[68637:2034957] +[MTLIOAccelDevice registerDevices]: Zero Metal services found
Has anyone else run into this? Any help or insight would be much appreciated.