Deploying OpenCL on Apple Hardware - Metal SIGSEGV and SIGBUS

schell · July 27, 2021, 10:11pm

I’m deploying my OpenCL inference on Apple hardware (x86_64) and am running into a crash at runtime, after my program successfully completes some inference requests.

Sometimes this crash presents as a SIGSEGV and other times as a SIGBUS. Regardless of the signal, it always happens on the thread serving the dispatch queue com.Metal.CompletionQueueDispatch. The odd thing is that my inference is targeting OpenCL and my libtvm and libtvm_runtime dylibs are built with USE_METAL=OFF. It seems to me that there really should be no Metal interaction whatsoever.

Here is an lldb session during one of these crashes:

(lldb) target create "./my-program"
Current executable set to '/Users/schell/code/My-Program/let-the-right-one-in/my-program' (x86_64).
(lldb) run --database data.db --address 127.0.0.1:33005
Process 68637 launched: '/Users/schell/code/My-Program/let-the-right-one-in/my-program' (x86_64)
My-Program version 1.4.2 starting...
2021-07-27 15:48:42.009787+1200 my-program[68637:2034957] +[MTLIOAccelDevice registerDevices]: Zero Metal services found
Process 68637 stopped
* thread #28, queue = 'com.Metal.CompletionQueueDispatch', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x000000011712672a AppleMetalOpenGLRenderer`invocation function for block in GLDQueueRec::handleDstBuffer(GLDBufferRec*, GLDBufferImageRegionRec const*) + 26
AppleMetalOpenGLRenderer`invocation function for block in GLDQueueRec::handleDstBuffer(GLDBufferRec*, GLDBufferImageRegionRec const*):
->  0x11712672a <+26>: addq   (%rax), %r14
    0x11712672d <+29>: movq   0x5fd54(%rip), %rsi       ; "contents"
    0x117126734 <+36>: callq  *0x5b926(%rip)            ; (void *)0x00007fff202a0780: objc_msgSend
    0x11712673a <+42>: addq   0x38(%rbx), %rax
Target 0: (my-program) stopped.
(lldb) bt
* thread #28, queue = 'com.Metal.CompletionQueueDispatch', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x000000011712672a AppleMetalOpenGLRenderer`invocation function for block in GLDQueueRec::handleDstBuffer(GLDBufferRec*, GLDBufferImageRegionRec const*) + 26
    frame #1: 0x00007fff285e4472 Metal`MTLDispatchListApply + 34
    frame #2: 0x00007fff285e498d Metal`-[_MTLCommandBuffer didCompleteWithStartTime:endTime:error:] + 577
    frame #3: 0x00007fff38794f33 IOGPU`-[IOGPUMetalCommandBuffer didCompleteWithStartTime:endTime:error:] + 188
    frame #4: 0x00007fff285e4622 Metal`-[_MTLCommandQueue commandBufferDidComplete:startTime:completionTime:error:] + 161
    frame #5: 0x00007fff3879badc IOGPU`__IOGPUNotificationQueueSetDispatchQueue_block_invoke + 164
    frame #6: 0x00007fff20258886 libdispatch.dylib`_dispatch_client_callout4 + 9
    frame #7: 0x00007fff2026faa0 libdispatch.dylib`_dispatch_mach_msg_invoke + 444
    frame #8: 0x00007fff2025e473 libdispatch.dylib`_dispatch_lane_serial_drain + 263
    frame #9: 0x00007fff202705e2 libdispatch.dylib`_dispatch_mach_invoke + 484
    frame #10: 0x00007fff2025e473 libdispatch.dylib`_dispatch_lane_serial_drain + 263
    frame #11: 0x00007fff2025f0c0 libdispatch.dylib`_dispatch_lane_invoke + 417
    frame #12: 0x00007fff2025e473 libdispatch.dylib`_dispatch_lane_serial_drain + 263
    frame #13: 0x00007fff2025f08d libdispatch.dylib`_dispatch_lane_invoke + 366
    frame #14: 0x00007fff20268bed libdispatch.dylib`_dispatch_workloop_worker_thread + 811
    frame #15: 0x00007fff203ff4c0 libsystem_pthread.dylib`_pthread_wqthread + 314
    frame #16: 0x00007fff203fe493 libsystem_pthread.dylib`start_wqthread + 15

I think it’s notable that this gets printed at startup:

2021-07-27 15:48:42.009787+1200 my-program[68637:2034957] +[MTLIOAccelDevice registerDevices]: Zero Metal services found

Has anyone else run into this? Any help or insight is much appreciated.

elvin-n · July 29, 2021, 9:12pm

Just curious: what are the reasons for using OpenCL on Apple instead of Metal? Apple does not support OpenCL well and even warns on its OpenCL page: "If you are using OpenCL for computational tasks in your Mac app, we recommend that you transition to Metal and Metal Performance Shaders."

schell · July 29, 2021, 9:30pm

Well, thank you for the warning! I didn’t know OpenCL was in such a state on the Mac.

I tried shipping with Metal and ran into problems both when tuning for Metal and when deploying for Metal. Here is the error I get when trying to convert my ONNX models to Metal TVM shared objects:

Traceback (most recent call last):
  File "/Users/schell/my-app/libtvm/scripts/metal_test.py", line 6, in <module>
    mod = loaded_lib["default"](dev)
  File "/Users/schell/my-app/libtvm/tvm/python/tvm/runtime/module.py", line 107, in __getitem__
    return self.get_function(name)
  File "/Users/schell/my-app/libtvm/tvm/python/tvm/runtime/module.py", line 91, in get_function
    raise AttributeError("Module has no function '%s'" % name)
AttributeError: Module has no function 'default'

For deployment I’m using the Rust bindings to TVM and I think the happy path for deployment to Metal has not been well-trodden. Running inference resulted in a SIGSEGV, maybe due to the above?

Because of time constraints I chose to target OpenCL instead, and get pretty good results in tests, but when packaged and deployed I get the unsettling Metal thread segfault.

elvin-n · July 30, 2021, 9:23pm

May I ask several more questions?

Are you trying to deploy for MacOS or iOS and which versions?
What language is your app originally written in: Objective-C or Swift?
What are the target and target_host (especially) that you used during compilation of your network?
Have you enabled Metal during the build of TVM for your platform, i.e. set(USE_METAL ON) in config.cmake?

schell · July 30, 2021, 9:46pm

Hi Elvin, thanks

Are you trying to deploy for MacOS or iOS and which versions?

I am deploying for MacOS >= 10.13.

What language is your app originally written in: Objective-C or Swift?

It is an Electron app with the frontend written in Typescript and the backend written in Rust. We are using the Rust bindings of TVM for inference.

What are the target and target_host (especially) that you used during compilation of your network?

The target is opencl and the target_host is llvm -mtriple=x86_64-apple-darwin.

Have you enabled Metal during the build of TVM for your platform, i.e. set(USE_METAL ON) in config.cmake?

I originally compiled TVM with USE_METAL=OFF (because my target is OpenCL) but after seeing the crash report (which always says the crash happens on a Metal thread) I tried compiling with USE_METAL=ON. Compiling with Metal enabled seems to have no effect. I still see the same crash.

I should add that the error thrown is not always a SIGSEGV; I’ve also seen this Metal thread crash with a SIGBUS. Others at my company have simply had their laptops freeze after a huge spike in memory usage, requiring a hard power cycle!

One possible conclusion is that the problem lies within the Rust bindings and that there is a memory leak - possibly exhausting the available RAM or GPU memory, causing Metal to throw (or simply choke up).
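One rough way to test the leak hypothesis is to watch the process's peak resident memory while repeating the same inference call. The sketch below is plain Python using only the standard library, not the Rust bindings; peak_rss_growth and the step callback are hypothetical names, and step stands in for a single inference request. It only sees host RAM, so a leak confined to GPU memory would not show up here.

```python
import resource
import sys


def peak_rss_bytes() -> int:
    """Return the peak resident set size of this process, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak if sys.platform == "darwin" else peak * 1024


def peak_rss_growth(step, iterations: int) -> int:
    """Call `step` repeatedly and return how much the peak RSS grew, in bytes.

    `step` is a hypothetical zero-argument callable that performs one
    inference request.
    """
    before = peak_rss_bytes()
    for _ in range(iterations):
        step()
    return peak_rss_bytes() - before
```

Because this measures a high-water mark, some growth during warm-up is expected. Growth that keeps scaling with the iteration count, instead of levelling off, points toward a leak.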

schell · August 1, 2021, 10:21pm

I found a relevant issue on the TVM GitHub: "[Rust] Memory leak in NDArray" (apache/tvm issue #6559).

schell · August 10, 2021, 2:22am

@elvin-n - I compiled my models to run on the Metal target and now I get a different error when trying to run inference:

Check failed: (dev.device_id >= 0 && static_cast<size_t>(dev.device_id) < devices.size()) is false: Invalid Metal device_id=0

Oddly enough, before this failed check I see this line printed:

[14:29:11] ~/tvm/src/runtime/metal/metal_device_api.mm:165: Intializing Metal device 0, name=Apple M1

Looking at the code in tvm/src/runtime/metal/metal_device_api.mm it seems that this check should absolutely not fail if I have a device at all, which I know I do based on the line printed just before the check.
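Restating the failed condition in isolation shows what it implies: for device_id=0, the check can only fail if the device list is empty at the moment it runs. A minimal sketch in plain Python (this is not TVM code; the function name is made up, and device_count plays the role of devices.size()):

```python
def is_valid_device_id(device_id: int, device_count: int) -> bool:
    """Mirror of the failed check: device_id >= 0 and device_id < devices.size()."""
    return 0 <= device_id < device_count


# device_id=0 passes as soon as at least one device is registered...
print(is_valid_device_id(0, 1))
# ...and fails only when no devices are registered at all.
print(is_valid_device_id(0, 0))
```

So the error suggests that, whatever the log line says, the list that the check consults held no devices at that point.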

elvin-n · August 10, 2021, 8:43am

Since you are getting into metal_device_api.mm, you must have recompiled TVM with USE_METAL=ON, right? Just double-checking.

The "Intializing Metal device 0, name=Apple M1" message does not say that the device was initialized properly, only that initialization was attempted. I will verify running a Metal model on an M1 Mac.

elvin-n · August 10, 2021, 5:02pm

I just verified inception-v1-9.onnx on Metal on M1: it works well.

Parameters for network compilation:

target = "metal"
target_host = "llvm -mtriple=arm64-apple-darwin"
...
def m1_create_dylib(output, objects):
    xcode.create_dylib(output, objects, arch="arm64", sdk="macosx")
m1_create_dylib.output_format = "dylib"
...
    # Compile in default mode, without the tuning flow, for now. Tuning is an independent step to do once the network works in default mode.
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, target_host=target_host, params=params)
        lib.export_library(out_lib_name_tuned, fcompile=m1_create_dylib)

schell · August 10, 2021, 10:30pm

Thank you for the snippet. This is the first time I’ve heard of the sdk="macosx" bit. I’ll try this and see if that makes a difference. I wouldn’t be surprised if the default fcompile sees arm64 and assumes that the host is iOS.

schell · August 10, 2021, 10:33pm

And yes, I recompiled my deployment TVM with USE_METAL=ON.

elvin-n · August 11, 2021, 5:39am

Just to double-check: you have an Apple M1, right? Previously you mentioned "I’m deploying my OpenCL inference on Apple hardware (x86_64)" and gave the target_host as llvm -mtriple=x86_64-apple-darwin. If you have an Apple M1, those statements are wrong. You do not have x86_64, and the target_host should be as in my snippet: llvm -mtriple=arm64-apple-darwin.

As for sdk: if you are compiling for MacOS, sdk="macosx" is most likely the default value. The crash is most likely caused by the wrong target_host.
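To make the point about matching target_host to the architecture concrete, here is a small sketch that maps the architecture the deployed binary runs as to the two target_host strings mentioned in this thread. The table and the target_host_for helper are illustrative names, not part of TVM.

```python
# The two target_host strings used in this thread, keyed by architecture.
TARGET_HOSTS = {
    "x86_64": "llvm -mtriple=x86_64-apple-darwin",
    "arm64": "llvm -mtriple=arm64-apple-darwin",
}


def target_host_for(machine: str) -> str:
    """Return the target_host string for the architecture the app runs as."""
    try:
        return TARGET_HOSTS[machine]
    except KeyError:
        raise ValueError(f"no target_host known for architecture {machine!r}") from None


print(target_host_for("arm64"))
```

The key should be the architecture of the deployed process, which is not necessarily the architecture of the machine the model was compiled on.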

schell · August 12, 2021, 9:01am

My development machine is an M1 but my inference application is built with an x86_64 toolchain to target x86_64 machines. The results I’m reporting here happen when I run this inference application in Rosetta on my M1 machine, but I get the same errors when running on a native Intel x86_64 machine.
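For anyone reproducing this, it can help to confirm whether a process really is being translated. macOS exposes this through the sysctl key sysctl.proc_translated, which reads 1 for a process running under Rosetta and 0 for a native one; on Intel Macs and other systems the key is absent. A minimal Python sketch (the helper names are made up, and it assumes the sysctl child process inherits the architecture of the process that launches it):

```python
import subprocess


def parse_proc_translated(sysctl_output: str) -> bool:
    """Interpret the output of `sysctl -n sysctl.proc_translated`."""
    return sysctl_output.strip() == "1"


def running_under_rosetta() -> bool:
    """Best-effort check; returns False when the key or the tool is missing."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # No sysctl binary, or the key does not exist on this system.
        return False
    return parse_proc_translated(result.stdout)


print(running_under_rosetta())
```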

It looks like the original segfault and bus errors were caused by a series of memory leaks in the Rust bindings. There have been a number of PRs as a result.

github.com/apache/tvm: [Rust][Fix] Memory leak (main ← jroesch:rust-memory-leak, opened by jroesch on 11 Aug 2021 at 01:19 UTC, +52 -69)

"I found a really dumb memory leak that was probably introduced by @mwillsey and …I refactoring the Function and Object system and debugging memory issues. The old drop implementation was now invalid in the new owned function model. Instead of replicating RC behavior I added an inner owned pointer which drops the underlying allocation and wrapped it in an Arc. cc @Lunderberg @schell"

I will close this issue and open a new one to discuss the Metal error. Thank you for your help @elvin-n

schell · August 12, 2021, 9:03am

Ah, hrm - I guess this is a discussion forum and not a list of issues, so there’s nothing to close! Maybe I should just edit the title to include the new error?

elvin-n · August 12, 2021, 9:20am

I seriously doubt that leaks can cause segfaults at the loading stage. I would bet more on Rosetta 2 and the software stack running through emulation. At the same time, the fact that you "get the same errors when running on a native Intel x86_64 machine" is more interesting. I saw the error messages from the M1; are there any errors from the x86_64 machine?

schell · August 12, 2021, 9:45am

I know, it seems far-fetched, but the leaks in question were the entirety of every tensor buffer. My app runs multiple models in parallel tight loops, so even a small amount of memory leaked turns into a lot of memory over time, which has lots of implications with regard to available RAM, swap and disk space! When the RAM runs out and starts getting swapped to disk, the machine will eventually also run out of physical storage - so these memory issues are quite a big deal in my case.
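A back-of-the-envelope calculation shows how quickly this adds up. Every number below is hypothetical and chosen only for illustration: one leaked 1x3x224x224 float32 input tensor per inference, four models each running 30 inferences per second, and 16 GiB of memory to burn through.

```python
def seconds_until_exhausted(leak_bytes_per_inference: int,
                            inferences_per_second: float,
                            available_bytes: int) -> float:
    """Time until `available_bytes` is consumed at a constant leak rate."""
    leak_rate = leak_bytes_per_inference * inferences_per_second
    if leak_rate <= 0:
        raise ValueError("leak rate must be positive")
    return available_bytes / leak_rate


# Hypothetical workload, not measured from the real app.
tensor_bytes = 1 * 3 * 224 * 224 * 4   # one float32 tensor of shape (1, 3, 224, 224)
total_rate = 4 * 30                    # four models at 30 inferences per second each
seconds = seconds_until_exhausted(tensor_bytes, total_rate, 16 * 1024**3)
print(f"{seconds / 60:.1f} minutes")
```

Under these assumptions the leak consumes about 72 MB per second and 16 GiB is gone in roughly four minutes, which is consistent with a machine freezing after a sudden spike in memory usage.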

There are two different issues here: the memory leak at inference time, which is causing segfaults and sometimes bus errors, and the Metal loading issue, which precludes running inference at all.

For the Metal bug, I see this error no matter what the architecture is:

I have compiled .so files for x86_64 and for arm64, but I haven’t tried the sdk=macosx argument yet as I’ve been out of the office today. I will still let you know how that goes.

Maybe the important bit of information is that the .so files being loaded were compiled on an M1?

echuraev · August 24, 2021, 2:47pm

Hello @schell!

I replied to you on the GitHub issue "Metal device error on M1 MBP" (apache/tvm issue #8700).

Please take a look at it.