Cast from float64 to float16 causes a segmentation fault

hgt312 · August 21, 2020, 12:34pm

This fails on Ubuntu 18.04 with both LLVM 8 and LLVM 10. However, it works fine on my MacBook. In addition, casting from float32 to float16 works.

To reproduce:

import tvm
import topi
from tvm import te
import numpy as np

itype = "float64"
otype = "float16"

x = te.placeholder((2, 2), name='x', dtype=itype)
y = topi.cast(x, otype)
s = te.create_schedule(y.op)
f = tvm.build(s, [x, y], "llvm")
nx = tvm.nd.array(np.random.normal(size=(2, 2)).astype(itype))
ny = tvm.nd.array(np.zeros((2, 2), otype))
f(nx, ny)

Output:

Segmentation fault (core dumped)
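
Since casting from float32 to float16 is reported to work, one possible workaround (an untested assumption, not something proposed in this thread) would be to chain two casts, e.g. topi.cast(topi.cast(x, "float32"), "float16"). The catch is that rounding twice can give a different result than rounding once. The sketch below uses only Python's standard struct module (no TVM) to show one such case; the helper names to_f32 and to_f16 are made up for illustration.

```python
import struct

def to_f32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def to_f16(value):
    """Round a Python float (float64) to the nearest float16 value."""
    return struct.unpack("e", struct.pack("e", value))[0]

# A float64 value just above the midpoint between the two neighbouring
# float16 values 1.0 and 1.0 + 2**-10.
x = 1.0 + 2.0**-11 + 2.0**-30

direct = to_f16(x)            # single rounding: float64 -> float16
two_step = to_f16(to_f32(x))  # double rounding: float64 -> float32 -> float16

print("direct  :", direct)
print("two-step:", two_step)
```

The float32 step drops the small 2**-30 term, which leaves an exact tie that the float16 step then rounds to even, so the two results differ by one float16 step. For most inputs the two paths agree, so whether this matters depends on the use case.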

kparzysz · August 21, 2020, 2:52pm

This is crashing because the functions that perform the Float16 conversions are not present. They don’t get resolved at runtime in the generated code and so their addresses remain null.
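
A rough way to check this from Python is to ask the dynamic loader whether the conversion helpers are visible in the current process. This is only a sketch: the symbol names listed are the conventional compiler-rt/libgcc names as far as I know and are an assumption here, and the lookup done by TVM's LLVM execution engine may not match a plain dlopen-style lookup exactly. The function names has_symbol and missing_symbols are made up for illustration, and ctypes.CDLL(None) only works on POSIX systems.

```python
import ctypes

# Assumed helper names for half-precision conversions (not taken from this thread).
HALF_HELPERS = [
    "__truncdfhf2",    # float64 -> float16
    "__truncsfhf2",    # float32 -> float16
    "__extendhfsf2",   # float16 -> float32
    "__gnu_f2h_ieee",  # older name for float32 -> float16
    "__gnu_h2f_ieee",  # older name for float16 -> float32
]

def has_symbol(name, library=None):
    """Return True if `name` resolves in `library` (None means the current process)."""
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        return False
    try:
        handle[name]  # raises AttributeError when the symbol cannot be found
    except AttributeError:
        return False
    return True

def missing_symbols(names, library=None):
    """Return the subset of `names` that does not resolve."""
    return [n for n in names if not has_symbol(n, library)]

if __name__ == "__main__":
    print("unresolved helpers:", missing_symbols(HALF_HELPERS))
```

If a helper shows up as unresolved on the failing machine but not on the working one, that would be consistent with the explanation above.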

kparzysz · August 21, 2020, 3:38pm

The problem here is in finding out where to get these functions from. They are present in clang’s compiler-rt, and (afaik) in gcc’s libgcc (possibly with different names), but in TVM we don’t have any indication as to where they are on any particular system.

tgall_foo · August 21, 2020, 4:23pm

There is a different thread where I ran into something similar on arm64 so I wonder if they aren’t exactly the same issue.

In my case, clang hit an internal compiler error (ICE) when building TVM. The culprit is:

src/relay/transforms/pattern_util.h

#if (__ARM_FP16_FORMAT_IEEE == 1)
    if (array->dtype.bits == 16) {
      return reinterpret_cast<__fp16*>(array->data)[i];
    }
#endif

According to some coworkers at Linaro who work on the LLVM toolchain, this happens because LLVM is missing support. They were going to add it to their to-do list.

Also, if I build TVM natively on arm64 with gcc, loading libtvm.so fails with: OSError: /home/debian/tvm/build/libtvm.so: undefined symbol: __extendhftf2

Adding -static-libgcc in the top level CMakeLists.txt fixes it.

That is: target_link_libraries(tvm -static-libgcc ${TVM_LINKER_LIBS} ${TVM_RUNTIME_LINKER_LIBS})

kparzysz · August 21, 2020, 5:04pm

This is actually a good idea. I was thinking about loading the library into the execution engine, but if the symbols are defined in the current process, we don’t need to do that. We need to make sure they don’t get garbage collected by the linker though.

LLVM does support float16. These functions are implemented in libclang_rt.builtins-<arch>.a, e.g. libclang_rt.builtins-x86_64.a on x86. Clang usually uses libgcc by default, but you can also use compiler-rt with the -rtlib=compiler-rt flag.

tgall_foo · August 24, 2020, 1:48pm

Here is the LLVM patch that fixes the ICE I mentioned when building TVM with clang:

https://reviews.llvm.org/D86453

Based on what @kparzysz says, I suspect this only helps arm64.

hongh · June 15, 2021, 7:09am

Did you solve this problem? I also encountered the same problem.

hgt312 · June 16, 2021, 3:29pm

Not solved yet. Have you tested it with a newer version of LLVM?

hongh · June 17, 2021, 2:25am

My LLVM version is 10.0.0.

AndrewZhaoLuo · June 25, 2021, 5:10pm

This does not crash when I run it. I’m on an M1 MacBook with LLVM 11.1.0.

Try upgrading?