[frontend] Inconsistent precision between macOS and Linux

Hi, I’m new to TVM and I recently found a weird problem.

I built TVM from exactly the same version, with almost the same compilation options, on macOS and Ubuntu 20.04. Both can compile the PyTorch model correctly. However, the precision of the model output differs.

To put it simply: the TVM on my laptop (macOS) can pass the tests under tests/python/frontend/pytorch/test_forward.py, but the TVM on my server (Ubuntu) cannot.

Only after I loosen the tolerance in tvm.testing.assert_allclose to about 1e-2 can the TVM on Ubuntu pass tests like test_forward.test_mnasnet0_5().
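To make that concrete, here is a minimal sketch with made-up numbers standing in for the real baseline and compiled outputs:

    import numpy as np
    import tvm.testing

    # made-up stand-ins for the PyTorch baseline and the TVM output
    baseline_output = np.array([0.1000, 0.2000, 0.3000], dtype="float32")
    compiled_output = np.array([0.1004, 0.2008, 0.3012], dtype="float32")

    # the default rtol/atol of 1e-7 rejects differences of this size;
    # only a tolerance around 1e-2 lets the Ubuntu results pass
    tvm.testing.assert_allclose(baseline_output, compiled_output,
                                rtol=1e-2, atol=1e-2)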

Does anyone have any idea about this problem? I would be very grateful for any help.

System info:

  • macOS version: 10.15.7; Python: 3.8.5; LLVM: 10.0.0; PyTorch: 1.8.1

  • Ubuntu version: 20.04; Python: 3.8.6; LLVM: 10.0.0; PyTorch: 1.8.1+cu111

  • TVM version: 0.8.dev0, at commit fbdffeb546b350eedb470f07b7915341610e3367


Here is some additional information:

I found this problem when I wanted to move the model compilation from my laptop to my server: assert_allclose passed on my laptop but failed on my server.

I tried cross compilation, that is, compiling the Linux runtime on my laptop and testing the compiled model on my server, and vice versa. The precision problem still exists: a tolerance of about 1e-2 is needed for Ubuntu, versus 1e-4 for macOS.

I suspect this is a runtime problem, maybe even unrelated to TVM, and I have started to think that 1e-2 is actually acceptable. But the discussion in “Aspirations for conversion and unit tests in the PyTorch frontend” seems to call for a tighter precision requirement.

Also, all the tests on Jenkins pass, so I have no idea what is going on…


Update: this problem is not restricted to PyTorch models. I also tried test_forward_inception_v3 under the TensorFlow tests; Ubuntu failed the test and macOS passed. It gives something like:

    AssertionError: Not equal to tolerance rtol=1e-05, atol=1e-05

    Mismatched elements: 14 / 1001 (1.4%)
    Max absolute difference: 0.00014865
    Max relative difference: 0.00256442
    x: array([[5.194177e-05, 2.972199e-05, 3.040651e-04, ..., 2.509769e-05, 1.136126e-04, 2.880761e-04]], dtype=float32)
    y: array([[5.191198e-05, 2.969970e-05, 3.036689e-04, ..., 2.506699e-05, 1.134691e-04, 2.877102e-04]], dtype=float32)

What targets do you run on macOS vs Ubuntu?

Initially just target = "llvm" for both platforms.

For cross compilation, on Ubuntu I used

    target = "llvm -mtriple=x86_64-apple-darwin"

and on macOS I used

    target = "llvm -mtriple=x86_64-linux-gnu"

CPU on macOS: Intel® Core™ i9-9880H CPU @ 2.30GHz

CPU on Ubuntu: Intel® Xeon® Bronze 3206R CPU @ 1.90GHz

After importing the graph into Relay you have mod and params.

How to save and load an IRModule:

    import tvm
    from tvm import relay

    # bind params into the main function of the module
    mod["main"] = relay.build_module.bind_params_by_name(mod["main"], params)
    # save IRModule to a json string (write mod_str to a file)
    mod_str = tvm.ir.save_json(mod)

    # load IRModule back from the json string
    mod = tvm.ir.load_json(mod_str)

Use the JSON string file to keep the same IRModule on both platforms, and try again.

Thanks for your advice. I tried your solution as follows:

    # On macOS
    import tvm
    # model_import.torch2relay is a local helper that uses relay.frontend.from_pytorch
    mod, params = model_import.torch2relay(test_model, dummy_input)
    net = mod['main']
    mod_binded = tvm.relay.build_module.bind_params_by_name(net, params)
    mod_str = tvm.ir.save_json(mod_binded)  # save mod_str to file
    with open("macOS_mod.json", 'w') as file:
        file.write(mod_str)

    # On Ubuntu
    import tvm
    with open("macOS_mod.json", 'r') as file:
        mod_str = file.read()
    mod_binded = tvm.ir.load_json(mod_str)
    lib = tvm.relay.build(mod_binded, target=target)  # target = "llvm" as above

The result is still the same: Ubuntu has worse precision.
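For reference, this is roughly how the compiled output can be dumped and compared across the two machines (a sketch, not verify_model itself; graph_executor may still be called graph_runtime on slightly older revisions, and the input shape is just an example):

    import numpy as np
    import tvm
    from tvm.contrib import graph_executor

    with open("macOS_mod.json", "r") as file:
        mod_binded = tvm.ir.load_json(file.read())

    with tvm.transform.PassContext(opt_level=3):
        lib = tvm.relay.build(mod_binded, target="llvm")

    dev = tvm.cpu(0)
    module = graph_executor.GraphModule(lib["default"](dev))

    # identical fixed input on both machines
    dummy_input = np.random.RandomState(0).rand(1, 3, 224, 224).astype("float32")
    module.set_input(0, dummy_input)
    module.run()
    out = module.get_output(0).asnumpy()  # this array is what differs between platforms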

Just for reference, I compiled the model on Ubuntu, used lib.get_params() to get its parameters, and converted all the tvm.nd.NDArray objects to numpy arrays. I saved them, loaded them on macOS, and compared the parameters using

    np.all([linux_param[key] == macos_param[key].asnumpy()])

for every key. All of the comparisons return True, which tells us that the two compiled models’ parameters are strictly equal. Even so, the final output precision is still inconsistent.
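Roughly, the parameter check looked like this (a sketch; the .npz file name is just for illustration, and lib is the module returned by tvm.relay.build on each machine):

    import numpy as np

    # On Ubuntu: dump the compiled parameters
    linux_param = {k: v.asnumpy() for k, v in lib.get_params().items()}
    np.savez("linux_params.npz", **linux_param)

    # On macOS: load the Ubuntu parameters and compare with the local build
    linux_param = dict(np.load("linux_params.npz"))
    macos_param = lib.get_params()
    for key in linux_param:
        assert np.all(linux_param[key] == macos_param[key].asnumpy()), key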

Why are you cross compiling? Have you tried compiling on Ubuntu, for Ubuntu?

Oh, of course I tried that; it is the first thing I did. Cross compiling is just another way to debug (or to provide more information), I think. As I said earlier, the first thing I did was compile with target="llvm".

Besides, I think the point is that the TVM on my Ubuntu machine cannot pass the tests under tests/python/frontend.

Yes, it is weird. We run this test on every CI run, and I don’t recall any flaky issues with test_mnasnet0_5. I just ran test_mnasnet0_5 on my laptop with an AMD APU; both the llvm and even the vulkan backend pass this test.

This problem is solved. Thanks to @fantasyRqg and @masahi. It turns out that this is not a platform problem but a GPU hardware problem. I will summarize as follows:

First of all, for all the models compiled by TVM, I used the function verify_model under tests/python/frontend/pytorch/test_forward.py. This function enables the GPU when it is available. However, NVIDIA Ampere GPUs have a precision issue (TF32). Eight days ago, a PR in the PyTorch repo described this:

Relax some TF32 test tolerance #56114

Also, here is another PR noting that Ampere GPUs give reduced precision:

Fix TF32 failures in test_linalg.py #50453

In my case, the GPU on my server happens to be an RTX 3090, which is an Ampere GPU.

So it is not a TVM problem; it is the oracle, the PyTorch baseline computed on the GPU, that gives the wrong answer. I spent a lot of time trying to debug TVM, but it turns out this is really a GPU runtime issue.
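The effect is easy to see even without TVM (a sketch; torchvision's mnasnet0_5 stands in for my model and the input is random):

    import torch
    import torchvision

    model = torchvision.models.mnasnet0_5(pretrained=True).eval()
    inp = torch.rand(1, 3, 224, 224)

    with torch.no_grad():
        cpu_out = model(inp)                      # full-FP32 reference
        gpu_out = model.cuda()(inp.cuda()).cpu()  # may use TF32 on Ampere cards

    # with TF32 active, this difference is far larger than ordinary FP32 rounding error
    print((cpu_out - gpu_out).abs().max())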


Another thing I want to mention is the usage of tvm.testing.assert_allclose in verify_model:

    # signature: assert_allclose(actual, desired, rtol=1e-7, atol=1e-7)
    # in verify_model:
    tvm.testing.assert_allclose(baseline_output, compiled_output, rtol=rtol, atol=atol)

It seems that the order of actual and desired in this call is reversed: the baseline (reference) output is passed as actual and the compiled output as desired. Based on the description of assert_allclose, I think this may potentially lead to subtle bugs.
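For clarity, reusing the names from the snippet above, I would expect the arguments to be swapped (just an illustration, not a patch):

    # numpy checks: |actual - desired| <= atol + rtol * |desired|,
    # so the PyTorch reference output should be passed as `desired`
    tvm.testing.assert_allclose(compiled_output, baseline_output, rtol=rtol, atol=atol)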

I see, that makes sense. PyTorch should have a way to disable TF32; we should probably use that in our PyTorch tests for newer GPUs.
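Something along these lines at the top of the PyTorch tests should do it (a sketch; these flags are available in recent PyTorch releases):

    import torch

    # force full-FP32 matmuls/convolutions so Ampere GPUs produce FP32 baselines
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False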