[frontend][pytorch] TVM compatibility with Torch 1.12.0

oviazlo · September 9, 2022, 10:44am

Hello,

TVM currently uses Torch 1.11.0 (judging from the ubuntu_install_onnx.sh script). Are there any known blockers to using Torch 1.12.0?

I tried running some examples with Torch 1.12.0. Running gallery/how_to/deploy_models/deploy_object_detection_pytorch.py gives the error message below, although the same example runs fine with Torch 1.11.0. I would appreciate your help, thanks!

 Traceback (most recent call last): 
   File "deploy_object_detection_pytorch.py", line 122, in <module> 
     mod, params = relay.frontend.from_pytorch(script_module, shape_list) 
   File "/home/ubuntu/tvm/python/tvm/relay/frontend/pytorch.py", line 4542, in from_pytorch 
     outputs = converter.convert_operators(_get_operator_nodes(graph.nodes()), outputs, ret_name) 
   File "/home/ubuntu/tvm/python/tvm/relay/frontend/pytorch.py", line 3916, in convert_operators 
     relay_out = relay_op( 
   File "/home/ubuntu/tvm/python/tvm/relay/frontend/pytorch.py", line 812, in fill_ 
     return self.full_impl(self.infer_shape(data), fill_value, input_types[0]) 
   File "/home/ubuntu/tvm/python/tvm/relay/frontend/pytorch.py", line 679, in full_impl 
     out = _op.full(_expr.const(fill_value, dtype=dtype), size, dtype=dtype) 
   File "/home/ubuntu/tvm/python/tvm/relay/expr.py", line 517, in const 
     raise ValueError("value has to be scalar or NDArray") 
 ValueError: value has to be scalar or NDArray
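
The last frames of the traceback show the fill value being passed to a constant constructor that accepts only a scalar or an NDArray. The following is a minimal sketch of that failure mode, not TVM's actual code: `make_constant`, `fill_like`, and `PlaceholderExpr` are hypothetical names, and the idea that the fill value arrives as a graph expression under Torch 1.12.0 is an assumption, not something the traceback confirms.

```python
import numbers


class PlaceholderExpr:
    """Hypothetical stand-in for a graph expression; not a TVM class."""


def make_constant(value):
    """Simplified stand-in for a constant constructor.

    Arrays are left out to keep the sketch stdlib-only, so only plain
    Python numbers are accepted here.
    """
    # bool is a subclass of int, so it passes this check as well.
    if isinstance(value, numbers.Number):
        return {"kind": "constant", "value": value}
    # Anything else is rejected with the message seen in the traceback.
    raise ValueError("value has to be scalar or NDArray")


def fill_like(shape, fill_value):
    """Build a 'filled tensor' description from a shape and a fill value."""
    return {"shape": tuple(shape), "fill": make_constant(fill_value)}


if __name__ == "__main__":
    print(fill_like((2, 3), 0.0))  # a plain scalar is accepted
    try:
        fill_like((2, 3), PlaceholderExpr())  # a non-scalar object is rejected
    except ValueError as err:
        print("error:", err)
```

If this is what happens, the converter would have to resolve the fill value to a scalar before building the constant, or handle the expression case separately.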

masahi · September 9, 2022, 7:40pm

There is no blocker; someone just has to do the upgrade work. And indeed, the MaskRCNN model used in deploy_object_detection_pytorch.py always brings challenges when we upgrade.

oviazlo · October 4, 2022, 10:56am

Hi @masahi, thanks for the answer and for the help with the PR. Could you please briefly describe how much work is needed to upgrade TVM to Torch 1.12.0, and what exactly it involves? I ran the PyTorch frontend unit tests (./tests/scripts/task_python_frontend.sh) locally on my machine, and they ran fine with Torch 1.12.0. Are there any other checks I can do? Maybe I can test it in your CI? Thanks in advance!

masahi · October 4, 2022, 7:07pm

All tests under tests/python/frontend/pytorch need to pass, on both CPU and GPU, with the new Torch version.

As you already found by running deploy_object_detection_pytorch.py, MaskRCNN import is probably broken with the new version, so we need to fix that. test_object_detection.py uses the same MaskRCNN model.

I also expect some tests in test_forward.py to be broken.
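
A minimal sketch of running the tests named above from a local checkout, assuming pytest is installed and the command is launched from the TVM repository root. The helper name `build_pytest_command` is hypothetical, and the sketch only builds the command; covering the GPU side additionally needs a TVM build and machine with GPU support, which it does not handle.

```python
import shlex
import sys

# Directory named in the thread; the two file names below are assumed to
# live inside it.
PYTORCH_FRONTEND_DIR = "tests/python/frontend/pytorch"


def build_pytest_command(targets, keyword=None):
    """Return an argv list that runs pytest on the given files or directories."""
    cmd = [sys.executable, "-m", "pytest", "-v"]
    if keyword:
        # -k restricts the run to tests whose names match the expression.
        cmd += ["-k", keyword]
    cmd += list(targets)
    return cmd


if __name__ == "__main__":
    targets = [
        PYTORCH_FRONTEND_DIR + "/test_forward.py",
        PYTORCH_FRONTEND_DIR + "/test_object_detection.py",
    ]
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_pytest_command(targets)))
```

Passing the returned list to `subprocess.run` would execute the tests with whichever Torch version is installed in the current environment.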

oviazlo · October 18, 2022, 11:42am

Hi @masahi. Could you give me a few pointers on how to run CI with a PyTorch 1.12.0 environment? I would like to check whether there are any issues with it. I ran the tests locally and did not see any big issues (but I ran them only on CPU). The Mask R-CNN test passes after the fix, and my last PR is meant to fix the issues with the FX quantization tests.

Is it enough to change the torch version in the docker/install/ubuntu_install_onnx.sh file? Or does one need to follow the procedure described here:

github.com/apache/tvm: bump PyTorch version to 1.11 (apache:main ← t-vi:bump_pytorch, opened by t-vi, 05:27AM - 26 Mar 22 UTC, +17 -156)

This bumps PyTorch to 1.11 and fixes 3 test failures. The bump is required to enable the libtorch_ops fallback due to DLPack version incompatibilities. QAT training has its own `fuse_modules` version (`fuse_modules_qat`) in PyTorch, so I changed the test.

Two amendments to the front end:

- `searchsorted` gets more (optional) parameters to its signature,
- There is a `sub` variant with `alpha` (`a - alpha * b`). PyTorch rewrites `rsub` with `alpha` to this, but we ignored it. Now we handle `sub` with alpha.

Thank you, @masahi for getting me started with the bump and pointing out the test failures. Any errors are my own.

Thank you in advance!

masahi · October 18, 2022, 7:41pm

The CI update process is a bit complicated. Changing docker/install/ubuntu_install_onnx.sh is not enough, but it is the first step. Can you send a PR with that change?

After that, I can test PT 1.12 on our CI in the ci-docker-staging branch. If all tests pass, we can make the actual image update in Jenkinsfile.
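
The first step, changing the pinned version in docker/install/ubuntu_install_onnx.sh, could be sketched as a small text rewrite. This is only an illustration: the helper `bump_pins` is hypothetical, the sample install line is an assumption about what the script contains, and the torchvision version shown is the one generally paired with Torch 1.12.0, which should be double-checked before use.

```python
import re

# Matches "package==version" pins such as "torch==1.11.0".
PIN_PATTERN = re.compile(r"\b([A-Za-z0-9_\-]+)==([0-9][0-9A-Za-z.+]*)")


def bump_pins(script_text, new_versions):
    """Rewrite pins for the packages listed in new_versions; leave others alone."""

    def replace(match):
        package = match.group(1)
        if package in new_versions:
            return f"{package}=={new_versions[package]}"
        return match.group(0)

    return PIN_PATTERN.sub(replace, script_text)


if __name__ == "__main__":
    # Hypothetical excerpt; the real script may pin versions differently.
    sample = "pip3 install onnx==1.10.2 torch==1.11.0 torchvision==0.12.0\n"
    print(bump_pins(sample, {"torch": "1.12.0", "torchvision": "0.13.0"}), end="")
```

Matching on the full package name keeps `torch` from accidentally rewriting the `torchvision` pin, and packages not listed in the mapping are left untouched.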

masahi · October 19, 2022, 10:23am

OK, only one test is failing, and it comes from the ONNX tests. See https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/288/pipeline/463 (search for “FAILED”, in capitals)

tests/python/frontend/onnx/test_forward.py::test_aten failed, because it uses PyTorch to generate the model, and PT 1.12 has some issues with this test. Can you take a look?
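
Searching a saved copy of the CI log for the capitalized “FAILED” marker can be automated with a few lines. This is a minimal sketch: `failed_tests` is a hypothetical helper, the sample log lines are made up to resemble pytest output, and `test_other` is an invented test name.

```python
import re

# A pytest-style test id: a path ending in .py, then "::" and the test name.
TEST_ID = re.compile(r"\S+\.py::\S+")


def failed_tests(log_text):
    """Return unique test ids from lines containing 'FAILED' (case-sensitive)."""
    found = []
    for line in log_text.splitlines():
        if "FAILED" not in line:
            continue
        match = TEST_ID.search(line)
        if match and match.group(0) not in found:
            found.append(match.group(0))
    return found


if __name__ == "__main__":
    # Hypothetical log excerpt, not copied from the actual CI run.
    sample_log = (
        "tests/python/frontend/onnx/test_forward.py::test_aten FAILED\n"
        "tests/python/frontend/onnx/test_forward.py::test_other PASSED\n"
        "FAILED tests/python/frontend/onnx/test_forward.py::test_aten - (error text elided)\n"
        "a line that mentions failed in lowercase is ignored\n"
    )
    for test_id in failed_tests(sample_log):
        print(test_id)
```

The match is deliberately case-sensitive, since lowercase "failed" tends to appear in unrelated log messages.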

oviazlo · October 19, 2022, 10:45am

Yes, I will have a look. Thanks!