DLRM / ONNX strided slice in loop issues?

I’m having issues trying to compile a small DLRM model with TVM. I’ve been attempting to go via PyTorch → ONNX → TVM. Importing the network fails with the error “relay.concatenate requires all tensors have the same ndim”. I believe this may be due to the presence of strided slices inside an ONNX Loop, as mentioned in this PR: [Relay][Frontend][Onnx] Loop Support by jwfromm · Pull Request #6700 · apache/tvm · GitHub
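For context, here is a minimal Relay snippet, unrelated to DLRM itself, that reproduces the same diagnostic when tensors of different rank are concatenated. The variable names and shapes are made up purely for illustration:

import tvm
from tvm import relay

a = relay.var("a", shape=(2, 3))      # rank 2
b = relay.var("b", shape=(2, 3, 1))   # rank 3
out = relay.concatenate([a, b], axis=0)

mod = tvm.IRModule.from_expr(relay.Function([a, b], out))
# Type inference fails here with
# "relay.concatenate requires all tensors have the same ndim"
mod = relay.transform.InferType()(mod)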

@jwfromm , can you give me any pointers for where I might start looking to fix that issue with ONNX imports?

To reproduce:

git clone 'https://github.com/facebookresearch/dlrm'
cd dlrm
CUDA_VISIBLE_DEVICES= python dlrm_s_pytorch.py --save-onnx --mini-batch-size=256 --data-size=100

That should take only a few seconds and will produce a fresh ONNX file containing a small DLRM model trained on random data.

Add this file to the repo:

import onnx
import tvm
from tvm import relay

# Load the exported model and sanity-check it with the ONNX checker.
onnx_model = onnx.load('dlrm_s_pytorch.onnx')
onnx.checker.check_model(onnx_model)

# Importing into Relay fails here with the concatenate ndim error.
mod, params = relay.frontend.from_onnx(onnx_model)

And run it to get the error.

Thanks for pointing this out @kazimuth. I took a look, and what’s happening is that PyTorch is creating technically invalid Loop nodes: they have no output shape defined. Our current Loop importer relies on that output shape info and so gets confused. I’ve changed how we import Loops in this PR so that it no longer relies on ONNX output shapes, and I confirmed that the model imports and runs after the change. Can you give it a shot and see if it works on your end?
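For anyone who wants to see the missing shape information for themselves, here is a rough sketch (not the code from the PR) that walks the exported graph and prints whatever shape info the Loop outputs carry. It assumes the dlrm_s_pytorch.onnx file produced by the steps above:

import onnx

model = onnx.load('dlrm_s_pytorch.onnx')
graph = model.graph

# Collect whatever shape info the graph declares for intermediate and final values.
value_info = {vi.name: vi for vi in list(graph.value_info) + list(graph.output)}

for node in graph.node:
    if node.op_type == 'Loop':
        for out in node.output:
            vi = value_info.get(out)
            if vi is None or not vi.type.tensor_type.shape.dim:
                print(f'Loop output {out!r} has no shape info')
            else:
                dims = [d.dim_value or d.dim_param for d in vi.type.tensor_type.shape.dim]
                print(f'Loop output {out!r}: {dims}')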
