Unable to build Relay function twice

Hi all,

I’m encountering a peculiar issue when building a Relay function twice. It’s likely caused by a misunderstanding on my side, so I’m asking here before opening an issue on GitHub.

In brief, the first call to relay.build(…) works as expected, but running it a second time raises an error.

Here is the code to reproduce the issue:

import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# Prepare parameters
input_shape = [1, 3, 380, 380]
example_input = np.random.randn(*input_shape).astype(np.float32)
target = "llvm -mcpu=core-avx2"
ctx = tvm.cpu(0)

# Get model from ONNX
onnx_model = onnx.load("./onnx_export/b4.onnx")
mod, params = relay.frontend.from_onnx(
    onnx_model, {"input0": input_shape},
)

# Build module
with tvm.transform.PassContext(opt_level=3):
    graph_module = relay.build(mod, target=target, target_host=target, params=params)

# Run, no issue
runtime_module = graph_runtime.GraphModule(graph_module['default'](ctx))
runtime_module.set_input(key="input0", value=tvm.nd.array(example_input))
runtime_module.run()
tvm_output = runtime_module.get_output(0).asnumpy()

# Build again, or use autotvm.task.extract_from_program
# Error (see below)
with tvm.transform.PassContext(opt_level=3):
    graph_module = relay.build(mod, target=target, target_host=target, params=params)

Here is the relevant part of the error:

  %45 = multiply(%43, %44);
  %46 = nn.pad(%45, pad_width=[[0, 0], [0, 0], [0, 1], [0, 1], [0, 0]]) an internal invariant was violated while typechecking your program [05:19:37] ../src/relay/op/nn/pad.cc:125: Check failed: data->shape.size() == param->pad_width.size(): There should be as many pad width pairs as shape dimensions but the shape has 4 dimensions and there are 5 pad width pairs.
; ;
  %47 = nn.conv2d(%46, meta[relay.Constant][38], strides=[2, 2], padding=[0, 0, 0, 0], groups=144, kernel_size=[3, 3]);

The model itself is an EfficientNet {B4, B6} exported to ONNX; it can be obtained by running:

python3 -m pip install geffnet
wget "https://github.com/rwightman/gen-efficientnet-pytorch/blob/master/onnx_export.py"
python3 onnx_export.py ./onnx_export/b4.onnx --model="tf_efficientnet_b4_ns" --img-size=380

Do you have an idea of what is happening?

It looks like your graph definition is being changed by build. Making this small modification to the code snippet you provided fixes the issue:

from copy import deepcopy

# Keep a pristine copy of the module before building.
orig_mod = deepcopy(mod)

# Build module
with tvm.transform.PassContext(opt_level=3):
    graph_module = relay.build(mod, target=target, target_host=target, params=params)

# Run, no issue
runtime_module = graph_runtime.GraphModule(graph_module['default'](ctx))
runtime_module.set_input(key="input0", value=tvm.nd.array(example_input))
runtime_module.run()
tvm_output = runtime_module.get_output(0).asnumpy()

# Build again, or use autotvm.task.extract_from_program
# now works since we're using a copy of the original module.
with tvm.transform.PassContext(opt_level=3):
    graph_module = relay.build(orig_mod, target=target, target_host=target, params=params)
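
The same workaround applies to task extraction, since autotvm also builds the program internally. Here is a minimal usage sketch, assuming the same mod, params, and target as above (argument defaults for extract_from_program vary slightly across TVM versions):

from copy import deepcopy
from tvm import autotvm

# Extract tuning tasks from a copy so the original module stays untouched.
tasks = autotvm.task.extract_from_program(
    deepcopy(mod), params=params, target=target
)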

It looks like only the nn.pad operator is getting changed though, which is probably not intended behavior. @jroesch, do you have any idea why this is happening? Presumably during build the convolutions are converted to NCHWc and the pad is transformed to match. It’s strange that this ends up affecting the original module though.
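
If you want to confirm the mutation yourself, here is a minimal check, assuming the mod, target, and params from the snippet above: snapshot the IR text before and after building and compare.

# Snapshot the Relay IR before and after build; if build mutates the
# module in place, the snapshots differ and the altered nn.pad (with
# five pad-width pairs) shows up in the second snapshot.
before = mod.astext(show_meta_data=False)
with tvm.transform.PassContext(opt_level=3):
    relay.build(mod, target=target, target_host=target, params=params)
after = mod.astext(show_meta_data=False)
print("module mutated by build:", before != after)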

Oh, you’re right, simply using deepcopy avoids the issue, thank you Josh! If the change to nn.pad in the original module is not intended, should I create a GitHub issue? I will wait for @jroesch’s reply.

Module has methods that mutate it for performance reasons, but the intent is that passes do not mutate the input module and instead return an updated one. I assume a pass is accidentally mutating the input module instead of doing a functional update and returning the updated copy.

Most of the passes invoke the internal copy-on-write machinery to ensure we don’t mutate the input module, but potentially one of the passes has a bug that is causing it to mutate the input.
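
For illustration, the intended functional-update convention looks like this. A minimal sketch with a hypothetical no-op pass (not an actual TVM pass): build a fresh IRModule from the input and return it, leaving the caller’s module untouched.

import tvm

# Hypothetical no-op module pass: constructs and returns a new IRModule
# instead of mutating the input, which is the convention passes should follow.
@tvm.transform.module_pass(opt_level=0)
def functional_update_pass(mod, ctx):
    new_mod = tvm.IRModule()
    for gvar in mod.get_global_vars():
        new_mod[gvar] = mod[gvar]
    return new_mod

# Applying the pass leaves the caller's `mod` intact:
# new_mod = functional_update_pass(mod)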

cc @zhiics. I think Zhi has fixed a few cases like this before, so he might have more ideas.

Yeah, @jroesch explained all of it. This should be caused by some passes that mutate the IRModule. We fixed some of them before by using CopyOnWrite. It looks like there are still more passes mutating the module.

Thank you for the explanations. It appears to be a bug (albeit not a significant one), so I created an issue on GitHub to keep track of it.