ONNX type mismatch when building with opt level 1

Hey all,

I am working on a model that is written in PyTorch and exported to ONNX. During relay.build with opt level = 1, I ran into a type mismatch. The error does not occur with opt level = 0:

TypeError: Check failed: a.dtype() == b.dtype(): mismatched types
Error during compile function
-----------------------------
v0.0.4
fn (%p0: Tensor[(7, 1, 32, 1536), float32], Primitive=1) -> Tensor[(7, 1, 1536), float32] {
  %0 = reshape(%p0, newshape=[7, 1, -1, 1536]) /* ty=Tensor[(7, 1, 32, 1536), float32] */;
  %1 = take(%0, 0 /* ty=int64 */, axis=2) /* ty=Tensor[(7, 1, 1536), float32] */;
  reshape(%1, newshape=[-1, 1, 1536]) /* ty=Tensor[(7, 1, 1536), float32] */
}

The check fails in expr.h, in BinaryOpNode::make. The issue seems to stem from the second input of take being an int64. I was able to work around it by casting this argument to int32 in the ONNX frontend, but that solution really isn't ideal.

I traced the take call back to this line of PyTorch:

new_data = all_data[:,:,0]

This is not the first time I've seen small type mismatch errors like this when importing from ONNX, and I don't think that making a change on the PyTorch side is the right way to fix it (I'm not even sure there is a way to fix it, given the simplicity of this line).
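As a sanity check with plain numpy (a stand-in for the Relay semantics, not TVM itself): the slice above lowers to a take along axis 2 with a scalar index, and the ONNX exporter encodes that constant as int64.

```python
import numpy as np

# Shapes match the fused function in the error message.
all_data = np.random.rand(7, 1, 32, 1536).astype(np.float32)

sliced = all_data[:, :, 0]            # the original PyTorch idiom
taken = np.take(all_data, 0, axis=2)  # what the exported graph expresses

assert sliced.shape == taken.shape == (7, 1, 1536)
assert np.array_equal(sliced, taken)
```

So the graph itself is fine; the only thing Relay objects to is the dtype of the index constant.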

Is there a way we can fix this in TVM, for example by casting the int64 to int32? Also, why does it only show up at opt level 1? Presumably this has to do with fusion, but I haven't been able to figure out the root cause.

Thanks!

cc @jwfromm @masahi

Yeah, I also remember being annoyed by this int32 vs int64 issue. I sent some PRs below, but I don't have a good solution.

Fortunately, now that we have the PyTorch frontend I don't need to deal with other frameworks, so I'm not using ONNX any more. I recommend switching to the Torch frontend if that is also an option for you.
