[AUTOTVM] Why do some deep learning models that can be compiled to NNVM fail for auto-tuning

Hi,

I’m reading the auto-tuning tutorial and am interested in combining it with compiling deep learning models. It looks like we can get net and params from the various nnvm frontends and then do the auto-tuning as in the tutorial. So I tried a CoreML SqueezeNet 1.1 example (converted from caffe by coremltools; caffe model: https://github.com/DeepScale/SqueezeNet). It compiles to nnvm successfully, as in the from_coreml tutorial. However, if I change the network selection in the tuning script to

......
elif name == 'custom':
    import coremltools
    coreml_model = coremltools.models.MLModel('squeezenet.mlmodel')
    net, params = nnvm.frontend.from_coreml(coreml_model)
......
network = 'custom'
......

then it reports this error:

Traceback (most recent call last):
 File "tune_nnvm_cuda.py", line 262, in <module>
   tune_and_evaluate(tuning_option)
 File "tune_nnvm_cuda.py", line 227, in tune_and_evaluate
   symbols=(nnvm.sym.conv2d,))
 File "/tvm/python/tvm/autotvm/task/nnvm_integration.py", line 248, in extract_from_graph
   nnvm.compiler.build(graph, target=tracing_target, shape=shape, dtype=dtype)
 File "/tvm/nnvm/python/nnvm/compiler/build_module.py", line 305, in build
   graph = graph.apply("GraphCompile")
 File "/tvm/nnvm/python/nnvm/graph.py", line 234, in apply
   check_call(_LIB.NNGraphApplyPasses(self.handle, npass, cpass, ctypes.byref(ghandle)))
 File "/tvm/nnvm/python/nnvm/_base.py", line 75, in check_call
   raise NNVMError(py_str(_LIB.NNGetLastError()))
nnvm._base.NNVMError: [21:51:25] /tvm/nnvm/src/compiler/compile_engine.cc:212: Check failed: out[i].ndim() == out_info[i].ndim() (4 vs. 0) broadcast_add

What I don’t understand: if regular compilation and auto-tuning both call nnvm.compiler.build, just with different targets, why does regular compilation succeed while auto-tuning fails? Could anyone explain? Thanks.

P.S.: I’m not actually after tuned parameters for SqueezeNet 1.1; I only picked it as an example because it’s small. What I’m interested in is compiling and auto-tuning deep learning models in general.

The from_coreml tutorial uses a cuda target. Task extraction in autotvm uses a customized llvm target.
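
Concretely, the two build calls differ roughly like this (a sketch; both calls are taken from the tracebacks and the tutorial, so the variable names come from those scripts):

# normal compile (from_coreml tutorial): cuda target, params passed in
graph, lib, params = nnvm.compiler.build(sym, target, shape_dict, params=params)

# task extraction (tvm/autotvm/task/nnvm_integration.py): a customized
# llvm-based tracing target, and no params in this version
nnvm.compiler.build(graph, target=tracing_target, shape=shape, dtype=dtype)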

Maybe @masahi has some thoughts?

I’ve seen the same error in a different context (when I was rolling my own Winograd integration into nnvm). If I remember correctly, the problem is that this loop is not executed at all, i.e. shape_vec is corrupted. I have no idea why this happens with the tracing target.

@masahi

I hit another error on a different model (caffe -> coreml -> nnvm; it compiles to nnvm fine without auto-tuning).

Traceback (most recent call last):
 File "tune_nnvm_cuda.py", line 253, in <module>
   tune_and_evaluate(tuning_option)
 File "tune_nnvm_cuda.py", line 218, in tune_and_evaluate
   symbols=(nnvm.sym.conv2d,))
 File "/tvm/python/tvm/autotvm/task/nnvm_integration.py", line 248, in extract_from_graph
   nnvm.compiler.build(graph, target=tracing_target, shape=shape, dtype=dtype)
 File "/tvm/nnvm/python/nnvm/compiler/build_module.py", line 281, in build
   graph = optimize(graph, shape, dtype, layout)
 File "/tvm/nnvm/python/nnvm/compiler/build_module.py", line 176, in optimize
   graph = graph.apply(["InferShape", "SimplifyInference"])
 File "/tvm/nnvm/python/nnvm/graph.py", line 234, in apply
   check_call(_LIB.NNGraphApplyPasses(self.handle, npass, cpass, ctypes.byref(ghandle)))
 File "/tvm/nnvm/python/nnvm/_base.py", line 75, in check_call
   raise NNVMError(py_str(_LIB.NNGetLastError()))
nnvm._base.NNVMError: [23:10:19] /tvm/nnvm/src/compiler/simplify_inference.cc:27: Check failed: dshape.ndim() != 0 (0 vs. 0)

Does this come from the same cause?

If I cannot auto-tune the entire model, can I tune each individual conv2d?

The auto-tuning examples generate logs. Should I append these logs to ~/.tvm/tophub/cuda_v0.02.log? Or should I delete the default log and replace it with the generated one?

Not sure, but the error happens after the “InferShape” pass, so if the inferred shape is corrupted, then yes, it would be the same problem.

Yes, in principle you can auto-tune each layer manually. But that would be very tedious, so I don’t recommend it. Issues that only appear with the “tracing” target have shown up elsewhere as well. Maybe we should look into what is going on.

It is not easy to tune individual layers manually.

Can you upload some models so that we can reproduce the errors and try to fix them?

When you want to use your own log, you can append it to ~/.tvm/tophub/cuda_v0.02.log, or use autotvm.apply_history_best to explicitly load a log file.
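
A minimal usage sketch (assuming a log file named my_tuning.log produced by the tuner’s log_to_file callback, and net/input_shape/params from the tuning script):

from tvm import autotvm

# compile with the best configs found in your own log file
with autotvm.apply_history_best('my_tuning.log'):
    graph, lib, params = nnvm.compiler.build(
        net, target='cuda', shape={'data': input_shape}, params=params)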

@merrymercy

This is the SqueezeNet model I used: https://github.com/jjiang2cal/autotvm_tune/blob/master/models/SqueezeNet_v1.1.mlmodel. The original caffe model is from https://github.com/DeepScale/SqueezeNet/tree/master/SqueezeNet_v1.1. This model can reproduce the first error I saw.

The second error can be reproduced with the caffe ResNet-50 model from https://onedrive.live.com/?authkey=!AAFW2-FVoxeVRck&id=4006CBB8476FF777!17887&cid=4006CBB8476FF777. Use coremltools to convert the caffe model to CoreML, and then convert that to nnvm. The file is about 100 MB, so I did not upload the mlmodel file, but I can provide it if necessary.

Also, for the second case, it compiles to nnvm at the default opt_level=2. However, if I change opt_level to 3 (roughly as sketched below), I get the error that follows. Not sure if this is related. SqueezeNet can compile to nnvm at opt_level=3, though.
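
A sketch of the change (the build call is the one from my from_coreml script; I’m assuming build_config is the intended way to raise the optimization level):

# wrap the build in build_config to go from the default opt_level=2 to 3
with nnvm.compiler.build_config(opt_level=3):
    graph, lib, params = nnvm.compiler.build(sym, target, shape_dict, params=params)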

Traceback (most recent call last):
  File "from_coreml.py", line 79, in <module>
    graph, lib, params = nnvm.compiler.build(sym, target, shape_dict, params=params)
  File "/tvm/nnvm/python/nnvm/compiler/build_module.py", line 292, in build
    graph = graph.apply("InferShape")
  File "/tvm/nnvm/python/nnvm/graph.py", line 234, in apply
    check_call(_LIB.NNGraphApplyPasses(self.handle, npass, cpass, ctypes.byref(ghandle)))
  File "/tvm/nnvm/python/nnvm/_base.py", line 75, in check_call
    raise NNVMError(py_str(_LIB.NNGetLastError()))
nnvm._base.NNVMError: Error in operator conv2d1: [21:44:02] /tvm/nnvm/src/top/nn/convolution.cc:65: Check failed: dshape.ndim() == 4U (5 vs. 4) Input data should be 4D

Stack trace returned 10 entries:
[bt] (0) /tvm/build/libtvm.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f5603b1b5aa]
[bt] (1) /tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f5603b1c158]
[bt] (2) /tvm/nnvm/python/nnvm/../../../build/libnnvm_compiler.so(nnvm::top::Conv2DInferShape(nnvm::NodeAttrs const&, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*)+0x7d9) [0x7f55ff66fe99]
[bt] (3) /tvm/nnvm/python/nnvm/../../../build/libnnvm_compiler.so(+0x130b81) [0x7f55ff5aab81]
[bt] (4) /tvm/nnvm/python/nnvm/../../../build/libnnvm_compiler.so(+0x131eaa) [0x7f55ff5abeaa]
[bt] (5) /tvm/nnvm/python/nnvm/../../../build/libnnvm_compiler.so(+0x132d96) [0x7f55ff5acd96]
[bt] (6) /tvm/nnvm/python/nnvm/../../../build/libnnvm_compiler.so(nnvm::ApplyPasses(nnvm::Graph, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)+0x32b) [0x7f55ff569e9b]
[bt] (7) /tvm/nnvm/python/nnvm/../../../build/libnnvm_compiler.so(NNGraphApplyPasses+0x348) [0x7f55ff552f28]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f5644b15e40]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f5644b158ab]

Thanks for looking into it.

@jjiang2cal I tried your script and found the issue.

If you pass params to the nnvm.compiler.build call in task extraction, as you do for the normal build, then this error disappears.
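
Roughly this change in python/tvm/autotvm/task/nnvm_integration.py (a sketch; params would also need to be passed down into extract_from_graph):

# before:
# nnvm.compiler.build(graph, target=tracing_target, shape=shape, dtype=dtype)
# after: hand the params to the tracing build, as the normal compile path does
nnvm.compiler.build(graph, target=tracing_target, shape=shape, dtype=dtype,
                    params=params)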

Maybe we can send a patch, although I don’t know why this helps.

We shouldn’t pass params into it; we should be able to infer the shapes even without params. I have hit this issue several times, and every time the real cause turned out to be something else, not related to the params. Also, in our environment, I find that if we pass params, we cannot train using multiple CPU cores.

As for the multi-CPU issue: it happens because the thread pool in tvm is incompatible with Python’s multiprocessing package. After executing a tvm function, we cannot use multiprocessing in Python anymore.

If you pass params, then nnvm will run a tvm function to transform the params, which breaks Python multiprocessing.

My solution is to launch a new Python thread (a thread is enough; you don’t need a separate process) to run task extraction. This separates the environments and it works. Alternatively, you can pickle the tasks in one script and tune them in another.
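
A rough sketch of the thread approach (the extract_from_graph call and its arguments are the ones from the tutorial script, so net, target, input_shape, and the exact keyword names come from there):

import threading

tasks = []

def _extract():
    # run task extraction in its own thread so that the later
    # multiprocessing-based tuning is not affected
    tasks.extend(autotvm.task.extract_from_graph(
        net, target=target, shape={'data': input_shape}, dtype='float32',
        symbols=(nnvm.sym.conv2d,)))

t = threading.Thread(target=_extract)
t.start()
t.join()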

For the problem in this thread, I agree with you: passing params can be a quick fix, but there must be something wrong somewhere else.

Yes, passing the params currently makes the code run. Thanks for the quick patch.

Why does the normal build need params, while autotvm does not, if task extraction is supposed to work without knowing the params?

Ideally, both cases should work, but there is something wrong in either the model converters or the nnvm compiler. I don’t currently plan to look into it.

So for now you can pass params for your models. As I mentioned before, you also have to use another thread (or another script) to do the task extraction, to avoid the multiprocessing issue.