I’m reading the auto-tuning tutorial and am interested in combining it with compiling deep learning models. It looks like we can get the net and params from the various nnvm frontends and then do the auto-tuning as in the tutorial with ease. So I tried a CoreML SqueezeNet 1.1 example (converted from Caffe by coremltools; Caffe model: https://github.com/DeepScale/SqueezeNet). It compiles with nnvm successfully, just as in the from_coreml tutorial. However, when I run task extraction for auto-tuning on it, I get this error:
Traceback (most recent call last):
File "tune_nnvm_cuda.py", line 262, in <module>
tune_and_evaluate(tuning_option)
File "tune_nnvm_cuda.py", line 227, in tune_and_evaluate
symbols=(nnvm.sym.conv2d,))
File "/tvm/python/tvm/autotvm/task/nnvm_integration.py", line 248, in extract_from_graph
nnvm.compiler.build(graph, target=tracing_target, shape=shape, dtype=dtype)
File "/tvm/nnvm/python/nnvm/compiler/build_module.py", line 305, in build
graph = graph.apply("GraphCompile")
File "/tvm/nnvm/python/nnvm/graph.py", line 234, in apply
check_call(_LIB.NNGraphApplyPasses(self.handle, npass, cpass, ctypes.byref(ghandle)))
File "/tvm/nnvm/python/nnvm/_base.py", line 75, in check_call
raise NNVMError(py_str(_LIB.NNGetLastError()))
nnvm._base.NNVMError: [21:51:25] /tvm/nnvm/src/compiler/compile_engine.cc:212: Check failed: out[i].ndim() == out_info[i].ndim() (4 vs. 0) broadcast_add
I don’t understand: if regular compilation and auto-tuning both call nnvm.compiler.build, just with different targets, why does regular compilation succeed while auto-tuning fails? Could anyone explain it? Thanks.
P.S.: I’m not interested in tuned parameters for SqueezeNet 1.1 specifically; I just took it as an example because it’s small. What I’m interested in is compiling and auto-tuning a general deep learning model.
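For reference, here is roughly the flow I am using (a minimal sketch adapted from the from_coreml and tune_nnvm_cuda tutorials; the model file name, the input name "image", and the input shape are placeholders, not the exact values from my script):

import coremltools
import nnvm
import nnvm.compiler
import nnvm.frontend
from tvm import autotvm

# Load the CoreML SqueezeNet 1.1 converted from Caffe with coremltools.
# The file name, the input name "image" and the shape are placeholders.
mlmodel = coremltools.models.MLModel("squeezenet_v1.1.mlmodel")
sym, params = nnvm.frontend.from_coreml(mlmodel)

shape_dict = {"image": (1, 3, 227, 227)}
target = "cuda"

# Regular compilation succeeds, as in the from_coreml tutorial.
graph, lib, params = nnvm.compiler.build(sym, target, shape_dict, params=params)

# Task extraction for auto-tuning (same call as in tune_nnvm_cuda.py)
# fails with the broadcast_add shape check above.
tasks = autotvm.task.extract_from_graph(sym, target=target,
                                        shape=shape_dict, dtype="float32",
                                        symbols=(nnvm.sym.conv2d,))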
I’ve seen the same error in a different context (when I was rolling my own Winograd integration into nnvm). If I remember correctly, the problem is that this loop is not executed at all, i.e. shape_vec is corrupted. I have no idea why this happens with the tracing target.
If I cannot auto-tune the entire model, can I tune each individual conv2d?
The auto-tune examples generate logs. Should I append these logs to ~/.tvm/tophub/cuda_v0.02.log? Should I delete the default log and replace it with the generated log?
Yes, in principle you can auto-tune individual layers manually, but that would be very tedious, so I wouldn’t recommend doing it. Issues that only appear with the “tracing” target have happened elsewhere as well. Maybe we should look into what is going on. If you do want to tune a single layer by hand, a rough sketch is below.
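This is only a sketch: the shapes, strides, and padding are made up, "topi_nn_conv2d" is the template name I believe the nnvm integration registers, and the measure_option form follows the current tune_nnvm_cuda.py tutorial, so adjust it to your TVM version.

from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

# One conv2d workload described the way extracted tasks are serialized:
# ('TENSOR', shape, dtype) placeholders, then strides, padding, layout, out_dtype.
data = ('TENSOR', (1, 3, 227, 227), 'float32')
kernel = ('TENSOR', (64, 3, 3, 3), 'float32')
task = autotvm.task.create("topi_nn_conv2d",
                           args=(data, kernel, (2, 2), (0, 0), 'NCHW', 'float32'),
                           target='cuda', template_key='direct')

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=10),
    runner=autotvm.LocalRunner(number=20, repeat=3, timeout=4))

# Tune this single workload and log the results, as in the tutorial.
tuner = XGBTuner(task, loss_type='rank')
tuner.tune(n_trial=1000,
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file('conv2d_manual.log')])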
Also, for the second model, it compiles with nnvm at the default opt_level=2. However, if I change opt_level to 3, I get the error below (how I set opt_level is sketched after the traceback). Not sure if this is related. SqueezeNet does compile with nnvm at opt_level=3, though.
Traceback (most recent call last):
File "from_coreml.py", line 79, in <module>
graph, lib, params = nnvm.compiler.build(sym, target, shape_dict, params=params)
File "/tvm/nnvm/python/nnvm/compiler/build_module.py", line 292, in build
graph = graph.apply("InferShape")
File "/tvm/nnvm/python/nnvm/graph.py", line 234, in apply
check_call(_LIB.NNGraphApplyPasses(self.handle, npass, cpass, ctypes.byref(ghandle)))
File "/tvm/nnvm/python/nnvm/_base.py", line 75, in check_call
raise NNVMError(py_str(_LIB.NNGetLastError()))
nnvm._base.NNVMError: Error in operator conv2d1: [21:44:02] /tvm/nnvm/src/top/nn/convolution.cc:65: Check failed: dshape.ndim() == 4U (5 vs. 4) Input data should be 4D
Stack trace returned 10 entries:
[bt] (0) /tvm/build/libtvm.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f5603b1b5aa]
[bt] (1) /tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f5603b1c158]
[bt] (2) /tvm/nnvm/python/nnvm/../../../build/libnnvm_compiler.so(nnvm::top::Conv2DInferShape(nnvm::NodeAttrs const&, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*)+0x7d9) [0x7f55ff66fe99]
[bt] (3) /tvm/nnvm/python/nnvm/../../../build/libnnvm_compiler.so(+0x130b81) [0x7f55ff5aab81]
[bt] (4) /tvm/nnvm/python/nnvm/../../../build/libnnvm_compiler.so(+0x131eaa) [0x7f55ff5abeaa]
[bt] (5) /tvm/nnvm/python/nnvm/../../../build/libnnvm_compiler.so(+0x132d96) [0x7f55ff5acd96]
[bt] (6) /tvm/nnvm/python/nnvm/../../../build/libnnvm_compiler.so(nnvm::ApplyPasses(nnvm::Graph, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)+0x32b) [0x7f55ff569e9b]
[bt] (7) /tvm/nnvm/python/nnvm/../../../build/libnnvm_compiler.so(NNGraphApplyPasses+0x348) [0x7f55ff552f28]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f5644b15e40]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f5644b158ab]
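For clarity, this is how I switch to opt_level=3 (a minimal sketch; sym, target, shape_dict and params are whatever the frontend produced for that model):

import nnvm.compiler

# The default build uses opt_level=2; wrapping the build in build_config
# with opt_level=3 is what triggers the InferShape error above.
with nnvm.compiler.build_config(opt_level=3):
    graph, lib, params = nnvm.compiler.build(sym, target, shape_dict, params=params)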
We shouldn’t have to pass params into it; we should be able to infer the shapes even if we don’t have params. I have met this issue several times, and every time I found that something else was wrong, not the params. Also, in our environment, I found that if we pass params, we cannot tune using multiple CPU cores.
Regarding the multi-CPU issue: it is because the thread pool in TVM is incompatible with Python’s multiprocessing package. After executing a TVM function, we cannot use multiprocessing in Python anymore. If you pass params, then nnvm will run a TVM function to transform the params, which breaks Python multiprocessing. My solution is to launch a new Python thread (a thread is enough; no need for a separate process) to run task extraction. This separates the environment and it works. Alternatively, you can pickle the tasks with one script and tune them with another script.
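A rough sketch of the thread-based workaround (the wrapper function is mine, just to illustrate the idea; extract_from_graph is called the same way as in the tutorial):

import threading
import nnvm
from tvm import autotvm

def extract_tasks_in_thread(sym, target, shape, dtype):
    # Run task extraction in a fresh thread so the TVM functions it triggers
    # stay out of the main thread's environment and multiprocessing keeps
    # working there (the workaround described above).
    result = []

    def _worker():
        tasks = autotvm.task.extract_from_graph(
            sym, target=target, shape=shape, dtype=dtype,
            symbols=(nnvm.sym.conv2d,))
        result.extend(tasks)

    t = threading.Thread(target=_worker)
    t.start()
    t.join()
    return result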
For the problem in this thread, I agree with you. Passing params can be a quick fix, but there must be something wrong elsewhere.
Ideally, both cases should work, but something is wrong in either the model converters or the nnvm compiler. I do not plan to look into it.
So for now you can pass params for your models. As I mentioned before, you also have to use another thread (or another script) to do the task extraction to avoid the multiprocessing issue.