Meta schedule not able to find a valid schedule for cuda

EDIT: I solved this. The issue was that the number parameter was left at its default of 3, which I believe was too low, even when I increased the number of trials.
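For anyone hitting the same thing, here is a minimal sketch of the fix, assuming the parameter in question is EvaluatorConfig.number (how many times each candidate is run per measurement); adapt it to however your script builds its runner:

from tvm.meta_schedule.runner import EvaluatorConfig, LocalRunner

# Assumption: "number" refers to EvaluatorConfig.number, i.e. how many runs
# per measurement; the default of 3 proved too low in my case.
evaluator_config = EvaluatorConfig(
    number=10,  # raised from the default of 3
    repeat=1,
    min_repeat_ms=100,
    enable_cpu_cache_flush=False,  # matches --cpu-flush False
)
runner = LocalRunner(evaluator_config=evaluator_config)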

I am running a script to tune the YOLOX object detection model with meta schedule, based on this example from tvm.

However, it seems the meta schedule cannot find any valid schedules for cuda (I tested with llvm and tuning succeeded), even with as many as 10000 trials. Here is a screenshot of the tuning information after 10000 trials targeting cuda.

Something else I noticed is that I need to specify my GPU attributes manually or I get an error, which seems a bit strange, like this: --target "cuda -max_threads_per_block 1024 -max_shared_memory_per_block 49152"
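For reference, the same target can also be constructed directly in Python. This is just a sketch; the =-separated attribute syntax and the preset tag name are my assumptions:

from tvm.target import Target

# Equivalent target with the GPU attributes spelled out explicitly.
target = Target("cuda -max_threads_per_block=1024 -max_shared_memory_per_block=49152")

# A preset tag can fill these attributes in automatically if one matches
# your GPU (this tag name is illustrative).
target = Target("nvidia/geforce-rtx-3070")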

Lastly, here is the full command I am running: python tools/onnx_tune.py --model-name yolox-tiny --onnx-path local_models/yolox_tiny.onnx --input-shape '[{"name" : "images", "dtype" : "fp32", "shape" : [1, 3, 416, 416]}]' --target "cuda -max_threads_per_block 1024 -max_shared_memory_per_block 49152" --num-trials 10000 --work-dir ~/dev/YOLOX/meta_schedule/ --cpu-flush False --backend graph

Let me know if I can provide any more useful information.

Interesting. Did you check the per-workload log for more details? If any trial fails, the error message should be recorded in the per-workload log.

Glancing over the log, it looks like all of the trials have errors, and the errors are directly related to actually running the schedule on the GPU, as “VerifyGPUCode” is where all the failures are. The actual error itself (at the end of the output below) does not seem to provide very much information.

Here is an example of log output for trial 1:

2023-03-05 19:44:39 [INFO] [evolutionary_search.cc:713] Generating candidates......
2023-03-05 19:44:39 [INFO] [evolutionary_search.cc:715] Picked top 0 candidate(s) from database
2023-03-05 19:44:40 [INFO] [evolutionary_search.cc:533] Sample-Init-Population summary:
Postproc #0 [meta_schedule.DisallowDynamicLoop(0x76bc578)]: 0 failure(s)
Postproc #1 [meta_schedule.RewriteCooperativeFetch(0x8906ce8)]: 0 failure(s)
Postproc #2 [meta_schedule.RewriteUnboundBlock(0x8924d78)]: 0 failure(s)
Postproc #3 [meta_schedule.RewriteParallelVectorizeUnroll(0x8314a28)]: 0 failure(s)
Postproc #4 [meta_schedule.RewriteReductionBlock(0x75152f8)]: 0 failure(s)
Postproc #5 [meta_schedule.VerifyGPUCode(0x72b2648)]: 429 failure(s)
Postproc #6 [meta_schedule.RewriteTensorize(0x4a78028)]: 0 failure(s)
2023-03-05 19:44:40 [INFO] [evolutionary_search.cc:723] Sampled 83 candidate(s)
2023-03-05 19:44:41 [INFO] [evolutionary_search.cc:621] Evolve iter #0 done. Summary:
Postproc #0 [meta_schedule.DisallowDynamicLoop(0x76bc578)]: 0 failure(s)
Postproc #1 [meta_schedule.RewriteCooperativeFetch(0x8906ce8)]: 0 failure(s)
Postproc #2 [meta_schedule.RewriteUnboundBlock(0x8924d78)]: 0 failure(s)
Postproc #3 [meta_schedule.RewriteParallelVectorizeUnroll(0x8314a28)]: 0 failure(s)
Postproc #4 [meta_schedule.RewriteReductionBlock(0x75152f8)]: 0 failure(s)
Postproc #5 [meta_schedule.VerifyGPUCode(0x72b2648)]: 127 failure(s)
Postproc #6 [meta_schedule.RewriteTensorize(0x4a78028)]: 0 failure(s)
2023-03-05 19:44:43 [INFO] [evolutionary_search.cc:621] Evolve iter #1 done. Summary:
Postproc #0 [meta_schedule.DisallowDynamicLoop(0x76bc578)]: 0 failure(s)
Postproc #1 [meta_schedule.RewriteCooperativeFetch(0x8906ce8)]: 0 failure(s)
Postproc #2 [meta_schedule.RewriteUnboundBlock(0x8924d78)]: 0 failure(s)
Postproc #3 [meta_schedule.RewriteParallelVectorizeUnroll(0x8314a28)]: 0 failure(s)
Postproc #4 [meta_schedule.RewriteReductionBlock(0x75152f8)]: 0 failure(s)
Postproc #5 [meta_schedule.VerifyGPUCode(0x72b2648)]: 135 failure(s)
Postproc #6 [meta_schedule.RewriteTensorize(0x4a78028)]: 0 failure(s)
2023-03-05 19:44:44 [INFO] [evolutionary_search.cc:621] Evolve iter #2 done. Summary:
Postproc #0 [meta_schedule.DisallowDynamicLoop(0x76bc578)]: 0 failure(s)
Postproc #1 [meta_schedule.RewriteCooperativeFetch(0x8906ce8)]: 0 failure(s)
Postproc #2 [meta_schedule.RewriteUnboundBlock(0x8924d78)]: 0 failure(s)
Postproc #3 [meta_schedule.RewriteParallelVectorizeUnroll(0x8314a28)]: 0 failure(s)
Postproc #4 [meta_schedule.RewriteReductionBlock(0x75152f8)]: 0 failure(s)
Postproc #5 [meta_schedule.VerifyGPUCode(0x72b2648)]: 124 failure(s)
Postproc #6 [meta_schedule.RewriteTensorize(0x4a78028)]: 0 failure(s)
2023-03-05 19:44:45 [INFO] [evolutionary_search.cc:621] Evolve iter #3 done. Summary:
Postproc #0 [meta_schedule.DisallowDynamicLoop(0x76bc578)]: 0 failure(s)
Postproc #1 [meta_schedule.RewriteCooperativeFetch(0x8906ce8)]: 0 failure(s)
Postproc #2 [meta_schedule.RewriteUnboundBlock(0x8924d78)]: 0 failure(s)
Postproc #3 [meta_schedule.RewriteParallelVectorizeUnroll(0x8314a28)]: 0 failure(s)
Postproc #4 [meta_schedule.RewriteReductionBlock(0x75152f8)]: 0 failure(s)
Postproc #5 [meta_schedule.VerifyGPUCode(0x72b2648)]: 132 failure(s)
Postproc #6 [meta_schedule.RewriteTensorize(0x4a78028)]: 0 failure(s)
2023-03-05 19:44:45 [INFO] [evolutionary_search.cc:649] Scores of the best 64 candidates:
[1 : 16]:	0.9995  0.9977  0.9970  0.9969  0.9966  0.9949  0.9940  0.9935  0.9932  0.9930  0.9916  0.9903  0.9896  0.9889  0.9884  0.9876
[17 : 32]:	0.9865  0.9865  0.9863  0.9859  0.9842  0.9833  0.9827  0.9824  0.9820  0.9818  0.9816  0.9812  0.9804  0.9796  0.9792  0.9765
[33 : 48]:	0.9765  0.9760  0.9759  0.9752  0.9747  0.9746  0.9743  0.9726  0.9718  0.9716  0.9714  0.9690  0.9688  0.9674  0.9669  0.9652
[49 : 64]:	0.9650  0.9649  0.9640  0.9637  0.9632  0.9622  0.9617  0.9609  0.9608  0.9587  0.9587  0.9582  0.9571  0.9567  0.9543  0.9542
2023-03-05 19:44:45 [INFO] [evolutionary_search.cc:727] Got 64 candidate(s) with evolutionary search
2023-03-05 19:44:45 [INFO] [evolutionary_search.cc:730] Sending 64 candidates(s) for measurement
2023-03-05 20:24:46 [INFO] [task_scheduler.cc:121] [Task #0: fused_nn_conv2d_add_sigmoid] Trial #1: Error in running:
LocalRunner: An exception occurred
Subprocess terminated

Looks like all failures happen in LocalRunner. Any outputs from it?

I don’t see anything from LocalRunner (assuming that would appear in the log file or on the terminal). Besides output like the excerpt I sent in the last message, the log file seems to just contain ir_module definitions and schedule transformations (if I’m using the terminology correctly).

I should note, just in case, that there is a warning output to the terminal when tuning, but it appears whether I am using llvm or cuda as the target:

2023-03-07 00:38:59 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_conv2d_add_sigmoid"
[00:38:59] /home/benjamin-gilby/tvm_env/tvm/src/meta_schedule/database/json_database.cc:149: Warning: The size of the GetTopK result is smaller than requested. There are not enough valid records in the database for this workload.
2023-03-07 00:39:06 [INFO] [task_scheduler.cc:193] Sending 64 sample(s) to builder
2023-03-07 00:39:18 [INFO] [task_scheduler.cc:195] Sending 64 sample(s) to runner

The warning is fine and should be ignored. I am not sure why the LocalRunner didn’t throw any error messages. How about switching to RPCRunner and seeing how it goes?

If configuring RPC tracker/server is too much trouble, you may fake an RPC process. Example:

from tvm.meta_schedule.runner import RPCConfig, RPCRunner
from tvm.meta_schedule.testing.local_rpc import LocalRPC

# Start an in-process RPC tracker/server pair and point the runner at it.
with LocalRPC() as rpc:
    rpc_runner = RPCRunner(
        rpc_config=RPCConfig(
            tracker_host=rpc.tracker_host,
            tracker_port=rpc.tracker_port,
            tracker_key=rpc.tracker_key,
        ),
    )
    ...  # pass rpc_runner to the tuning entry point in place of LocalRunner
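In case it helps, this is roughly how the runner would be threaded into a tuning call; tune_relay and the argument names here are my guess at the entry point your script wraps, so adapt them to whatever tools/onnx_tune.py actually calls:

from tvm import meta_schedule as ms

# Sketch only: the entry point and argument names below are assumptions about
# the script's internals, not taken from this thread. "rpc_runner" is the
# runner built above; mod/params come from the ONNX import step.
mod, params = ...
database = ms.relay_integration.tune_relay(
    mod=mod,
    params=params,
    target="cuda -max_threads_per_block=1024 -max_shared_memory_per_block=49152",
    work_dir="meta_schedule_work_dir",
    max_trials_global=10000,
    runner=rpc_runner,
)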

I faked an RPC process in the way you showed. The tuning result looked the same as before, but this time the log has more information about the error. I’m investigating the stack trace now:

Trial #1: Error in running:
RPCRunner: An exception occurred
Traceback (most recent call last):
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/meta_schedule/runner/rpc_runner.py", line 377, in resource_handler
    yield
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/meta_schedule/runner/rpc_runner.py", line 408, in _worker_func
    repeated_args,
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/meta_schedule/runner/rpc_runner.py", line 515, in default_run_evaluator
    return run_evaluator_common(rt_mod, device, evaluator_config, repeated_args)
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/meta_schedule/runner/utils.py", line 117, in run_evaluator_common
    profile_result = evaluator(*args)
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/runtime/module.py", line 357, in evaluator
    blob = feval(*args)
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  3: TVMFuncCall
  2: tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
  1: tvm::runtime::RPCClientSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)
  0: tvm::runtime::RPCEndpoint::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)>)
  File "/home/benjamin-gilby/tvm_env/tvm/src/runtime/rpc/rpc_endpoint.cc", line 804
TVMError: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (code == RPCCode::kReturn) is false: code=kShutdown

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/exec/popen_worker.py", line 87, in main
    result = fn(*args, **kwargs)
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/meta_schedule/runner/rpc_runner.py", line 408, in _worker_func
    repeated_args,
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/meta_schedule/runner/rpc_runner.py", line 381, in resource_handler
    f_cleanup(session, remote_path)
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/meta_schedule/runner/rpc_runner.py", line 532, in default_cleanup
    session.remove(remote_path)
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/rpc/client.py", line 144, in remove
    self._remote_funcs["remove"] = self.get_function("tvm.rpc.server.remove")
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/rpc/client.py", line 72, in get_function
    return self._sess.get_function(name)
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/runtime/module.py", line 171, in get_function
    self.handle, c_str(name), ctypes.c_int(query_imports), ctypes.byref(ret_handle)
  File "/home/benjamin-gilby/tvm_env/tvm/python/tvm/_ffi/base.py", line 348, in check_call
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  50: 0xffffffffffffffff
  49: _start
  48: __libc_start_main
  47: _Py_UnixMain
  46: 0x0000000000650da0
  45: 0x0000000000650afa
  44: _PyFunction_FastCallDict
  43: _PyEval_EvalCodeWithName
  42: _PyEval_EvalFrameDefault
  41: _PyFunction_FastCallKeywords
  40: _PyEval_EvalCodeWithName
  39: _PyEval_EvalFrameDefault
  38: _PyMethodDef_RawFastCallKeywords
  37: 0x0000000000546369
  36: _PyEval_EvalCodeWithName
  35: _PyEval_EvalFrameDefault
  34: _PyFunction_FastCallKeywords
  33: _PyEval_EvalCodeWithName
  32: _PyEval_EvalFrameDefault
  31: _PyFunction_FastCallDict
  30: _PyEval_EvalCodeWithName
  29: _PyEval_EvalFrameDefault
  28: _PyObject_FastCallDict
  27: 0x00000000004c06e1
  26: _PyFunction_FastCallDict
  25: _PyEval_EvalFrameDefault
  24: _PyMethodDescr_FastCallKeywords
  23: 0x00000000005dcb58
  22: 0x00000000005dc83f
  21: 0x00000000004ba127
  20: _PyEval_EvalFrameDefault
  19: _PyFunction_FastCallKeywords
  18: _PyEval_EvalFrameDefault
  17: _PyFunction_FastCallKeywords
  16: _PyEval_EvalFrameDefault
  15: _PyFunction_FastCallKeywords
  14: _PyEval_EvalFrameDefault
  13: _PyFunction_FastCallKeywords
  12: _PyEval_EvalCodeWithName
  11: _PyEval_EvalFrameDefault
  10: 0x0000000000537c30
  9: _PyObject_FastCallKeywords
  8: 0x00007fdd9f4fffa2
  7: _ctypes_callproc
  6: ffi_call
  5: ffi_call_unix64
  4: TVMModGetFunction
  3: tvm::runtime::ModuleNode::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
  2: tvm::runtime::RPCModuleNode::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)
  1: tvm::runtime::RPCClientSession::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  0: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::RPCEndpoint::Init()::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#2}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  File "/home/benjamin-gilby/tvm_env/tvm/src/runtime/rpc/rpc_endpoint.cc", line 684
TVMError: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (code == RPCCode::kReturn) is false: code=1

I’m not sure, but is this method (tvm.rpc.server.remove) missing? Would you like to check your TVM build?

This means the RPC server shut down unexpectedly for some reason…

I see what you are saying. Excuse my inexperience, but how can I check my TVM build to see if tvm.rpc.server.remove is present?

Hi, part of this tutorial covers how to set up an RPC server & tracker; would you please follow it to set up an RPC server? Once it is running, you should be able to see more logging in the terminal where you started the server.
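As for checking whether tvm.rpc.server.remove is present, one possible sanity check is to connect to a running RPC server and look the function up by name; a sketch, with placeholder host/port:

import tvm
import tvm.rpc

# Connect to a standalone RPC server (host/port here are placeholders) and
# query the function by name; get_function raises if it is not registered.
remote = tvm.rpc.connect("127.0.0.1", 9090)
try:
    remote.get_function("tvm.rpc.server.remove")
    print("tvm.rpc.server.remove is registered")
except tvm.TVMError:
    print("tvm.rpc.server.remove is missing")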