[Auto scheduler] An error from tune_network_cuda.py

@merrymercy @comaniac

I got the following CUDA_ERROR_MISALIGNED_ADDRESS error when running tutorial/auto_scheduler/tune_network_cuda.py. However, the tuning itself seems to have finished, and I see Mean inference time... printed at the end of stdout. How should I interpret this error?

----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
-------------------------------------------------
|    0 |        0.010 |           0.41 |      8 |
|    1 |        0.045 |          23.02 |      8 |
|    2 |        0.003 |          -0.00 |      8 |
|    3 |        0.262 |         542.30 |      8 |
|    4 |        0.227 |         624.89 |      8 |
|    5 |        0.269 |         528.05 |      8 |
|    6 |        0.458 |         252.62 |      8 |
|    7 |        0.242 |         475.82 |      8 |
|    8 |        0.261 |         440.17 |      8 |
|    9 |        0.159 |         721.21 |      8 |
|   10 |        0.131 |         884.44 |      8 |
|   11 |        0.127 |        1000.99 |      8 |
|   12 |        0.176 |         720.34 |      8 |
|   13 |        0.167 |         758.13 |      8 |
|   14 |        0.059 |        1946.44 |      8 |
|   15 |        0.141 |         913.46 |      8 |
|   16 |        0.098 |        1310.29 |      8 |
|   17 |        0.126 |        1015.62 |      8 |
|   18 |        0.033 |          67.22 |      8 |
|   19 |        0.112 |        2114.81 |      8 |
|   20 |        0.016 |        1570.39 |      8 |
|   21 |        0.012 |        1078.81 |      8 |
|   22 |        0.019 |         659.76 |      8 |
|   23 |        0.064 |         201.11 |      8 |
-------------------------------------------------

Estimated total latency: 3.318 ms       Trials: 192     Used time : 6437 s      Next ID: 6

----------------------------------------------------------------------

------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population       #s: 77  fail_ct: 1971   Time elapsed: 3.07
GA Iter: 0      Max score: 0.9920       Min score: 0.8032       #Pop: 16        #M+: 0  #M-: 0
GA Iter: 5      Max score: 1.0000       Min score: 0.9981       #Pop: 16        #M+: 1446       #M-: 0
GA Iter: 10     Max score: 1.0000       Min score: 0.9991       #Pop: 16        #M+: 1579       #M-: 0
EvolutionarySearch              #s: 16  Time elapsed: 97.31
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
...**terminate called after throwing an instance of 'dmlc::Error'
  what():  [12:05:35] /home/masa/projects/dev/tvm/src/runtime/cuda/cuda_module.cc:61: CUDAError: cuModuleUnload(module_[i]) failed with error: CUDA_ERROR_MISALIGNED_ADDRESS
Stack trace:
  [bt] (0) /home/masa/projects/dev/tvm/build/libtvm.so(+0x143f168) [0x7f9d6ae1a168]
  [bt] (1) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::runtime::SimpleObjAllocator::Handler<tvm::runtime::CUDAModuleNode>::Deleter_(tvm::runtime::Object*)+0x200) [0x7f9
d6ae1d220]
  [bt] (2) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::runtime::SimpleObjAllocator::Handler<tvm::runtime::LibraryModuleNode>::Deleter_(tvm::runtime::Object*)+0x1c3) [0x
7f9d6ad962b3]
  [bt] (3) /home/masa/projects/dev/tvm/build/libtvm.so(+0x13b8873) [0x7f9d6ad93873]
  [bt] (4) /home/masa/projects/dev/tvm/build/libtvm.so(+0x140f57a) [0x7f9d6adea57a]
  [bt] (5) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::runtime::LocalSession::FreeHandle(void*, int)+0x6b) [0x7f9d6ade9d5b]
  [bt] (6) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::runtime::RPCFreeHandle(tvm::runtime::RPCSession*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x72) [0x7f9
d6addd582]
  [bt] (7) /home/masa/projects/dev/tvm/build/libtvm.so(void tvm::runtime::RPCEndpoint::EventHandler::SysCallHandler<void (*)(tvm::runtime::RPCSession*, tvm::runtime::TVMArgs
, tvm::runtime::TVMRetValue*)>(void (*)(tvm::runtime::RPCSession*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*))+0xb2) [0x7f9d6ade3b32]
  [bt] (8) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::runtime::RPCEndpoint::EventHandler::HandleSyscall(tvm::runtime::RPCCode)+0x2c4) [0x7f9d6adda494]

Oh, I see the tutorial mentions the possibility of CUDA errors during tuning.

But this error was printed only once, at the end. Is that normal? And could it be the reason I got 0.00 GFLOPS for the task with ID 2?

Hi, here are some hints for your case:

  • As you mentioned, the error is expected due to some invalid schedules generated during the search. As long as the model compiles and executes successfully, you can ignore these errors.

  • The 0 FLOP/s for task ID 2 is due to its simple compute. You can take a look at the compute DAG printed at the beginning of tuning (also attached below). The compute DAG of task ID 2 has only simple ops and no complex ops such as Conv2D or Dense. A task with no complex ops usually has very low throughput (< 0.01 GFLOPS), so you may see 0 on the console. However, since these tasks won’t be the performance bottleneck, it’s usually fine to ignore them. The good news is that the task scheduler is aware of this as well: if you set a large number of trials, you should observe that it allocates only a few trials to these tasks and spends more time on the actual bottleneck tasks.

========== Task 2  (workload key: ["7de313da0ca29a8c63f647791692430d"]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 512]
tensor(ax0, ax1, ax2, ax3) += placeholder[ax0, ((ax1*7) + rv0), ((ax2*7) + rv1), ax3]
tensor(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3]/(float32((select((bool)1, ((ax1 + 1)*7), (((ax1 + 1)*7) + 1)) - (ax1*7)))*float32((select((bool)1, ((ax2 + 1)*7), (((ax2 + 1)*7) + 1)) - (ax2*7)))))
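As a rough illustration of why a near-zero (even negative-zero) speed can appear in the table, here is a plain-Python sketch of a "speed = estimated FLOPs / measured latency" style report. This is my own paraphrase, not the actual auto_scheduler printing code, and the idea that an unknown FLOP count is stored as a small negative sentinel such as -1 is an assumption I am making based on the -0.00 shown for task 2 above:

```python
def display_speed(flop_ct: float, latency_s: float) -> str:
    # Report speed as "estimated FLOPs / measured latency", printed in
    # GFLOPS with two decimals, as in the task scheduler table above.
    gflops = flop_ct / latency_s / 1e9
    return f"{gflops:.2f}"

# Hypothetical sentinel for "FLOP count unknown" (my assumption): a tiny
# negative value divided by any latency prints as "-0.00", matching task 2.
print(display_speed(-1.0, 0.003e-3))   # -> -0.00

# A real convolution task has a large FLOP count, so the same formula
# produces a meaningful number (roughly task 3's ~542 GFLOPS).
print(display_speed(1.42e8, 0.262e-3))
```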

Thanks, your tip helped me make sense of the tuning output. Indeed, task 2 seems to correspond to pooling:

========== Task 2  (workload key: ["7de313da0ca29a8c63f647791692430d"]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 512]
tensor(ax0, ax1, ax2, ax3) += placeholder[ax0, ((ax1*7) + rv0), ((ax2*7) + rv1), ax3]
tensor(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3]/(float32((select((bool)1, ((ax1 + 1)*7), (((ax1 + 1)*7) + 1)) - (ax1*7)))*float32((select((bool)1, ((ax2 + 1)*7), (((ax2 + 1)*7) + 1)) - (ax2*7)))))

I’m a bit surprised to see pooling or softmax being subject to tuning… Moreover, there is a constant EXTRACT_COMPLEX_TASK_ONLY https://github.com/apache/tvm/blob/main/python/tvm/auto_scheduler/relay_integration.py#L127 which is used by default. Does that mean softmax and pooling are considered “complex” ops?

Now it makes sense that the 0.00 GFLOPS result appears even in the official tutorial:

I’m sure some people will also be confused by this, so maybe we can add an explanation to the tutorial…

You’re basically right. This task was extracted because of my PR:

In this PR, I used a simple logic to judge whether a TE compute has at least one complex op:

https://github.com/apache/tvm/blob/main/src/relay/backend/compile_engine.cc#L162

As you may notice, using op_pattern to judge complex ops results in the pooling and softmax layers being extracted. I considered this problem too and finally adopted Lianmin’s opinion: since it is sometimes helpful to tune the softmax layer for a few trials, we should still extract it and let the task scheduler decide how many resources to allocate.
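To make that heuristic concrete, here is a plain-Python paraphrase of the check (a sketch only; the real logic is the C++ linked above, and the per-op pattern assignments below are illustrative rather than copied from TVM's op registrations):

```python
# Pattern levels mirroring tvm.relay.op.OpPattern (illustrative subset).
ELEMWISE, BROADCAST, INJECTIVE, COMM_REDUCE, OUT_ELEMWISE_FUSABLE = 0, 1, 2, 3, 4

# Illustrative per-op patterns: conv2d and avg_pool2d sit at the same
# "out-elemwise-fusable" level, which is why pooling gets extracted too.
OP_PATTERN = {
    "conv2d": OUT_ELEMWISE_FUSABLE,
    "dense": OUT_ELEMWISE_FUSABLE,
    "avg_pool2d": OUT_ELEMWISE_FUSABLE,
    "softmax": OUT_ELEMWISE_FUSABLE,
    "add": BROADCAST,
    "transpose": INJECTIVE,
    "reshape": INJECTIVE,
}

def has_complex_op(ops):
    # A task counts as "complex" if any op's pattern reaches the reduce level.
    return any(OP_PATTERN[op] >= COMM_REDUCE for op in ops)

print(has_complex_op(["avg_pool2d", "add"]))     # True: pooling is "complex"
print(has_complex_op(["transpose", "reshape"]))  # False: simple-ops-only task
```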

On the other hand, if you try include_simple_tasks, you can see which tasks are considered “simple” in the current implementation. They are basically just small ops like transpose and reshape.

Finally, I agree with you that we might need to improve the message and the tutorial to avoid future confusion. You are welcome to file a PR for that, or I can do it next week when I have time.


Thanks for the explanation. I think I understand the reason for introducing that change (e.g. avoiding hard-coding which ops to tune, so that other ops can be tuned transparently).

Maybe the real problem is that conv and dense share the same op pattern as pooling, softmax, etc. We could introduce a new op pattern to distinguish those ops, and extract only the ops with the new pattern (the ones we actually want to tune, such as conv).
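A minimal sketch of that idea, with a hypothetical new pattern level (HEAVY_COMPUTE does not exist in TVM; the value 5 and the op assignments here are assumptions purely for illustration):

```python
# Existing levels (mirroring tvm.relay.op.OpPattern) plus a hypothetical one.
COMM_REDUCE, OUT_ELEMWISE_FUSABLE = 3, 4
HEAVY_COMPUTE = 5  # hypothetical: reserved for conv2d, dense, and similar ops

OP_PATTERN = {
    "conv2d": HEAVY_COMPUTE,
    "dense": HEAVY_COMPUTE,
    "avg_pool2d": OUT_ELEMWISE_FUSABLE,
    "softmax": OUT_ELEMWISE_FUSABLE,
}

def should_extract(ops, threshold=HEAVY_COMPUTE):
    # Extract only tasks containing at least one op at or above the threshold;
    # lowering the threshold to COMM_REDUCE recovers today's behavior.
    return any(OP_PATTERN[op] >= threshold for op in ops)

print(should_extract(["conv2d", "softmax"]))  # True: contains a heavy op
print(should_extract(["avg_pool2d"]))         # False: pooling alone is skipped
```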