CL_INVALID_WORK_GROUP_SIZE error after auto-tuning for OpenCL on Android Device

After a long tuning run of 15 hours, the final test run with the best parameters found failed with a CL_INVALID_WORK_GROUP_SIZE error, and if I test again using the final compiled ".so" file, I get the same CL_INVALID_WORK_GROUP_SIZE error.

My Android Device Info:
device_id: 0
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 @ 0
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 @ 1
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 @ 2
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024

In opencl_module.cc's operator(), I added some logging to print the work-group information:

    for (cl_uint i = 0; i < work_dim; ++i) {
      // work_size[i] holds the grid extent and work_size[i + 3] the block (local)
      // extent; OpenCL expects the total global size, so multiply them before launch.
      wl.work_size[i] *= wl.work_size[i + 3];
      LOG(WARNING) << i << ".  global_work_size=" << wl.work_size[i] << " local_work_size:" << wl.work_size[i + 3];
    }

Below are two examples where I get the CL_INVALID_WORK_GROUP_SIZE error.

Error Example 1:

  1. global_work_size=7 local_work_size:7
  2. global_work_size=1 local_work_size:1
  3. global_work_size=16 local_work_size:16
    /android_rpc/app/src/main/jni/…/…/…/…/…/…/include/…/src/runtime/opencl/opencl_module.cc:88: Check failed: e == CL_SUCCESS OpenCL Error, code=-54: CL_INVALID_WORK_GROUP_SIZE

Error Example 2:

  1. global_work_size=7 local_work_size:7
  2. global_work_size=1 local_work_size:1
  3. global_work_size=256 local_work_size:8
    /android_rpc/app/src/main/jni/…/…/…/…/…/…/include/…/src/runtime/opencl/opencl_module.cc:88: Check failed: e == CL_SUCCESS OpenCL Error, code=-54:

I checked the clEnqueueNDRangeKernel API and the documented causes of CL_INVALID_WORK_GROUP_SIZE, and the work sizes look perfectly fine for my device:
https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html

The only thing I haven't checked yet is whether TVM uses __attribute__((reqd_work_group_size(X, Y, Z))) to declare the work-group size for kernels in the program source. Does TVM use reqd_work_group_size, and if so, where in the source code? I would like to add more debug info there to see whether this is the root cause.
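One way I could probably check this is to dump the generated OpenCL source on the host and grep for the attribute. A minimal sketch, assuming the NNVM build flow from the tutorial (net, shape_dict, params and target are placeholders for whatever the tutorial produced for mobilenet):

    import nnvm.compiler

    # Build as in tune_nnvm_mobile_gpu.py; net/shape_dict/params are placeholders.
    graph, lib, params = nnvm.compiler.build(
        net, target=target, shape=shape_dict, params=params)

    # The OpenCL kernels live in an imported module of the host module.
    opencl_src = lib.imported_modules[0].get_source()
    print("reqd_work_group_size present:",
          "reqd_work_group_size" in opencl_src)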

Any other suggestion is also welcomed!
Thanks in advance!

Can you share more details about what script you are using for autotuning and the resulting logs? Typically this scenario should never happen, as only configurations that run without error are considered. It could be that the current configuration space has no valid configs for the search, or that a fallback config is still being used for some reason.

I was just running tune_nnvm_mobile_gpu.py from the TVM tutorials, with mobilenet.

I also printed the failed kernel name; it is fuse_conv2d_broadcast_add_relu_kernel0.

Any clue?

Can you share the tuning log file that the run produced? We can quickly check if there were errors there.

Sure. I don't have the 15-hour run log anymore, but I can provide a log produced with n_trial = 5 (for quick debugging; I'll launch a new tuning run tonight with a larger n_trial).

Error log printed by my newly added debug info:

kernelName:fuse_conv2d_broadcast_add_relu_1_kernel0
0.  global_work_size=112 local_work_size:112
1.  global_work_size=28 local_work_size:2
2.  global_work_size=32 local_work_size:1

I found a warning at the end of tuning. Can this be the root cause?

WARNING:autotvm:Cannot find config for target=opencl -device=adreno, workload=('depthwise_conv2d_nchw', (1, 32, 112, 112, 'float32'), (32, 1, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

The TVM tuning log file:

{"i": ["opencl -device=adreno", "topi_nn_dense", [["TENSOR", [1, 1024], "float32"], ["TENSOR", [1000, 1024], "float32"], null], {}, null, {"i": 0, "t": "winograd", "c": null, "e": []}], "r": [[0.0015796354], 0, 0.6794140338897705, 1551918925.7954354], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_conv2d", [["TENSOR", [1, 1024, 7, 7], "float32"], ["TENSOR", [1024, 1024, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 1024, 7, 7, "float32"], [1024, 1024, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 7006, "t": "direct", "c": null, "e": [["tile_f", "sp", [4, 16, 4, 4]], ["tile_y", "sp", [7, 1, 1, 1]], ["tile_x", "sp", [1, 1, 7, 1]], ["tile_rc", "sp", [512, 2]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 0]]}], "r": [[0.22823480200000001], 0, 65.10172700881958, 1551919113.8254087], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 1024, 7, 7], "float32"], ["TENSOR", [1024, 1, 3, 3], "float32"], [1, 1], [1, 1], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 1024, 7, 7, "float32"], [1024, 1, 3, 3, "float32"], [1, 1], [1, 1], [1, 1], "float32"], {"i": 26620, "t": "direct", "c": null, "e": [["tile_f", "sp", [128, 2, 4, 1]], ["tile_y", "sp", [1, 7, 1, 1]], ["tile_x", "sp", [1, 1, 1, 7]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}], "r": [[0.0233335468], 0, 18.761876344680786, 1551919133.5696566], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_conv2d", [["TENSOR", [1, 512, 7, 7], "float32"], ["TENSOR", [1024, 512, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 512, 7, 7, "float32"], [1024, 512, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 237835, "t": "direct", "c": null, "e": [["tile_f", "sp", [16, 8, 1, 8]], ["tile_y", "sp", [1, 1, 1, 7]], ["tile_x", "sp", [1, 1, 1, 7]], ["tile_rc", "sp", [256, 2]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}], "r": [[0.8051178643], 0, 31.408406734466553, 1551919165.967637], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 512, 14, 14], "float32"], ["TENSOR", [512, 1, 3, 3], "float32"], [2, 2], [1, 1], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 512, 14, 14, "float32"], [512, 1, 3, 3, "float32"], [2, 2], [1, 1], [1, 1], "float32"], {"i": 995, "t": "direct", "c": null, "e": [["tile_f", "sp", [32, 1, 4, 4]], ["tile_y", "sp", [7, 1, 1, 1]], ["tile_x", "sp", [1, 7, 1, 1]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 0]]}], "r": [[0.028221802], 0, 1.5789644718170166, 1551919169.0258257], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_conv2d", [["TENSOR", [1, 512, 14, 14], "float32"], ["TENSOR", [512, 512, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 512, 14, 14, "float32"], [512, 512, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 1147973, "t": "direct", "c": null, "e": [["tile_f", "sp", [32, 8, 2, 1]], ["tile_y", "sp", [2, 7, 1, 1]], ["tile_x", "sp", [2, 1, 7, 1]], ["tile_rc", "sp", [512, 1]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]]}], "r": [[0.2915327811], 0, 74.13661241531372, 1551919250.1008255], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 512, 14, 14], "float32"], ["TENSOR", [512, 1, 3, 3], "float32"], [1, 1], [1, 1], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 512, 14, 14, "float32"], [512, 1, 3, 3, "float32"], [1, 1], [1, 1], [1, 1], "float32"], {"i": 261820, "t": "direct", "c": null, "e": [["tile_f", "sp", [64, 2, 4, 1]], ["tile_y", "sp", [2, 1, 7, 1]], ["tile_x", "sp", [1, 7, 1, 2]], ["auto_unroll_max_step", "ot", 256], ["unroll_explicit", "ot", 1]]}], "r": [[0.0127529895], 0, 3.613858222961426, 1551919256.5462072], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_conv2d", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [512, 256, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 14, 14, "float32"], [512, 256, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 1377259, "t": "direct", "c": null, "e": [["tile_f", "sp", [16, 16, 1, 2]], ["tile_y", "sp", [7, 1, 2, 1]], ["tile_x", "sp", [1, 2, 7, 1]], ["tile_rc", "sp", [4, 64]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]]}], "r": [[0.1201739687], 0, 21.88442587852478, 1551919279.8842967], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 256, 28, 28], "float32"], ["TENSOR", [256, 1, 3, 3], "float32"], [2, 2], [1, 1], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 256, 28, 28, "float32"], [256, 1, 3, 3, "float32"], [2, 2], [1, 1], [1, 1], "float32"], {"i": 59948, "t": "direct", "c": null, "e": [["tile_f", "sp", [64, 1, 2, 2]], ["tile_y", "sp", [1, 1, 7, 2]], ["tile_x", "sp", [2, 1, 7, 1]], ["auto_unroll_max_step", "ot", 256], ["unroll_explicit", "ot", 0]]}], "r": [[0.007445703200000001], 0, 0.7793385982513428, 1551919322.245936], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_conv2d", [["TENSOR", [1, 256, 28, 28], "float32"], ["TENSOR", [256, 256, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 28, 28, "float32"], [256, 256, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 8375960, "t": "direct", "c": null, "e": [["tile_f", "sp", [1, 32, 4, 2]], ["tile_y", "sp", [4, 7, 1, 1]], ["tile_x", "sp", [1, 1, 7, 4]], ["tile_rc", "sp", [16, 16]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 1]]}], "r": [[0.9702054840000001], 0, 31.55023217201233, 1551919354.8292158], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 256, 28, 28], "float32"], ["TENSOR", [256, 1, 3, 3], "float32"], [1, 1], [1, 1], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 256, 28, 28, "float32"], [256, 1, 3, 3, "float32"], [1, 1], [1, 1], [1, 1], "float32"], {"i": 1127824, "t": "direct", "c": null, "e": [["tile_f", "sp", [8, 16, 1, 2]], ["tile_y", "sp", [1, 1, 4, 7]], ["tile_x", "sp", [7, 1, 4, 1]], ["auto_unroll_max_step", "ot", 256], ["unroll_explicit", "ot", 1]]}], "r": [[0.0374450677], 0, 7.278401136398315, 1551919363.329563], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_conv2d", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [256, 128, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 28, 28, "float32"], [256, 128, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 217071, "t": "direct", "c": null, "e": [["tile_f", "sp", [4, 4, 4, 4]], ["tile_y", "sp", [1, 1, 4, 7]], ["tile_x", "sp", [1, 4, 1, 7]], ["tile_rc", "sp", [128, 1]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 0]]}], "r": [[0.349284354], 0, 11.665324687957764, 1551919376.1536279], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 128, 56, 56], "float32"], ["TENSOR", [128, 1, 3, 3], "float32"], [2, 2], [1, 1], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 128, 56, 56, "float32"], [128, 1, 3, 3, "float32"], [2, 2], [1, 1], [1, 1], "float32"], {"i": 33828, "t": "direct", "c": null, "e": [["tile_f", "sp", [1, 2, 4, 16]], ["tile_y", "sp", [14, 2, 1, 1]], ["tile_x", "sp", [7, 2, 2, 1]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 0]]}], "r": [[0.051596276100000005], 0, 2.215299606323242, 1551919396.7941692], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_conv2d", [["TENSOR", [1, 128, 56, 56], "float32"], ["TENSOR", [128, 128, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 56, 56, "float32"], [128, 128, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 3150382, "t": "direct", "c": null, "e": [["tile_f", "sp", [8, 2, 8, 1]], ["tile_y", "sp", [1, 28, 2, 1]], ["tile_x", "sp", [28, 1, 2, 1]], ["tile_rc", "sp", [8, 16]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 0]]}], "r": [[0.1357756562], 0, 4.70797061920166, 1551919406.3744864], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 128, 56, 56], "float32"], ["TENSOR", [128, 1, 3, 3], "float32"], [1, 1], [1, 1], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 128, 56, 56, "float32"], [128, 1, 3, 3, "float32"], [1, 1], [1, 1], [1, 1], "float32"], {"i": 1377910, "t": "direct", "c": null, "e": [["tile_f", "sp", [16, 1, 2, 4]], ["tile_y", "sp", [4, 1, 7, 2]], ["tile_x", "sp", [1, 4, 2, 7]], ["auto_unroll_max_step", "ot", 256], ["unroll_explicit", "ot", 0]]}], "r": [[0.0433779843], 0, 2.1917519569396973, 1551919466.5734634], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_conv2d", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [128, 64, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [128, 64, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 7564055, "t": "direct", "c": null, "e": [["tile_f", "sp", [2, 2, 4, 8]], ["tile_y", "sp", [2, 1, 2, 14]], ["tile_x", "sp", [7, 1, 1, 8]], ["tile_rc", "sp", [16, 4]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 0]]}], "r": [[0.102741901], 0, 60.07958197593689, 1551919527.946826], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 64, 112, 112], "float32"], ["TENSOR", [64, 1, 3, 3], "float32"], [2, 2], [1, 1], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 64, 112, 112, "float32"], [64, 1, 3, 3, "float32"], [2, 2], [1, 1], [1, 1], "float32"], {"i": 1819778, "t": "direct", "c": null, "e": [["tile_f", "sp", [16, 4, 1, 1]], ["tile_y", "sp", [2, 1, 4, 7]], ["tile_x", "sp", [28, 1, 1, 2]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 1]]}], "r": [[0.1589000572], 0, 5.527170419692993, 1551919548.8157806], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_conv2d", [["TENSOR", [1, 32, 112, 112], "float32"], ["TENSOR", [64, 32, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 32, 112, 112, "float32"], [64, 32, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 21574337, "t": "direct", "c": null, "e": [["tile_f", "sp", [16, 2, 1, 2]], ["tile_y", "sp", [7, 4, 1, 4]], ["tile_x", "sp", [7, 8, 2, 1]], ["tile_rc", "sp", [16, 2]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]]}], "r": [[1.04930626], 0, 33.70065188407898, 1551919583.686777], "v": 0.1}
{"i": ["opencl -device=adreno", "topi_nn_conv2d", [["TENSOR", [1, 3, 224, 224], "float32"], ["TENSOR", [32, 3, 3, 3], "float32"], [2, 2], [1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 3, 224, 224, "float32"], [32, 3, 3, 3, "float32"], [2, 2], [1, 1], [1, 1], "NCHW", "float32"], {"i": 15002129, "t": "direct", "c": null, "e": [["tile_f", "sp", [2, 8, 2, 1]], ["tile_y", "sp", [28, 1, 1, 4]], ["tile_x", "sp", [16, 1, 1, 7]], ["tile_rc", "sp", [1, 3]], ["tile_ry", "sp", [3, 1]], ["tile_rx", "sp", [1, 3]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 0]]}], "r": [[0.7321428955], 0, 23.286165952682495, 1551919609.88036], "v": 0.1}

One of my questions is why TVM fails after tuning.

The other question is why a small work_size like the one below can cause a CL_INVALID_WORK_GROUP_SIZE error. Does TVM set __attribute__((reqd_work_group_size(X, Y, Z))) when cross-compiling OpenCL for Android?

  1. global_work_size=7 local_work_size:7
  2. global_work_size=1 local_work_size:1
  3. global_work_size=16 local_work_size:16

Could you help with the second question?

If there is a warning at the end of tuning that looks like that, then it suggests that there is a workload that failed to find a valid configuration during tuning. How many trials are you allocating per workload/task? We could also verify this by looking at the entire tuning log (the temporary one rather than the one that only contains the best config for each task). The number of lines in the log file should match the number of “tasks” extracted by AutoTVM.

As far as I know, the work-group size is a device-dependent parameter in OpenCL that the OpenCL runtime itself enforces. I do not think TVM sets this.

To summarize, I think the potential solutions are:

  1. See if using more trials (searching more configs) during autotuning will yield a valid config; the sketch after this list shows the relevant knobs in the tutorial.
  2. Modify the template if there are indeed no valid configurations for this hardware device.
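For reference, the per-task search budget in the tutorial is controlled by the tuning_option dict; a rough sketch follows (the exact keys and values depend on your TVM version and setup, and the device key, tracker host and port are placeholders):

    from tvm import autotvm

    # Sketch of the knobs in tune_nnvm_mobile_gpu.py that control the search budget.
    tuning_option = {
        'log_filename': 'mobilenet.opencl.log',
        'tuner': 'xgb',
        'n_trial': 2000,          # try many more configs per task
        'early_stopping': 600,    # do not cut the search off too early
        'measure_option': autotvm.measure_option(
            builder=autotvm.LocalBuilder(build_func='ndk'),
            runner=autotvm.RPCRunner('android', host='0.0.0.0', port=9190,
                                     number=10, timeout=5),
        ),
    }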

Further investigation shows that CL_KERNEL_WORK_GROUP_SIZE is the root cause of "small work_size can still cause a CL_INVALID_WORK_GROUP_SIZE error".

CL_KERNEL_WORK_GROUP_SIZE is a per-kernel value that depends on the resources (e.g. the number of general-purpose registers) the kernel needs, so it is not a fixed value. For the failing kernel, when I print CL_KERNEL_WORK_GROUP_SIZE it is indeed smaller than the requested work-group size.
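For anyone hitting the same thing, here is a minimal sketch of the checks the runtime is effectively applying; the limits below are placeholders, on a real device they come from clGetDeviceInfo and clGetKernelWorkGroupInfo:

    def validate_launch(global_size, local_size, kernel_wg_limit,
                        device_wg_limit=1024, max_item_sizes=(1024, 1024, 1024)):
        """Return the reasons a launch would fail with CL_INVALID_WORK_GROUP_SIZE."""
        problems = []
        for g, l, m in zip(global_size, local_size, max_item_sizes):
            # Each global size must be a multiple of the corresponding local size.
            if g % l != 0:
                problems.append("global %d not divisible by local %d" % (g, l))
            if l > m:
                problems.append("local %d exceeds CL_DEVICE_MAX_WORK_ITEM_SIZES %d" % (l, m))
        # The total work-group size is limited by the device *and* by the kernel.
        total = 1
        for l in local_size:
            total *= l
        if total > device_wg_limit:
            problems.append("%d exceeds CL_DEVICE_MAX_WORK_GROUP_SIZE %d" % (total, device_wg_limit))
        if total > kernel_wg_limit:
            problems.append("%d exceeds CL_KERNEL_WORK_GROUP_SIZE %d" % (total, kernel_wg_limit))
        return problems

    # Error example 1 above launches 7 * 1 * 16 = 112 threads per work-group.
    # If clGetKernelWorkGroupInfo reports, say, 64 for this kernel, the launch is
    # invalid even though it is far below the device-wide limit of 1024.
    print(validate_launch((7, 1, 16), (7, 1, 16), kernel_wg_limit=64))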

When I use a larger n_trial, the failing kernel changes. For example, with n_trial = 10 the tuned model fails on fuse_conv2d_broadcast_add_relu_1_kernel0; with a larger n_trial it fails on fuse_conv2d_broadcast_add_relu_5_kernel0 or fuse_conv2d_broadcast_add_relu_10_kernel0, etc.

So I think not giving a large enough n_trial may be the problem, and I changed it to a larger value of 2000. But then a new problem appeared: the TVM RPC client crashed after a 20+ hour run, and the tuning program also terminated with an error message along the lines of "could not find device".
My question is: is there a way to checkpoint the tuning progress and restart from where it failed or was killed? If not, where in the code should I dig in to add that functionality? It would really help me, as the TVM RPC app is not stable on my device.

Thanks in advance!

Currently supporting full pause and resume is a little involved, as it would require saving and restoring the state of the cost model and SA optimizer. This is doable, but you can also just see if adding your own logic to the RPCRunner (e.g., https://github.com/dmlc/tvm/blob/695647db94ca9dca1661c5d41275ef4a86489dc4/python/tvm/autotvm/measure/measure_methods.py#L193) would work well enough for you. Unfortunately in either case this requires you to periodically check the status of your devices.

Thank you. I think a simpler solution for me is to ignore the "Cannot get remote devices from the tracker" exception/check and have the tuning process sleep and re-check for remote devices every 5 or 10 seconds.

Basically, in measure_methods.py around line 190, in set_task(), I'll repeatedly check for the remote device. That way, whenever my TVM RPC app crashes it will not terminate my tuning process, and once the app restarts and is back online the tuning can continue. A rough sketch is below.
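Just a sketch of what I have in mind; it assumes the existing check_remote helper in measure_methods.py and that RPCRunner keeps its key/host/port attributes (names may differ slightly between TVM versions):

    import logging
    import time

    logger = logging.getLogger('autotvm')

    def wait_for_remote(target, device_key, host, port, interval=10):
        """Block until the tracker reports a free device, instead of raising."""
        while not check_remote(target, device_key, host, port):
            logger.warning("Cannot get remote devices from the tracker; "
                           "retrying in %d seconds...", interval)
            time.sleep(interval)
        logger.info("Got devices for measurement successfully!")

    # In RPCRunner.set_task(), replace the hard failure with something like:
    #     wait_for_remote(self.task.target, self.key, self.host, self.port)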

Any concerns with this solution?

I think that could work, but you will need someone to restart the App on the phone if it does crash. The only thing I can think of is that there may be a timeout mechanism built into one of the runner/executor functions, so I would do a “simulated failure” experiment to see if your modification works:

  1. start tuning
  2. at some point just disconnect/kill the RPC server
  3. verify that the tuning waits and then resumes, instead of exiting or producing measurement timeout errors

Finally, if you think you can contribute information/improvements on the RPC app, please let us know, as we have already spent a good amount of time trying to improve its stability.

An update on the GPU tuning with OpenCL:

  1. I'm not sure whether this is a bug, but when I use a very small early_stopping number (e.g. 10 or 20), we no longer get the CL_INVALID_WORK_GROUP_SIZE error; when I set it to a slightly larger number (e.g. 50+), the error comes back.
  2. GPU latency is much higher than the ARM CPU tuning result: GFLOPS during tuning is always below 10, while the ARM CPU's GFLOPS is always above 20. The final latency for the ARM CPU reaches 13 ms, but the GPU is around 100-200 ms (maybe because of issue 1 we could not set a large enough early_stopping number, but single-operator tuning still shows poor performance for GPU tuning with OpenCL). Is this normal (CPU performance better than GPU), or is there something I haven't set up correctly?
  1. It is not a bug if you only see it intermittently during tuning. Using a larger number of trials can cause this to occur just by chance (you are trying more schedule configurations). It is a bug if you see this error after tuning.

  2. You need to tune with many more trials (e.g., 1000-2000) to compare the results. The final performance will also depend on the exact GPU vs. CPU. On mobile devices, the differences are not as clear cut as on the desktop, as big ARM cores are quite fast relative to Mali GPUs.

Thanks. We see this error after tuning (meaning: if we finish tuning with a very small early_stopping number we are less likely to see this error, but with a large early_stopping number we always see it).
And because of the issue above, setting many more trials doesn't help, since tuning will stop early anyway.

From the tuning log, I do see that each function involved in auto-tuning gets a valid tuning result, but the final compiled ".so" file still has this issue. Could it be that an invalid kernel is generated when tuned and untuned operators are fused after tuning?

It would be great if we could isolate the issue. Can you reproduce it with just a single operator? If so, it would also be useful to inspect the logs for both the small-early_stopping and the large-early_stopping case.
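For the log inspection, something like this should work; a sketch using AutoTVM's record reader (the file names are placeholders):

    from tvm import autotvm

    def summarize(log_file):
        """Print measurement errors and the best timing per workload in an AutoTVM log."""
        best = {}
        for inp, res in autotvm.record.load_from_file(log_file):
            wkl = inp.task.workload
            if res.error_no == 0:  # 0 means the measurement succeeded
                cost = sum(res.costs) / len(res.costs)
                best[wkl] = min(best.get(wkl, float("inf")), cost)
            else:
                print("error %d for workload %s" % (res.error_no, wkl))
        for wkl, cost in best.items():
            print("%s: best %.6f s" % (wkl, cost))

    summarize("mobilenet_small_early_stopping.log")
    summarize("mobilenet_large_early_stopping.log")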