Thanks for the help. With this approach, though, I hit the classic error where the computation ends up outside any thread-bound GPU loop:
Did you forget to bind?
Variable `placeholder` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
Variable `placeholder` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
Variable `T_dense` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
Variable `T_dense` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
Variable `T_dense` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
File "../src/tir/analysis/verify_memory.cc", line 202
RuntimeError: Memory verification failed with the following errors:
PrimFunc([placeholder, placeholder, T_dense]) attrs={"global_symbol": "fused_nn_dense_73", "tir.noalias": (bool)1, "target": opencl -keys=mali,opencl,gpu -device=mali -max_num_threads=256 -thread_warp_size=1} {
T_dense[0] = 0f
for (k, 0, 768) {
T_dense[0] = (T_dense[0] + (placeholder[k]*placeholder[k]))
}
}
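
For reference, this is how I read the dumped function, written out in plain Python. It is only a sketch: I am assuming the two `placeholder` arguments are distinct buffers (the data and the weight), and the names `fused_nn_dense`, `data`, and `weight` are mine, not TVM's. What it shows is that the whole body is a single 768-element reduction into `T_dense[0]`, with no loop bound to a GPU thread or block, which seems to be what the memory verifier is complaining about.

```python
def fused_nn_dense(data, weight):
    """Plain-Python reading of the dumped PrimFunc body (hypothetical names)."""
    t_dense = [0.0]                      # T_dense[0] = 0f
    for k in range(768):                 # for (k, 0, 768)
        # T_dense[0] = T_dense[0] + placeholder[k] * placeholder[k]
        # Assumption: the two placeholders are different buffers.
        t_dense[0] = t_dense[0] + data[k] * weight[k]
    return t_dense


# Small usage example with made-up inputs.
data = [1.0] * 768
weight = [0.5] * 768
result = fused_nn_dense(data, weight)
print(result)
```

So the output is effectively a 1x1 dense result, and nothing in the loop nest runs inside a thread environment.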