A typical OpenCL kernel looks like:
__kernel void helloworld(__global char* in, __global char* out)
{
    int num = get_global_id(0);
    out[num] = in[num] + 1;
}
Here, get_global_id(0) returns the index of the current work-item along global dimension 0, and the kernel uses the available hardware threads to compute along that dimension.
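To make the execution model concrete, here is a rough sketch in plain C of what a 1-D launch of that kernel amounts to. It is only a sequential emulation: the names helloworld_item and launch_helloworld are made up for illustration and are not part of OpenCL, and a real runtime would run the iterations in parallel on hardware threads instead of looping.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel body: one work-item handles the
 * single element at index gid (what get_global_id(0) would return). */
void helloworld_item(size_t gid, const char *in, char *out)
{
    out[gid] = (char)(in[gid] + 1);
}

/* Sequential emulation of a 1-D launch with global size n.
 * Assumption: an actual OpenCL runtime distributes these iterations
 * across hardware threads; the loop only shows the index space. */
void launch_helloworld(size_t n, const char *in, char *out)
{
    for (size_t gid = 0; gid < n; ++gid) {
        helloworld_item(gid, in, out);
    }
}
```

Calling launch_helloworld(4, in, out) with in holding 'a', 'b', 'c', 'd' fills out with 'b', 'c', 'd', 'e', one element per emulated work-item.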
In addition, OpenCL was originally designed for general-purpose computing, whereas VTA's design is domain-specific, so I think bridging the OpenCL software stack onto the VTA hardware design would raise a lot of issues and would degrade actual performance.