Zero copy between CPU and OpenCL?

When I use OpenCL as the backend on Intel Graphics, it takes ~4 ms to copy data from the CPU to the GPU, so the whole inference process takes 8 ms longer. How can I reduce this cost? I want to know whether the TVMArrayAlloc function uses any tricks such as zero copy (Getting the Most from OpenCL™ 1.2: How to Increase Performance by...).
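For context, the zero-copy path for Intel Graphics is usually described as follows: a host buffer passed to clCreateBuffer with CL_MEM_USE_HOST_PTR can be shared without a copy when its address is aligned to 4096 bytes and its size is a multiple of 64 bytes. The sketch below only shows the host-side allocation under those two assumed conditions. The helper names (`round_up`, `alloc_zero_copy_candidate`, `is_zero_copy_candidate`) are made up for illustration and are not part of TVM or OpenCL, and whether TVMArrayAlloc does anything like this is exactly the open question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Assumed requirements for a zero-copy host buffer on Intel Graphics.
constexpr std::size_t kPageAlign = 4096;   // address alignment in bytes
constexpr std::size_t kSizeMultiple = 64;  // size granularity in bytes

// Round n up to the next multiple of `multiple` (multiple must be > 0).
std::size_t round_up(std::size_t n, std::size_t multiple) {
  assert(multiple > 0);
  return (n + multiple - 1) / multiple * multiple;
}

// Hypothetical helper: allocate a 4096-byte-aligned block large enough for
// `bytes`. The size is padded to a whole number of pages because
// std::aligned_alloc requires the size to be a multiple of the alignment;
// a whole page is also a multiple of 64. Release the block with std::free.
void* alloc_zero_copy_candidate(std::size_t bytes) {
  const std::size_t padded = round_up(bytes == 0 ? 1 : bytes, kPageAlign);
  return std::aligned_alloc(kPageAlign, padded);
}

// Check the two assumed conditions for an existing pointer and size.
bool is_zero_copy_candidate(const void* p, std::size_t bytes) {
  return reinterpret_cast<std::uintptr_t>(p) % kPageAlign == 0 &&
         bytes % kSizeMultiple == 0;
}
```

A buffer from `alloc_zero_copy_candidate` would then be the one handed to OpenCL as the host pointer, instead of memory from a plain `malloc`.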

my code:

int device = kDLOpenCL;

tvm::runtime::Module mod = (*tvm::runtime::Registry::Get("tvm.graph_executor.create"))(json_data, mod_syslib, device, device_id);

constexpr int dtype_code = kDLFloat;
constexpr int dtype_lanes = 1;

constexpr int in_ndim = 4;
constexpr int dtype_bits_in = 32;

constexpr int out_dim = 4;
constexpr int dtype_bits_out = 16;

int64_t in_shape[in_ndim] = {1, 1, InImgHeight, InImgWidth};
int64_t out_shape[out_dim] = {1, 1, OutImgHeight, OutImgWidth};

TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits_in, dtype_lanes, kDLCPU, device_id, &input);
TVMArrayAlloc(out_shape, out_dim, dtype_code, dtype_bits_out, dtype_lanes, kDLCPU, device_id, &output);

set input:

setInputData_sse(in, InImgWidth, InImgHeight, (float*)(input->data));

run inference:

I found that the data copies (CPU -> OpenCL and OpenCL -> CPU) happen automatically:

tvm::runtime::PackedFunc set_input = mod->GetFunction("set_input");
set_input("input0", input);

tvm::runtime::PackedFunc run = mod->GetFunction("run");

run();

tvm::runtime::PackedFunc get_output = mod->GetFunction("get_output");
get_output(0, output);
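To see which of the three stages above actually accounts for the copy time, each call can be wrapped in a small wall-clock timer. This is a minimal sketch using only the standard library. `time_ms` is a hypothetical helper, not a TVM function, and if the runtime queues device work asynchronously, part of one stage's cost may show up in a later stage.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: run `stage` exactly once and return the elapsed
// wall-clock time in milliseconds.
double time_ms(const std::function<void()>& stage) {
  assert(stage && "stage must be callable");
  const auto start = std::chrono::steady_clock::now();
  stage();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

With the code above, usage would look like `double t_in = time_ms([&] { set_input("input0", input); });`, and likewise for `run()` and `get_output(0, output)`.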

output: