[AUTOSCHEDULER][CUDA] Poor performance for subtraction kernel

I am getting poor performance (in terms of schedule efficiency) from the autoscheduler for a simple kernel that subtracts consecutive slices along the first axis:

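 from tvm import te
 def subtract(N, H, W):  # function name is illustrative
     # `in` is a reserved word in Python, so the input tensor is bound to `inp`
     inp = te.placeholder((N, H, W), dtype='float')
     out = te.compute((N-1, H, W), lambda n, y, x: inp[n+1, y, x] - inp[n, y, x])
     return [inp, out]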

Example (TIR) of the schedule “found” for one particular set of input sizes (N=6, H=2160, W=3840):

primfn(in_1: handle, out_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {out: Buffer(out_2: Pointer(float32), float32, [5, 2160, 3840], []),
             in: Buffer(in_2: Pointer(float32), float32, [6, 2160, 3840], [])}
  buffer_map = {in_1: in, out_1: out} {
  attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 648000;
  attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
  out_2[((blockIdx.x*64) + threadIdx.x)] = ((float32*)in_2[(((blockIdx.x*64) + threadIdx.x) + 8294400)] - (float32*)in_2[((blockIdx.x*64) + threadIdx.x)])
}
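
For reference, the index arithmetic in this dump can be checked with a small pure-Python sketch. The sizes and constants are copied from the dump; the helper names `flat_index` and `kernel_indices` are illustrative and not part of TVM. It only confirms that the schedule assigns one thread per output element and that the constant 8294400 is exactly one H*W slice, so each thread reads `in[n+1, y, x]` and `in[n, y, x]`.

```python
# Sizes read from the buffer shapes in the TIR dump.
N, H, W = 6, 2160, 3840
BLOCKS, THREADS = 648000, 64   # blockIdx.x and threadIdx.x extents
SLICE_OFFSET = 8294400         # constant added to the first read index


def flat_index(n, y, x):
    """Row-major offset of element (n, y, x) in a buffer of shape (_, H, W)."""
    return (n * H + y) * W + x


def kernel_indices(block, thread):
    """Flat indices (out, in_hi, in_lo) touched by one GPU thread."""
    i = block * THREADS + thread
    return i, i + SLICE_OFFSET, i


out_elems = (N - 1) * H * W
covers_output = BLOCKS * THREADS == out_elems        # one thread per output
offset_is_one_slice = SLICE_OFFSET == flat_index(1, 0, 0)
print(covers_output, offset_is_one_slice)
```

In other words, the schedule is a flat 1-D launch with 64 threads per block and one element per thread, with no vectorization or per-thread reuse of the overlapping input slices.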

The CUDA kernel for this schedule takes 1250 us on my GPU, while a simple handmade kernel takes 950 us. Is that expected, or can I change the tuning options or the operator “phrasing” to get better performance?
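
To put the two timings in perspective, here is a rough effective-bandwidth estimate in Python, assuming the kernel is memory-bound. The traffic model is an assumption, not a measurement: the lower bound counts each input and output element once (perfect cache reuse of the shared slices), and the upper bound counts both reads of every output element as DRAM traffic. Under that model the autoscheduled kernel lands at roughly 292 to 398 GB/s and the handmade one at roughly 384 to 524 GB/s, which can be compared against the peak memory bandwidth of the GPU in use.

```python
BYTES_PER_FLOAT = 4
N, H, W = 6, 2160, 3840  # sizes from the TIR dump


def effective_bandwidth_gbs(bytes_moved, time_us):
    """Convert bytes moved in time_us microseconds to GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / (time_us * 1e-6) / 1e9


# Lower bound: every input element read once, every output element written once.
min_bytes = (N * H * W + (N - 1) * H * W) * BYTES_PER_FLOAT
# Upper bound: two reads and one write per output element, no cache reuse.
max_bytes = 3 * (N - 1) * H * W * BYTES_PER_FLOAT

for label, time_us in (("autoscheduler", 1250), ("handmade", 950)):
    low = effective_bandwidth_gbs(min_bytes, time_us)
    high = effective_bandwidth_gbs(max_bytes, time_us)
    print(f"{label}: {low:.0f} to {high:.0f} GB/s")
```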

TVM commit: 10fca9c