Question about the VTA paper

I’m reading the paper "A Hardware-Software Blueprint for Flexible Deep Learning Specialization", which is pretty amazing, even though I’d already read the previous paper.

I have a question about both the high-level ISA and the micro-op level.
It seems the Load high-level instruction has only an x-stride but no y-stride.
I guess this leads to some SRAM space inefficiency when the compute unit performs a convolution with a stride other than 1.
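
To make the question concrete, here is a minimal sketch (my own Python, not actual VTA code) of the 2D access pattern I understand the Load instruction to express: y_size rows of x_size contiguous elements, with x_stride as the jump between row starts in DRAM.

```python
def load_2d_addresses(base, y_size, x_size, x_stride):
    """Source addresses for what I understand to be VTA's 2D Load pattern:
    y_size rows of x_size contiguous elements, rows x_stride apart."""
    for row in range(y_size):
        for col in range(x_size):
            yield base + row * x_stride + col

# Elements within a row are always contiguous (there is no second stride),
# so for a stride-2 convolution the elements the kernel will skip still get
# pulled into SRAM, unless each useful run is fetched by a separate Load.
print(list(load_2d_addresses(base=0, y_size=2, x_size=4, x_stride=16)))
# [0, 1, 2, 3, 16, 17, 18, 19]
```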

Also, I’d like to know whether the VTA compute unit is designed to perform well when the channel dimension is innermost, or whether the data is transformed in some way, such as im2col. I wonder about this because the GEMM operation needs to load data sequentially from the register array.
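
For reference, this is the kind of channel-blocked packing I have in mind, as an illustrative NumPy sketch. It assumes VTA expects an NCHWnc-style blocked layout where the channel block is innermost, so the GEMM core can stream contiguous vectors; the block sizes here are made up for the example.

```python
import numpy as np

def pack_nchw(data, n_block, c_block):
    """Pack an NCHW tensor into a blocked (No, Co, H, W, n, c) layout so
    that the innermost axis is a contiguous channel block (illustrative)."""
    N, C, H, W = data.shape
    assert N % n_block == 0 and C % c_block == 0
    return (data
            .reshape(N // n_block, n_block, C // c_block, c_block, H, W)
            .transpose(0, 2, 4, 5, 1, 3))  # -> (No, Co, H, W, n, c)

x = np.arange(2 * 32 * 4 * 4).reshape(2, 32, 4, 4)
packed = pack_nchw(x, n_block=1, c_block=16)
print(packed.shape)  # (2, 2, 4, 4, 1, 16): innermost 16 channels contiguous
```

If that is roughly right, im2col would not be needed and the layout transform alone would make the GEMM reads sequential — is that the intended design?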