When I compile an fp16 model to SPIR-V, codegen hits a fatal error at codegen_spirv.cc:453: “Only aligned continuous vector access is allowed in SPIRV”
The check fails because a TIR LoadNode is neither a scalar load nor a vectorized load with a valid RampNode index. The error occurs in the “fused_image_resize_kernel0” kernel. I dumped its TIR, but I'm not sure whether this is a TIR problem or a SPIR-V codegen bug. Can someone tell from the following TIR dump?
fused_image_resize_kernel0:
```
#[version = "0.0.5"]
primfn(resize: Pointer(float16x4), placeholder: Pointer(float16x4)) -> ()
  attr = {"target": meta[Target][0], "tir.noalias": 1, "global_symbol": "fused_image_resize_kernel0", "tir.device_thread_axis": [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x"), IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")], "calling_conv": 2} {
  attr [IterVar(blockIdx.x, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = 256;
  attr [IterVar(threadIdx.x, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 256;
  for (i0.i1.fused.i2.fused.i3.fused.outer.outer: int32, 0, 8) {
    resize[ramp(((((i0.i1.fused.i2.fused.i3.fused.outer.outer*262144) + (@tir.shift_right(((blockIdx.x*1024) + (threadIdx.x*4)), 15, dtype=int32)*32768)) + (@tir.bitwise_and(((blockIdx.x*4) + @tir.shift_right(threadIdx.x, 6, dtype=int32)), 127, dtype=int32)*256)) + (@tir.bitwise_and(threadIdx.x, 63, dtype=int32)*4)), 1, 4)] = cast(float16x4, cast(float32x4, (float16x4)placeholder[((broadcast(((i0.i1.fused.i2.fused.i3.fused.outer.outer*65536) + (@tir.shift_right(((blockIdx.x*1024) + (threadIdx.x*4)), 15, dtype=int32)*8192)), 4) + (max(min(cast(int32x4, @tir.call_spirv_pure_glsl450(8u32, ((broadcast(0.5f32, 4)*cast(float32x4, broadcast(@tir.bitwise_and(((blockIdx.x*4) + @tir.shift_right(threadIdx.x, 6, dtype=int32)), 127, dtype=int32), 4))) + broadcast(1e-05f32, 4)), dtype=float32x4)), broadcast(63, 4)), broadcast(0, 4))*broadcast(128, 4))) + max(min(cast(int32x4, @tir.call_spirv_pure_glsl450(8u32, ((broadcast(0.5f32, 4)*cast(float32x4, ramp((@tir.bitwise_and(threadIdx.x, 63, dtype=int32)*4), 1, 4))) + broadcast(1e-05f32, 4)), dtype=float32x4)), broadcast(127, 4)), broadcast(0, 4)))]))
  }
}
```
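One way to narrow down which access trips the check is to evaluate both index expressions from the dump on concrete thread coordinates. The plain-Python sketch below (not TVM code) does that by hand. The helper names are hypothetical; `is_aligned_contiguous` is only a simplified, concrete-value model of the condition in the error message (the real check in codegen_spirv.cc works on the symbolic index, and the alignment rule here is an assumption); and opcode 8 of `call_spirv_pure_glsl450` is taken to be GLSL.std.450 Floor.

```python
import math

def is_aligned_contiguous(lanes):
    # Simplified, concrete-value model of the rule in the error message: a
    # vector access must touch base, base+1, ..., base+n-1 (a stride-1 ramp).
    # ASSUMPTION: "aligned" is modeled as base being a multiple of n.
    n = len(lanes)
    if n == 1:
        return True  # a scalar access is always accepted
    base = lanes[0]
    return all(lanes[j] == base + j for j in range(n)) and base % n == 0

def clamp(v, lo, hi):
    # Mirrors max(min(v, hi), lo) in the dump.
    return max(min(v, hi), lo)

def nearest_src(i):
    # floor(0.5 * i + 1e-5); ASSUMPTION: glsl450 opcode 8 is Floor.
    return math.floor(0.5 * i + 1e-5)

def store_lanes(outer, bx, tx):
    # Index of the store into resize: ramp(base, 1, 4).
    base = (outer * 262144
            + ((bx * 1024 + tx * 4) >> 15) * 32768
            + ((bx * 4 + (tx >> 6)) & 127) * 256
            + (tx & 63) * 4)
    return [base + j for j in range(4)]

def load_lanes(outer, bx, tx):
    # Index of the load from placeholder: broadcast(base) + row term + column term.
    base = outer * 65536 + ((bx * 1024 + tx * 4) >> 15) * 8192
    src_row = clamp(nearest_src((bx * 4 + (tx >> 6)) & 127), 0, 63)
    lanes = []
    for j in range(4):
        col = (tx & 63) * 4 + j  # lane j of ramp((tx & 63) * 4, 1, 4)
        lanes.append(base + src_row * 128 + clamp(nearest_src(col), 0, 127))
    return lanes

for tx in (0, 1, 64):  # blockIdx.x = 0, outer loop index = 0
    s, l = store_lanes(0, 0, tx), load_lanes(0, 0, tx)
    print(tx, s, is_aligned_contiguous(s), l, is_aligned_contiguous(l))
```

Under these assumptions, for threadIdx.x = 0, 1 and 64 of block 0 the store lanes are [0, 1, 2, 3], [4, 5, 6, 7] and [256, 257, 258, 259] (all stride-1 ramps), while the load lanes are [0, 0, 1, 1], [2, 2, 3, 3] and [0, 0, 1, 1]. Because the column lanes come from floor(0.5 * (4k + j)), pairs of output columns read the same input element, so the load from placeholder behaves like a gather with repeated addresses. That would explain why the vectorized load has no valid RampNode index.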