Intuition on why this int8 algorithm is slower?

Maybe the slowdown is due to an int16 fallback? Or, since you modified the compute, the “right” schedule may no longer be getting called.
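To make the first guess a bit more concrete, here is a rough back-of-the-envelope sketch (not taken from your code) of why an int16 fallback could cost throughput: if the int8 operands get widened to int16 before the multiply, each vector register holds half as many elements. The 256-bit register width and the helper name `lanes_per_vector` are assumptions for illustration only; your target may differ.

```python
# Hypothetical illustration: elements per SIMD register when int8 operands
# are widened to int16 before the multiply. Not tied to any specific compiler.
VECTOR_BITS = 256  # assumed register width; adjust for your target


def lanes_per_vector(element_bits, vector_bits=VECTOR_BITS):
    """Number of elements of the given bit width that fit in one vector register."""
    return vector_bits // element_bits


int8_lanes = lanes_per_vector(8)
int16_lanes = lanes_per_vector(16)

# Ratio of elements processed per vector operation, int8 vs. widened int16.
print(f"int8 lanes: {int8_lanes}, int16 lanes: {int16_lanes}, "
      f"ratio: {int8_lanes / int16_lanes:.1f}x")
```

This only bounds the arithmetic side. It says nothing about whether the fallback actually happens in your build, so it is worth checking the dtypes in the lowered code and confirming which schedule was picked.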