[RFC][Tensorcore] INT4 end-to-end inference

I found that the GTX 1050 and RTX 3090 do not support the corresponding CUDA schedule. I think a 20-series card is needed (at least the RTX 2080 does support it).
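
In case it helps anyone check their card before trying, here is a rough sketch of the hardware-side check. It assumes the INT4 path relies on the sub-byte WMMA instructions, which the CUDA documentation lists as requiring compute capability 7.5 or higher. The names `KNOWN_CARDS`, `INT4_WMMA_MIN_CC` and `hardware_has_int4_wmma` are my own, not part of TVM or CUDA. This only covers the hardware requirement: a card can pass this check and still fail with the schedule, as the RTX 3090 did for me.

```python
# Compute capabilities of the cards mentioned above, taken from NVIDIA's
# published tables. Extend this (hypothetical) table for other cards.
KNOWN_CARDS = {
    "GTX 1050": (6, 1),  # Pascal
    "RTX 2080": (7, 5),  # Turing
    "RTX 3090": (8, 6),  # Ampere
}

# Assumption: the INT4 schedule needs sub-byte WMMA, which CUDA documents
# as available from compute capability 7.5 onward.
INT4_WMMA_MIN_CC = (7, 5)


def hardware_has_int4_wmma(cc):
    """Return True if a (major, minor) compute capability meets the INT4 WMMA minimum."""
    return tuple(cc) >= INT4_WMMA_MIN_CC


for name, cc in KNOWN_CARDS.items():
    # Reports hardware capability only, not whether a given schedule works.
    print(name, cc, hardware_has_int4_wmma(cc))
```

So the GTX 1050 failing is expected from the hardware side, while the RTX 3090 failure looks like something on the schedule or codegen side, though I have not confirmed that.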