Hello @Hzfengsy, if we use tensorized store (copy) instructions in step 4 to write back from registers to global memory directly, without going through shared memory, how can we fuse dense with other injective ops? Do you have any suggestions for us?