[MetaSchedule] [TensorCore] Please help me check whether I am using cuda-tensorcore to tune my operator

I am trying to use MetaSchedule to tune an operator with cuda-tensorcore and compare it with Ansor.

    def meta_opt(self):
        conv = topi.nn.conv2d_nchw(self.data, self.kernel, 1, 3, 1)
        print(conv)

        func = te.create_prim_func([self.data, self.kernel, conv])
        ir_module = IRModule({"main": func})

        database = ms.tune_tir(
            ir_module,
            "nvidia/geforce-rtx-3090",
            max_trials_global=1500,
            work_dir="./tune_tmp",
            task_name="main",
            space=ms.space_generator.PostOrderApply(
                sch_rules="cuda-tensorcore",
                postprocs="cuda-tensorcore",
                mutator_probs="cuda-tensorcore"))
        sch = ms.tir_integration.compile_tir(database, ir_module, "nvidia/geforce-rtx-3090")

        mod = tvm.build(sch.mod, target="cuda")
        a_np = np.random.randint(0, 255, size=(1, 3, 224, 224)).astype(self.dtype)
        b_np = np.random.uniform(size=(64, 3, 7, 7)).astype(self.dtype)
        a_nd = tvm.nd.array(a_np, self.dev)
        b_nd = tvm.nd.array(b_np, self.dev)
        c_nd = tvm.nd.empty((1, 64, 224, 224), dtype=self.dtype, device=self.dev)
        f_timer_after = mod.time_evaluator("main", self.dev)
        print("Time cost of MyModule after meta tuning: %.3f ms"
              % (f_timer_after(a_nd, b_nd, c_nd).mean * 1000))

But the tuning result shows almost no advantage over the Ansor result:

Time cost of MyModule after meta tuning: 0.047 ms

Time cost of MyModule after ansor tuning: 0.051 ms

I want to know whether I am really using cuda-tensorcore to tune my operator, and why the result does not show a significant improvement.

Thanks

There are a few possibilities:

  • The dtype is float32, while the TensorCore rules only support float16 and int8 inputs.
  • The input shape of the conv is (1, 3, 224, 224), which is not a good shape for Tensor Cores. Since we do not apply im2col, the input would need to be padded to (1, 16, 224, 224) if you want to force Tensor Cores (see the sketch after this list).
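
As a minimal sketch of the padding idea (my own illustration of one possible way to build the workload, not necessarily the exact layout the cuda-tensorcore rules pick): cast the inputs to float16 and zero-pad the channel axis from 3 to 16 before creating the PrimFunc. The extra channels are all zeros, so they do not change the convolution result:

    import tvm
    from tvm import te, topi
    from tvm.ir import IRModule

    # float16 inputs with the same NCHW shapes as in the question
    data = te.placeholder((1, 3, 224, 224), dtype="float16", name="data")
    kernel = te.placeholder((64, 3, 7, 7), dtype="float16", name="kernel")

    # Zero-pad the channel axis (axis 1) from 3 to 16 on both the data and the
    # weights; the zero channels contribute nothing to the output.
    data_pad = topi.nn.pad(data, [0, 0, 0, 0], [0, 13, 0, 0], name="data_pad")
    kernel_pad = topi.nn.pad(kernel, [0, 0, 0, 0], [0, 13, 0, 0], name="kernel_pad")

    conv = topi.nn.conv2d_nchw(data_pad, kernel_pad, stride=1, padding=3, dilation=1)
    func = te.create_prim_func([data, kernel, conv])
    ir_module = IRModule({"main": func})  # then tune this module as before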

You can print the generated CUDA code with `print(mod.imported_modules[0].get_source())` to check whether it uses Tensor Cores.
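
For example, as a quick heuristic (assuming `mod` is the module built in your script): Tensor Core kernels generated by TVM contain wmma / mma intrinsics, so you can simply search the source for them:

    cuda_src = mod.imported_modules[0].get_source()
    # Heuristic: Tensor Core kernels contain wmma fragments / mma instructions.
    print("tensor cores used:", "wmma" in cuda_src or "mma.sync" in cuda_src)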

OK, thanks. I will try later

If we don’t pad the input channels to 16, is it only the first convolution (the one on the raw input) that would not use Tensor Cores? Padding to 16 also adds some cost to the convolution, so it may not end up faster even with Tensor Cores.

Hi @Hzfengsy ,

Can you shed some more light on the padding of (1,3,224,224) to (1,16,224,224)?

How should I go about doing this?

I am currently trying to tune a ResNet fp16 quantized TFLite model with MetaSchedule; its input shape is (1, 224, 224, 3). How can I get it to hit the Tensor Cores per the criteria you specified above? Please help.

Thanks and regards, Krishna.