Grouped convolution performance penalty

You can copy the schedule for conv2d; this will at least bring some improvement. If you want further improvement, you need to write a specific schedule template for group_conv2d.
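
For intuition on why the conv2d schedule is a usable starting point: a grouped convolution can be viewed as several independent regular convolutions, each over its own slice of input channels and filters. Here is a minimal pure-Python sketch of that view. It is not TVM code; the function names `conv2d` and `group_conv2d` are only illustrative, and stride 1 with no padding is assumed.

```python
def conv2d(x, w):
    """Plain conv2d, stride 1, no padding.

    x: input as nested lists  [C][H][W]
    w: filters as nested lists [O][C][KH][KW]
    returns output             [O][OH][OW]
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    O, KH, KW = len(w), len(w[0][0]), len(w[0][0][0])
    OH, OW = H - KH + 1, W - KW + 1
    out = [[[0] * OW for _ in range(OH)] for _ in range(O)]
    for o in range(O):
        for i in range(OH):
            for j in range(OW):
                acc = 0
                for c in range(C):
                    for ki in range(KH):
                        for kj in range(KW):
                            acc += x[c][i + ki][j + kj] * w[o][c][ki][kj]
                out[o][i][j] = acc
    return out


def group_conv2d(x, w, groups):
    """Grouped conv2d expressed as one plain conv2d per group.

    x: [C][H][W], w: [O][C // groups][KH][KW]
    Group g reads input channels g*cin..(g+1)*cin and
    produces output channels g*cout..(g+1)*cout.
    """
    cin = len(x) // groups
    cout = len(w) // groups
    out = []
    for g in range(groups):
        x_g = x[g * cin:(g + 1) * cin]    # this group's input channels
        w_g = w[g * cout:(g + 1) * cout]  # this group's filters
        out.extend(conv2d(x_g, w_g))      # same inner computation as conv2d
    return out


# Small example: 4 input channels (channel c is filled with c + 1),
# 4 filters of all ones, 2x2 kernels, 2 groups.
x = [[[c + 1] * 3 for _ in range(3)] for c in range(4)]
w = [[[[1, 1], [1, 1]] for _ in range(2)] for _ in range(4)]
out = group_conv2d(x, w, groups=2)
```

The inner loop nest is the same as for conv2d, which is why its schedule can be reused. The extra group dimension is what the conv2d schedule does not know about, so a dedicated group_conv2d template is where further tuning would have to happen.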