Winograd variable tile_size knob

@merrymercy,

Keen to share these results:

  • One outstanding sample, obtained in fewer than 120 iterations with the proposed PR #3642 + patch_here, which addresses P and expands t1, t2, unroll, yt a bit:
before = 98 GFLOPS
after = 303 GFLOPS
[Task 13/20 (1, 384, 26, 26)|(256, 384, 3, 3)] (conv2d) {98.48 GFLOPS /winograd} Current/Best:  288.51/ 303.59 GFLOPS | Progress: (120/2000) | 349.65 s
  • I will re-tune all Mali TOPHUB entries and post the results (this will take a while). It seems much better results can be obtained on both float32 and float16.
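
For readers following along, here is a rough sketch of how a variable tile_size changes the Winograd tile count P for the Task 13 workload above. It is not the code from the PR or the patch: the helper name `winograd_tile_params`, the candidate tile sizes (2, 4, 6), and the padding = 1 / stride = 1 setting are all assumptions made for illustration, since the log line only gives the input and kernel shapes.

```python
import math


def winograd_tile_params(n, h, w, kernel=3, pad=1, stride=1, tile_size=4):
    """Hypothetical helper: geometry of Winograd F(tile_size x tile_size, kernel x kernel).

    Returns (alpha, n_h, n_w, p), where alpha is the transformed tile edge
    and p is the total number of output tiles over the batch.
    pad=1 and stride=1 are assumed, not taken from the log.
    """
    out_h = (h + 2 * pad - kernel) // stride + 1
    out_w = (w + 2 * pad - kernel) // stride + 1
    alpha = tile_size + kernel - 1       # input tile edge after transform
    n_h = math.ceil(out_h / tile_size)   # tiles along height
    n_w = math.ceil(out_w / tile_size)   # tiles along width
    p = n * n_h * n_w                    # total tiles
    return alpha, n_h, n_w, p


# Task 13 input shape: (1, 384, 26, 26), 3x3 kernel
for m in (2, 4, 6):
    alpha, n_h, n_w, p = winograd_tile_params(1, 26, 26, tile_size=m)
    print(f"tile_size={m}: alpha={alpha}, tiles={n_h}x{n_w}, P={p}")
```

Under these assumptions a larger tile_size shrinks P quickly (169, 49, 25 tiles for sizes 2, 4, 6), which is why the best choice is likely to differ per workload and is worth exposing to the tuner.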