Keen to post this:
- One outstanding sample obtained in fewer than 120 iterations with the proposed PR #3642 + patch_here, addressing P and expanding t1, t2, unroll, and yt a bit:
before=98GFLOPS
after=303GFLOPS
[Task 13/20 (1, 384, 26, 26)|(256, 384, 3, 3)] (conv2d) {98.48 GFLOPS /winograd} Current/Best: 288.51/ 303.59 GFLOPS | Progress: (120/2000) | 349.65 s
- I will re-tune all Mali TopHub entries and post the results (this will take a while); it seems much better results can be obtained for float32 and float16 too.
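
For anyone who wants to compare before/after numbers across many tasks, here is a minimal sketch that pulls the figures out of a progress line like the one quoted above and computes the speedup. It assumes the value in braces is the "before" figure (it matches the 98 GFLOPS quoted); the function name `parse_progress` and the regular expressions are my own illustration, not part of TVM.

```python
import re

# Progress line copied verbatim from the tuning output quoted above.
LINE = (
    "[Task 13/20 (1, 384, 26, 26)|(256, 384, 3, 3)] (conv2d) "
    "{98.48 GFLOPS /winograd} Current/Best: 288.51/ 303.59 GFLOPS "
    "| Progress: (120/2000) | 349.65 s"
)


def parse_progress(line):
    """Extract baseline, current, best GFLOPS and trial counts (hypothetical helper)."""
    # Assumption: the number inside {...} is the previous ("before") result.
    baseline = float(re.search(r"\{\s*([\d.]+)\s*GFLOPS", line).group(1))
    cur_best = re.search(r"Current/Best:\s*([\d.]+)/\s*([\d.]+)\s*GFLOPS", line)
    progress = re.search(r"Progress:\s*\((\d+)/(\d+)\)", line)
    return {
        "baseline": baseline,
        "current": float(cur_best.group(1)),
        "best": float(cur_best.group(2)),
        "trials": int(progress.group(1)),
        "total_trials": int(progress.group(2)),
    }


result = parse_progress(LINE)
speedup = result["best"] / result["baseline"]
print(f"best {result['best']} GFLOPS after {result['trials']} trials, "
      f"speedup {speedup:.2f}x over {result['baseline']} GFLOPS")
```

On the line above this gives a speedup of roughly 3.08x after 120 of 2000 trials.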