I printed the per-op timing log with graph_time_debug and found that fused_nn_contrib_conv2d_NCHWc_9 took roughly ten times as long as most of the other ops, even though its GFLOPS looked great during auto-tuning.
Node Name Ops Time(us) Time(%) Shape Inputs Outputs
--------- --- -------- ------- ----- ------ -------
fused_nn_contrib_conv2d_NCHWc_9 fused_nn_contrib_conv2d_NCHWc_9 6791.89 12.747 (1, 6, 71, 71, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_5 fused_nn_contrib_conv2d_NCHWc_5 2484.37 4.663 (1, 48, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_11 fused_nn_contrib_conv2d_NCHWc_11 2468.05 4.632 (1, 8, 147, 147, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_12 fused_nn_contrib_conv2d_NCHWc_12 1185.52 2.225 (1, 4, 147, 147, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_381 fused_nn_contrib_conv2d_NCHWc_38 810.38 1.521 (1, 24, 8, 8, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_38 fused_nn_contrib_conv2d_NCHWc_38 803.933 1.509 (1, 24, 8, 8, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_331 fused_nn_contrib_conv2d_NCHWc_33 711.435 1.335 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_32 fused_nn_contrib_conv2d_NCHWc_32 695.281 1.305 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_322 fused_nn_contrib_conv2d_NCHWc_32 680.986 1.278 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_33 fused_nn_contrib_conv2d_NCHWc_33 679.579 1.275 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_323 fused_nn_contrib_conv2d_NCHWc_32 658.919 1.237 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_333 fused_nn_contrib_conv2d_NCHWc_33 654.632 1.229 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_321 fused_nn_contrib_conv2d_NCHWc_32 653.408 1.226 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_332 fused_nn_contrib_conv2d_NCHWc_33 651.086 1.222 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_16 fused_nn_contrib_conv2d_NCHWc_16 630.064 1.183 (1, 12, 35, 35, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_162 fused_nn_contrib_conv2d_NCHWc_16 614.8 1.154 (1, 12, 35, 35, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_161 fused_nn_contrib_conv2d_NCHWc_16 612.939 1.15 (1, 12, 35, 35, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_421 fused_nn_contrib_conv2d_NCHWc_42 587.995 1.104 (1, 28, 8, 8, 16) 2 1
fused_layout_transform_transpose_multiply_add_nn_relu_transpose fused_layout_transform_transpose_multiply_add_nn_relu_transpose 561.455 1.054 (1, 192, 71, 71) 3 1
fused_nn_contrib_conv2d_NCHWc_271 fused_nn_contrib_conv2d_NCHWc_27 538.25 1.01 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_27 fused_nn_contrib_conv2d_NCHWc_27 528.974 0.993 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_30 fused_nn_contrib_conv2d_NCHWc_30 519.549 0.975 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_301 fused_nn_contrib_conv2d_NCHWc_30 515.873 0.968 (1, 24, 17, 17, 8) 2 1
fused_layout_transform_transpose_multiply_add_nn_relu_transpose_1 fused_layout_transform_transpose_multiply_add_nn_relu_transpose_1 508.914 0.955 (1, 64, 147, 147) 3 1
fused_nn_contrib_conv2d_NCHWc_4111 fused_nn_contrib_conv2d_NCHWc_41 496.091 0.931 (1, 12, 8, 8, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_14 fused_nn_contrib_conv2d_NCHWc_14 480.63 0.902 (1, 4, 35, 35, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_142 fused_nn_contrib_conv2d_NCHWc_14 478.135 0.897 (1, 4, 35, 35, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_141 fused_nn_contrib_conv2d_NCHWc_14 474.557 0.891 (1, 4, 35, 35, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_22 fused_nn_contrib_conv2d_NCHWc_22 471.106 0.884 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_311 fused_nn_contrib_conv2d_NCHWc_31 470.568 0.883 (1, 20, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_282 fused_nn_contrib_conv2d_NCHWc_28 468.939 0.88 (1, 20, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_31 fused_nn_contrib_conv2d_NCHWc_31 443.32 0.832 (1, 20, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_312 fused_nn_contrib_conv2d_NCHWc_31 442.435 0.83 (1, 20, 17, 17, 8) 2 1
fused_nn_avg_pool2d_10 fused_nn_avg_pool2d_10 442.288 0.83 (1, 288, 35, 35) 1 1
fused_nn_contrib_conv2d_NCHWc_313 fused_nn_contrib_conv2d_NCHWc_31 442.159 0.83 (1, 20, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_28 fused_nn_contrib_conv2d_NCHWc_28 433.852 0.814 (1, 20, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_281 fused_nn_contrib_conv2d_NCHWc_28 432.233 0.811 (1, 20, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_283 fused_nn_contrib_conv2d_NCHWc_28 431.392 0.81 (1, 20, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_1 fused_nn_contrib_conv2d_NCHWc_1 412.48 0.774 (1, 10, 8, 8, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_25 fused_nn_contrib_conv2d_NCHWc_25 410.248 0.77 (1, 24, 17, 17, 8) 2 1
fused_nn_avg_pool2d_9 fused_nn_avg_pool2d_9 393.421 0.738 (1, 256, 35, 35) 1 1
fused_nn_contrib_conv2d_NCHWc_39 fused_nn_contrib_conv2d_NCHWc_39 371.392 0.697 (1, 14, 8, 8, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_291 fused_nn_contrib_conv2d_NCHWc_29 370.567 0.695 (1, 10, 17, 17, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_292 fused_nn_contrib_conv2d_NCHWc_29 365.842 0.687 (1, 10, 17, 17, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_293 fused_nn_contrib_conv2d_NCHWc_29 363.063 0.681 (1, 10, 17, 17, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_29 fused_nn_contrib_conv2d_NCHWc_29 359.889 0.675 (1, 10, 17, 17, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_173 fused_nn_contrib_conv2d_NCHWc_17 342.677 0.643 (1, 6, 35, 35, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_172 fused_nn_contrib_conv2d_NCHWc_17 335.739 0.63 (1, 6, 35, 35, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_17 fused_nn_contrib_conv2d_NCHWc_17 329.798 0.619 (1, 6, 35, 35, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_171 fused_nn_contrib_conv2d_NCHWc_17 329.348 0.618 (1, 6, 35, 35, 16) 2 1
fused_nn_contrib_conv2d_NCHWc_410 fused_nn_contrib_conv2d_NCHWc_4 328.689 0.617 (1, 6, 17, 17, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_43 fused_nn_contrib_conv2d_NCHWc_4 322.785 0.606 (1, 6, 17, 17, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_49 fused_nn_contrib_conv2d_NCHWc_4 322.402 0.605 (1, 6, 17, 17, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_411 fused_nn_contrib_conv2d_NCHWc_4 315.062 0.591 (1, 6, 17, 17, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_44 fused_nn_contrib_conv2d_NCHWc_4 311.779 0.585 (1, 6, 17, 17, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_4 fused_nn_contrib_conv2d_NCHWc_4 311.495 0.585 (1, 6, 17, 17, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_41 fused_nn_contrib_conv2d_NCHWc_4 310.259 0.582 (1, 6, 17, 17, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_45 fused_nn_contrib_conv2d_NCHWc_4 309.839 0.582 (1, 6, 17, 17, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_42 fused_nn_contrib_conv2d_NCHWc_4 309.302 0.581 (1, 6, 17, 17, 32) 2 1
fused_nn_avg_pool2d_113 fused_nn_avg_pool2d_11 302.062 0.567 (1, 768, 17, 17) 1 1
fused_nn_contrib_conv2d_NCHWc_48 fused_nn_contrib_conv2d_NCHWc_4 300.728 0.564 (1, 6, 17, 17, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_47 fused_nn_contrib_conv2d_NCHWc_4 300.177 0.563 (1, 6, 17, 17, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_46 fused_nn_contrib_conv2d_NCHWc_4 299.445 0.562 (1, 6, 17, 17, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_261 fused_nn_contrib_conv2d_NCHWc_26 294.235 0.552 (1, 16, 17, 17, 8) 2 1
fused_nn_avg_pool2d_8 fused_nn_avg_pool2d_8 294.226 0.552 (1, 192, 35, 35) 1 1
fused_nn_contrib_conv2d_NCHWc_431 fused_nn_contrib_conv2d_NCHWc_43 292.24 0.548 (1, 6, 8, 8, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_26 fused_nn_contrib_conv2d_NCHWc_26 291.928 0.548 (1, 16, 17, 17, 8) 2 1
......
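To compare ops more easily than by reading the table, the profile text can be parsed and summed per op. The following is only a minimal sketch in plain Python: the regular expression is an assumption based on the column layout shown above (node name, op, time in microseconds, percentage, shape, inputs, outputs), and `parse_profile`, `time_per_op`, and `SAMPLE` are hypothetical names that are not part of TVM.

```python
import re
from collections import defaultdict

# Assumed row layout, taken from the table above:
#   <node name> <op> <time us> <time %> (<shape>) <inputs> <outputs>
ROW = re.compile(
    r"^(?P<node>\S+)\s+(?P<op>\S+)\s+(?P<time_us>\d+(?:\.\d+)?)\s+"
    r"(?P<pct>\d+(?:\.\d+)?)\s+(?P<shape>\(.*\))\s+"
    r"(?P<inputs>\d+)\s+(?P<outputs>\d+)$"
)


def parse_profile(text):
    """Return (node, op, time_us, shape) tuples; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:  # the header, the dashed rule and "......" do not match
            rows.append((m["node"], m["op"], float(m["time_us"]), m["shape"]))
    return rows


def time_per_op(rows):
    """Sum the time of all nodes that share an op, slowest op first."""
    totals = defaultdict(lambda: [0.0, 0])
    for _node, op, time_us, _shape in rows:
        totals[op][0] += time_us
        totals[op][1] += 1
    return sorted(
        ((op, total, count) for op, (total, count) in totals.items()),
        key=lambda item: item[1],
        reverse=True,
    )


# A few rows copied from the log above, used only as sample input.
SAMPLE = """\
Node Name Ops Time(us) Time(%) Shape Inputs Outputs
--------- --- -------- ------- ----- ------ -------
fused_nn_contrib_conv2d_NCHWc_9 fused_nn_contrib_conv2d_NCHWc_9 6791.89 12.747 (1, 6, 71, 71, 32) 2 1
fused_nn_contrib_conv2d_NCHWc_5 fused_nn_contrib_conv2d_NCHWc_5 2484.37 4.663 (1, 48, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_32 fused_nn_contrib_conv2d_NCHWc_32 695.281 1.305 (1, 24, 17, 17, 8) 2 1
fused_nn_contrib_conv2d_NCHWc_322 fused_nn_contrib_conv2d_NCHWc_32 680.986 1.278 (1, 24, 17, 17, 8) 2 1
"""

if __name__ == "__main__":
    for op, total, count in time_per_op(parse_profile(SAMPLE)):
        print(f"{op:40s} {total:10.2f} us over {count} node(s)")
```

Grouping by op rather than by node shows whether one kernel is slow on its own or whether the cost comes from the same kernel being called by many nodes.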
-
The best configuration was obtained by auto-tuning first, and graph optimization was performed afterwards. However, that best configuration is not necessarily suitable once the graph has been optimized. Isn't that a problem?
-
I would like to tune a single op individually and then update its entry in the best configuration produced by auto-tuning. Are there any tutorials for this?