Name Duration (us) Percent Device Count Argument Shapes
dense3 42,834.04 24.67 cuda0 12 float32[197, 3072], float32[768, 3072], float32[197, 768]
dense2 38,150.43 21.97 cuda0 12 float32[197, 768], float32[3072, 768], float32[197, 3072]
dense 36,904.06 21.25 cuda0 12 float32[197, 768], float32[2304, 768], float32[197, 2304]
fused_dense1_add5 10,385.50 5.98 cuda0 12 float32[197, 768], float32[768, 768], float32[1, 768], float32[197, 768]
fused_mean_add2_rsqrt_multiply_multiply1_add3 1,878.97 1.08 cuda0 25 float32[1, 197, 768], float32[], float32[1, 197, 768], float32[768], float32[768], float32[1, 197, 768]
fused_mean_subtract 1,694.72 0.98 cuda0 25 float32[1, 197, 768], float32[1, 197, 768]
batch_matmul 1,541.43 0.89 cuda0 12 float32[12, 197, 64], float32[12, 197, 64], float32[12, 197, 197]
fast_softmax 1,219.89 0.70 cuda0 12 float32[1, 12, 197, 197], float32[1, 12, 197, 197]
batch_matmul1 1,007.89 0.58 cuda0 12 float32[12, 197, 197], float32[12, 64, 197], float32[12, 197, 64]
fused_dense4_add9 354.53 0.20 cuda0 1 float32[1, 768], float32[1000, 768], float32[1, 1000], float32[1, 1000]
fused_conv2d_add 347.13 0.20 cuda0 1 float32[1, 3, 224, 224], float32[768, 3, 16, 16], float32[1, 768, 1, 1], float32[1, 768, 14, 14]
fused_reshape14_add6_divide_fast_erf_add7_multiply4_multiply5_reshape15 302.39 0.17 cuda0 12 float32[197, 3072], float32[3072], float32[], float32[], float32[], float32[197, 3072]
fused_reshape4_transpose3_reshape5_transpose4_multiply3_reshape7_transpose5 176.86 0.10 cuda0 12 float32[197, 1, 768], float32[1], float32[12, 197, 64]
fused_reshape2_add4_reshape3_expand_dims_transpose2_squeeze 149.37 0.09 cuda0 12 float32[197, 2304], float32[2304], float32[3, 197, 1, 768]
take 126.10 0.07 cuda0 36 float32[3, 197, 1, 768], int64[], float32[197, 1, 768]
fused_reshape4_transpose3_reshape5_multiply2_reshape6 111.29 0.06 cuda0 12 float32[197, 1, 768], float32[1], float32[12, 197, 64]
fused_reshape4_transpose3_reshape10_transpose6 98.29 0.06 cuda0 12 float32[197, 1, 768], float32[12, 64, 197]
power 97.80 0.06 cuda0 25 float32[1, 197, 768], float32[], float32[1, 197, 768]
fused_reshape16_add8_add1 96.25 0.06 cuda0 12 float32[197, 768], float32[768], float32[1, 197, 768], float32[1, 197, 768]
fused_reshape5_transpose7_reshape11 88.80 0.05 cuda0 12 float32[12, 197, 64], float32[197, 768]
fused_reshape12_transpose8_add1 66.91 0.04 cuda0 12 float32[197, 768], float32[1, 197, 768], float32[1, 197, 768]
fused_transpose1_reshape1 61.43 0.04 cuda0 12 float32[1, 197, 768], float32[197, 768]
vm.builtin.reshape 16.37 0.01 cuda0 12 float32[12, 197, 197]
vm.builtin.reshape 16.37 0.01 cuda0 12 float32[1, 12, 197, 197]
vm.builtin.reshape 14.32 0.01 cuda0 12 float32[1, 197, 768]
fused_reshape_transpose 7.17 0.00 cuda0 1 float32[1, 768, 14, 14], float32[1, 196, 768]
fused_concatenate_add1 5.38 0.00 cuda0 1 float32[1, 1, 768], float32[1, 196, 768], float32[1, 197, 768], float32[1, 197, 768]
take1 4.09 0.00 cuda0 1 float32[1, 197, 768], int64[], float32[1, 768]
vm.builtin.match_shape 2.05 0.00 cuda0 1 float32[1, 3, 224, 224]
vm.builtin.check_tensor_info 1.02 0.00 cuda0 1 float32[1, 3, 224, 224]
Sum 137,760.86 79.33 346
Total 173,646.79 cpu0 1
Total 138,202.12 cuda0 1
Configuration
Number of threads: 48
Executor: VM
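For context, the report above (note the `vm.builtin.*` entries) is the kind of output the Relax VM's built-in profiler prints. Below is a minimal sketch of how such a report can be collected; `mod` is a placeholder for the ViT model as a Relax IRModule that has already been through the usual optimization pipeline, and exact API names may differ between TVM versions:

```python
import numpy as np
import tvm
from tvm import relax

# Placeholders / assumptions: `mod` is the ViT IRModule lowered to Relax,
# with "main" as its entry function. API names follow recent TVM releases
# and may differ slightly in other versions.
target = tvm.target.Target("cuda")
dev = tvm.cuda(0)

ex = relax.build(mod, target)                     # compile to a Relax VM executable
vm = relax.VirtualMachine(ex, dev, profile=True)  # enable the built-in profiler

inp = tvm.nd.array(np.random.rand(1, 3, 224, 224).astype("float32"), dev)
report = vm.profile("main", inp)                  # per-kernel report like the one above
print(report)
```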
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.

Name Duration (us) Percent Device Count Argument Shapes Hash VM::Argument Shapes data_layout kernel_layout out_layout
vm_mod_fused_nn_conv2d_nn_bias_add 29,722.62 20.11 cuda0 1 float32[1, 3, 224, 224], float32[768, 3, 16, 16], float32[768], float32[1, 768, 14, 14] 429a84de173fb850 NCHW OIHW
vm_mod_fused_nn_batch_matmul 7,756.85 5.25 cuda0 12 float32[12, 197, 64], float32[12, 197, 64], float32[12, 197, 197] 899ca160dc6c15ba
vm_mod_fused_nn_dense_2 5,562.42 3.76 cuda0 12 float32[197, 3072], float32[768, 3072], float32[197, 768] 2bccb247e8f4d2aa
vm_mod_fused_nn_dense_1 5,039.32 3.41 cuda0 12 float32[197, 768], float32[3072, 768], float32[197, 3072] 46944603288f0973
vm_mod_fused_nn_dense 3,830.84 2.59 cuda0 12 float32[197, 768], float32[2304, 768], float32[197, 2304] 1cbb165ee78cf9d8
vm_mod_fused_nn_batch_matmul_1 1,742.97 1.18 cuda0 12 float32[12, 197, 197], float32[12, 64, 197], float32[12, 197, 64] 796513181fb550c6
vm_mod_fused_nn_dense_add 1,433.73 0.97 cuda0 12 float32[197, 768], float32[768, 768], float32[768], float32[197, 768] d4d2e7c97d5d8b5d
VM::UnknownOp 1,221.29 0.83 cpu0 590
vm_mod_fused_nn_softmax 1,196.44 0.81 cuda0 12 float32[1, 12, 197, 197], float32[1, 12, 197, 197] fc3e3d9c7014e1a2
VM::AllocStorage 368.36 0.25 cuda0 283
vm_mod_fused_power_mean 345.08 0.23 cuda0 25 float32[1, 197, 768], float32[], float32[1, 197, 1] 3f95947451ab9751
vm_mod_fused_mean 153.69 0.10 cuda0 25 float32[1, 197, 768], float32[1, 197, 1] b8d6cc7d81a7b6b2
VM::ReshapeTensor 146.37 0.10 cpu0 24
vm_mod_fused_subtract 84.31 0.06 cuda0 25 float32[1, 197, 768], float32[1, 197, 1], float32[1, 197, 768] 4890abf5cbc0ea15
VM::AllocTensor 79.81 0.05 cuda0 60 float32[197, 768]
VM::AllocTensor 65.49 0.04 cuda0 50 float32[1, 197, 768]
vm_mod_fused_reshape_add_divide_erf_add_multiply_multiply_reshape 63.57 0.04 cuda0 12 float32[197, 3072], float32[3072], float32[], float32[], float32[], float32[197, 3072] 75038c34748b3f90
VM::AllocTensor 63.44 0.04 cuda0 50 float32[1, 197, 1]
vm_mod_fused_reshape_add_reshape_expand_dims_transpose_squeeze 58.61 0.04 cuda0 12 float32[197, 2304], float32[2304], float32[3, 197, 1, 768] 1cac7a7434e9fde2
vm_mod_fused_take_reshape_transpose_reshape_transpose 56.50 0.04 cuda0 12 float32[3, 197, 1, 768], int64[], float32[12, 64, 197] 41479e6b9d6b8ace
vm_mod_fused_add_rsqrt_multiply_multiply_add_transpose_reshape 53.75 0.04 cuda0 12 float32[1, 197, 1], float32[], float32[1, 197, 768], float32[768], float32[768], float32[197, 768] 6011d0a41b960753
vm_mod_fused_add_rsqrt_multiply_multiply_add_reshape 52.44 0.04 cuda0 12 float32[1, 197, 1], float32[], float32[1, 197, 768], float32[768], float32[768], float32[197, 768] 555d3bbd09a6a518
vm_mod_fused_reshape_add_add 50.16 0.03 cuda0 12 float32[197, 768], float32[768], float32[1, 197, 768], float32[1, 197, 768] b6846f7871119c5b
vm_mod_fused_take_reshape_transpose_reshape_transpose_multiply_reshape_transpose 48.34 0.03 cuda0 12 float32[3, 197, 1, 768], int64[], float32[1], float32[12, 197, 64] 2c0a06cf78ffaf2b
VM::AllocTensor 46.04 0.03 cuda0 36 float32[12, 197, 64]
vm_mod_fused_take_reshape_transpose_reshape_multiply_reshape 44.34 0.03 cuda0 12 float32[3, 197, 1, 768], int64[], float32[1], float32[12, 197, 64] fa9456d2507e73dd
vm_mod_fused_reshape_transpose_add 43.00 0.03 cuda0 12 float32[197, 768], float32[1, 197, 768], float32[1, 197, 768] a5a0e4c6d7516ce5
vm_mod_fused_reshape_transpose_reshape 43.00 0.03 cuda0 12 float32[12, 197, 64], float32[197, 768] facd1ad07f0c636a
VM::AllocTensor 31.72 0.02 cuda0 24 float32[197, 3072]
VM::AllocTensor 17.40 0.01 cuda0 12 float32[3, 197, 1, 768]
VM::AllocTensor 17.40 0.01 cuda0 12 float32[197, 2304]
VM::AllocTensor 16.37 0.01 cuda0 12 float32[12, 64, 197]
vm_mod_fused_add_rsqrt_multiply_multiply_add_take 3.26 0.00 cuda0 1 float32[1, 197, 1], float32[], float32[1, 197, 768], float32[768], float32[768], int64[], float32[1, 768] 80985c00f17dfb35
VM::AllocTensor 2.05 0.00 cuda0 1 float32[1, 768]
VM::AllocTensor 1.02 0.00 cuda0 1 float32[1, 1000]
VM::AllocTensor 1.02 0.00 cuda0 1 float32[1, 768, 14, 14]
Sum 59,503.98 40.26 1,463
Total 59,266.05 cuda0 1
Total 147,807.59 cpu0 1
When I profile the same ViT model on my machine, which has an RTX 4090, Relay's layer-level optimization yields noticeably better performance than Relax.
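The second report (with the `vm_mod_fused_*` and `VM::` entries) matches what the Relay VM profiler prints. A comparable, hedged sketch for that path; `relay_mod` and `params` are placeholders for the same ViT model imported into Relay, and module paths may vary across TVM versions:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.runtime import profiler_vm

# Placeholders / assumptions: `relay_mod` and `params` hold the ViT model
# imported into Relay (e.g. via a frontend importer); paths and defaults
# may differ slightly between TVM versions.
target = tvm.target.Target("cuda")
dev = tvm.cuda(0)

with tvm.transform.PassContext(opt_level=3):
    exe = relay.vm.compile(relay_mod, target=target, params=params)

vm = profiler_vm.VirtualMachineProfiler(exe, dev)
inp = tvm.nd.array(np.random.rand(1, 3, 224, 224).astype("float32"), dev)
report = vm.profile(inp, func_name="main")  # per-kernel report like the one above
print(report)
```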