In the TVM paper, the fused LSTM cell shows a speedup of about 1.4x over the non-fused LSTM cell. However, when I run the LSTM test case under tvm/nnvm/tests/python/frontend/tensorflow/test_forward.py and profile the TVM-compiled LSTM op with nvprof, the result indicates that the LSTM op is not fused.
So, does the current TVM code support fusing an LSTM cell that is described by NNVM core tensor operators?
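For context, the compilation path in that test looks roughly like the sketch below (names such as graph_def and input_shape are placeholders, not the actual variables in test_forward.py):

import nnvm.compiler
from nnvm.frontend import from_tensorflow

# Convert the frozen TensorFlow GraphDef (which contains the LSTM cell)
# into an NNVM graph built from core tensor operators.
sym, params = from_tensorflow(graph_def)

# Compile for CUDA; graph-level operator fusion is applied at this step.
graph, lib, params = nnvm.compiler.build(
    sym, target='cuda', shape={'Placeholder': input_shape}, params=params)

Profiling the compiled module with nvprof gives: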
==5394== Profiling result:
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:   13.31%  31.264us     10  3.1260us  2.9760us  3.6800us  fuse_sigmoid___add_scalar___split_reshape_broadcast_mul_sigmoid_tanh_broadcast_mul_broadcast_add_kernel0
                  12.25%  28.768us     10  2.8760us  2.7840us  3.5840us  fuse_dense_kernel0
                  11.91%  27.969us     10  2.7960us  2.6240us  3.0720us  fuse_sigmoid_tanh_broadcast_mul_kernel0
                  10.30%  24.192us     10  2.4190us  2.3680us  2.5280us  fuse_concatenate_reshape_expand_dims_concatenate_kernel0
                   9.09%  21.344us     10  2.1340us  2.0800us  2.3680us  fuse_split_kernel3
                   8.57%  20.129us     10  2.0120us  1.9200us  2.5920us  fuse_reshape_split_reshape_concatenate_kernel0
                   8.39%  19.712us     10  1.9710us  1.8560us  2.5920us  fuse_split_kernel0
                   8.33%  19.553us     10  1.9550us  1.8560us  2.3360us  fuse_split_kernel2
                   8.33%  19.552us     10  1.9550us  1.8880us  2.3360us  fuse_split_kernel1
                   3.78%  8.8650us      4  2.2160us  2.0480us  2.5610us  [CUDA memcpy DtoH]
                   3.11%  7.2960us      7  1.0420us  896ns     1.4720us  [CUDA memcpy HtoD]
                   1.69%  3.9680us      2  1.9840us  1.6640us  2.3040us  [CUDA memcpy DtoD]
                   0.94%  2.2080us      1  2.2080us  2.2080us  2.2080us  [CUDA memset]
For comparison, here are the TensorFlow GPU kernel calls:
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:   24.05%  9.5050us      4  2.3760us  2.0800us  2.9120us  [CUDA memcpy DtoH]
                  19.59%  7.7440us      7  1.1060us  896ns     1.6640us  [CUDA memcpy HtoD]
                  15.87%  6.2730us      1  6.2730us  6.2730us  6.2730us  void tensorflow::functor::_GLOBAL__N__62_tmpxft_00006b62_00000000_11_lstm_ops_gpu_cu_compute_70_cpp1_ii_14ba60a9::lstm_gates<float, bool=0>(float const , float const , float const , float const , float const , float const , tensorflow::functor::_GLOBAL__N__62_tmpxft_00006b62_00000000_11_lstm_ops_gpu_cu_compute_70_cpp1_ii_14ba60a9::lstm_gates<float, bool=0>, float const *, float const *, float const *, float const *, float const *, float const , float, float, int, int)
                  15.22%  6.0160us      1  6.0160us  6.0160us  6.0160us  void gemv2N_kernel_val<float, float, float, int=128, int=1, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)
                  11.01%  4.3520us      2  2.1760us  1.6960us  2.6560us  [CUDA memcpy DtoD]
                   8.58%  3.3920us      1  3.3920us  3.3920us  3.3920us  void tensorflow::functor::_GLOBAL__N__62_tmpxft_00006b62_00000000_11_lstm_ops_gpu_cu_compute_70_cpp1_ii_14ba60a9::concat_xh(float, tensorflow::functor::_GLOBAL__N__62_tmpxft_00006b62_00000000_11_lstm_ops_gpu_cu_compute_70_cpp1_ii_14ba60a9::concat_xh const *, tensorflow::functor::_GLOBAL__N__62_tmpxft_00006b62_00000000_11_lstm_ops_gpu_cu_compute_70_cpp1_ii_14ba60a9::concat_xh const , int, int, int)
Try adding

with nnvm.compiler.build_config(opt_level=3):

before this line.
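Applied to the sketch above, that would look like this (same placeholder names; as far as I know, the OpFusion pass is gated on opt_level >= 1 in NNVM, and opt_level=3 enables all graph-level optimizations):

# Raising the optimization level enables the graph-level OpFusion pass,
# so adjacent element-wise ops can be compiled into a single kernel.
with nnvm.compiler.build_config(opt_level=3):
    graph, lib, params = nnvm.compiler.build(
        sym, target='cuda', shape={'Placeholder': input_shape}, params=params)

# Each fuse_* group in the compiled graph IR becomes one GPU kernel,
# so printing the IR shows what was actually fused.
print(graph.ir())

Re-running nvprof after this change should show fewer, larger fuse_* kernels if fusion takes effect.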