I generated CUDA code with TVM and measured the operation time with time_evaluator. The result is much faster than the same TVM-generated CUDA code integrated into TensorFlow.
Is this normal?
How do you integrate the TVM-generated code into TensorFlow? How do you measure it?
In time_evaluator, the only difference is that we skip the first warm-up run.
You can also try nvprof. In my tests, the results of time_evaluator(number=400) are very close to nvprof's.
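For reference, this is roughly how I use it (a minimal sketch; `func` is assumed to be the module returned by tvm.build, and the shapes below are placeholders, not your actual batch_matmul shapes):

    import numpy as np
    import tvm

    # `func` is assumed to be the module returned by tvm.build(s, [A, B, C], "cuda").
    # The shapes here are placeholders; substitute the shapes of your own workload.
    ctx = tvm.gpu(0)
    a = tvm.nd.array(np.random.uniform(size=(1, 4096)).astype("float32"), ctx)
    b = tvm.nd.array(np.random.uniform(size=(4096, 1024)).astype("float32"), ctx)
    c = tvm.nd.array(np.zeros((1, 1024), dtype="float32"), ctx)

    # number=400: the kernel is run 400 times and the mean is reported;
    # the first warm-up run is skipped.
    evaluator = func.time_evaluator(func.entry_name, ctx, number=400)
    print("mean kernel time: %f us" % (evaluator(a, b, c).mean * 1e6))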
The generated code is as follows:
extern "C" __global__ void minmt_batch_matmul_10_1_4096_1024_kernel0(float* __restrict__ A, float* __restrict__ B, float* __restrict__ C) {
  float C_local[1];
  C_local[0] = 0.000000e+00f;
  for (int k = 0; k < 4096; ++k) {
    C_local[0] = (C_local[0] + (A[((((int)threadIdx.y) * 4096) + k)] * B[(((((int)blockIdx.x) * 8) + ((int)threadIdx.x)) + (k * 1024))]));
  }
  C[(((((int)blockIdx.x) * 8) + (((int)threadIdx.y) * 1024)) + ((int)threadIdx.x))] = C_local[0];
}
I created a new op that calls this CUDA code, and then I ran the new op in TensorFlow, also with the first run skipped. But the speed in TensorFlow is much slower than what TVM shows.
I think TensorFlow has large overhead from other parts, so you cannot measure the kernel directly this way.
You can use nvprof to show the time cost of a single kernel.
Can you give some examples of those other parts in TensorFlow?
Do you measure it with a Python for loop? Can you post your measurement code?
If that is the case and you only measure a single operator, then the other overhead includes copying memory from CPU to GPU, building the graph, etc.
import time
import tensorflow as tf

A = tf.Variable(tf.random_normal(shape=[1, 16, 1, 64], mean=0, stddev=1), name='A')
B = tf.Variable(tf.random_normal(shape=[1, 16, 1, 64], mean=0, stddev=1), name='B')
C = tf.matmul(A, B, transpose_b=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    results = sess.run(C)  # warm-up run, not timed
    start = time.time()
    for i in range(1000):
        results = sess.run(C)
    end = time.time()
    print('cost time: %f us per run' % ((end - start) * 1e6 / 1000))
I run the code with the first run skipped (the sess.run(C) before the timing loop).
You should use
nvprof python your_tvm_script.py
nvprof python your_tensorflow_script.py
Then nvprof will tell you the execution time of a single kernel without other overhead
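If you want to stay inside TensorFlow, another option is to ask the runtime for per-op times with a full trace. This is a rough sketch assuming TF 1.x and the graph from your snippet; the output file name is arbitrary:

    import tensorflow as tf
    from tensorflow.python.client import timeline

    A = tf.Variable(tf.random_normal(shape=[1, 16, 1, 64]), name='A')
    B = tf.Variable(tf.random_normal(shape=[1, 16, 1, 64]), name='B')
    C = tf.matmul(A, B, transpose_b=True)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(C)  # warm-up: graph setup, memory allocation

        # Collect per-op timing from the TF runtime instead of wall-clock time,
        # so session and launch overhead can be separated from the kernel itself.
        options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        sess.run(C, options=options, run_metadata=run_metadata)

        tl = timeline.Timeline(run_metadata.step_stats)
        with open('timeline.json', 'w') as f:
            f.write(tl.generate_chrome_trace_format())
        # Open timeline.json in chrome://tracing to see the MatMul kernel duration.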
OK, I will try. I am following this post https://www.leiphone.com/news/201803/gHG5G6cCXBrzxjlu.html to speed up our transformer inference.
Do you mean the time measured by tvm.time_evaluator() is only the kernel time, without any other overhead?
Almost, with some negligible overhead such as shape and dtype checks. Did you try nvprof?
Yes, I have tried it. The kernel time measured by nvprof looks the same as the time measured by TVM.