Is it possible to implement a persistent RNN kernel using a TVM schedule?
Check out https://github.com/dmlc/tvm/tree/master/topi/recipe/rnn for some reference.
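For anyone else landing here, the recurrence in that recipe is expressed with TVM's scan operator. Below is a minimal, illustrative sketch of that pattern using the tvm.te API; the sizes and the bare matrix-vector update are made up for brevity, and the actual matexp/lstm recipes are more elaborate.

```python
import tvm
from tvm import te

# Illustrative sizes only; the real recipe is more involved.
n_seq, n_hidden = 128, 128

W = te.placeholder((n_hidden, n_hidden), name="W")

# State tensor: one hidden vector per time step.
s_state = te.placeholder((n_seq, n_hidden), name="s_state")
s_init = te.compute((1, n_hidden), lambda _, h: 1.0, name="s_init")

# One recurrence step: h_t = h_{t-1} @ W (a real RNN cell adds input, bias, nonlinearity).
k = te.reduce_axis((0, n_hidden), name="k")
s_update = te.compute(
    (n_seq, n_hidden),
    lambda t, h: te.sum(s_state[t - 1, k] * W[k, h], axis=k),
    name="s_update",
)

# te.scan stitches init and update into a sequential loop over the time axis.
s_scan = te.scan(s_init, s_update, s_state, name="s_scan")
sch = te.create_schedule(s_scan.op)
```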
Thanks, we will try. How does the performance of this persistent RNN implementation compare with the Baidu or cuDNN versions?
@tqchen We have compared the performance with Baidu's persistent RNN implementation. This version is about 2 times slower than Baidu's. Comparing the code, we found that Baidu uses software pipelining to hide the memory fetch within one batch, but this is not done in TVM's persistent RNN. According to the TVM paper, it seems we can use vthread to hide memory latency, but all the tutorials that use vthread are about avoiding shared memory bank conflicts, which confuses us a little. We want to know how to use vthread to hide memory fetches in CUDA. Is there any tutorial about that? Thanks.
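For reference, this is the binding pattern we see in those tutorials, i.e. a vthread axis bound alongside the regular block/thread axes. This is just a toy elementwise kernel to show the pattern, not the persistent RNN itself, and the split factors are arbitrary.

```python
import tvm
from tvm import te

# Toy kernel; shapes and split factors are illustrative only.
n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")

s = te.create_schedule(B.op)
bx, tx = s[B].split(s[B].op.axis[0], factor=64)
vx, tx = s[B].split(tx, factor=16)

# Each virtual thread is strip-mined into the generated code and the copies are
# interleaved, which is how the bank-conflict tutorials use vthread.
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(vx, te.thread_axis("vthread"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))

print(tvm.lower(s, [A, B], simple_mode=True))
```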
Nice observation. I am not sure whether vthread latency hiding applies to this case. However, we might be able to benefit from double buffering; there is an experimental double buffer schedule primitive in TVM.
This might involve enhancing TVM's scheduling primitives. If you can come up with a simplified piece of code that can serve as a basis for discussing what code transformation is needed, we might be able to find something to add to TVM.
@tqchen
The pseudo code is like this:
# s[batch][seq_len]
load s[0][0]
for t in range(seq_len):
    for i in range(batch):
        prefetch(s[i + 1][t])              # pipeline stage 1
        compute s[i][t + 1] using s[i][t]  # pipeline stage 2
        reduce_write(s[i][t + 1])          # pipeline stage 3
It uses a ping-pong buffer to overlap compute and prefetch. Is it possible to realize this using TVM primitives?
One related primitive in TVM is double_buffer, which does a prefetch rewrite, so something in that direction might be useful.
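As a rough illustration of what that prefetch rewrite looks like in a schedule, here is a toy sketch: a tensor staged through shared memory is marked double_buffer so that the copy for the next tile can overlap the compute of the current one. The names, sizes, and split factors here are made up, and this is not the persistent RNN schedule itself.

```python
import tvm
from tvm import te

# Toy elementwise stage standing in for one pipeline step.
n = 2048
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

s = te.create_schedule(B.op)

# Stage A through shared memory.
AA = s.cache_read(A, "shared", [B])

# One block walks serially over tiles (the `it` loop); threads work inside a tile.
bx, i = s[B].split(s[B].op.axis[0], nparts=8)
it, tx = s[B].split(i, factor=32)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))

# Attach the staging copy to the serial tile loop and mark it double_buffer:
# TVM rewrites the loop so the load of tile t+1 can overlap the compute of tile t.
s[AA].compute_at(s[B], it)
s[AA].double_buffer()

print(tvm.lower(s, [A, B], simple_mode=True))
```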
I am following up on this thread to see if there is anything I could do to improve RNN performance.
@feiyulv So the reason TVM is ~2 times slower here is the missing software pipelining: currently TVM does not overlap pipeline stage 1 with pipeline stages 2 and 3. And this issue could be resolved using a ping-pong buffer, as @tqchen mentioned.
Have I got that right?
@feiyulv Notably, in your code snippet each sample is processed separately. Does that mean Baidu's implementation prefers an un-batched way of processing sequences?