Relay function rewrite with CUDA via BYOC

I am trying to improve graph rewriting for my model on an NVIDIA server-class GPU. After pattern matching, I have a Relay function to work with. I tried to replace that Relay function with a customized Relay op, but failed. Given my limited experience with TVM, BYOC seems to be a more direct approach.

There seem to be two ways to implement the codegen, and I am not sure which one to take here.

For the codegen approach, I am not sure whether it works with CUDA. My guess is yes, since it is pretty much the same as the TVM codegen, but I would like to confirm that here. Also, is there an end-to-end, hello-world-style tutorial for this?

For the JSON approach, I would like to get a feel for the overhead compared with native TVM execution. Would the activations and outputs get copied every time the graph is run?