In the 5G era, network latency is ultra-low. An edge server in the carrier network (e.g., AWS Wavelength, MEC servers) can be a target for offloading inference, just like AI accelerators in the device.
My team is working on a framework to offload inference to external servers, and I wonder if we could implement it elegantly with TVM. I'm thinking of contributing features for that, but I'm not sure I'm on the right track. I'd like to hear comments before working on it.
Offload inference to edge server
I'm thinking of running an RPC server on the edge server and serving inference requests from edge devices.
+-------------+
| Edge server |
Inference | |
+------+ offloading | +------+ |
| Edge | -----------------> | rpc | |
|device| <----------------- |server| |
+------+ Results | +------+ |
| |
+-------------+
The device sends a model to the edge server first and, after that, sends inference requests. However, IIUC the expected usage of the RPC server is that it runs on the device itself. Is this an abuse of the RPC server?
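For reference, the device-side flow I have in mind maps onto TVM's existing RPC API roughly like this. A minimal sketch: the host name, port, and file name are placeholders, and I assume model.tar was exported with export_library for the server's target (graph_executor is called graph_runtime in older releases):

```python
import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import graph_executor

# Connect to the RPC server running on the edge server
# (hostname and port are placeholders).
remote = rpc.connect("edge-server.example.com", 9090)

# Ship the compiled model to the edge server and load it there.
remote.upload("model.tar")
rlib = remote.load_module("model.tar")

# Run inference remotely; inputs and outputs cross the network.
dev = remote.cpu(0)
module = graph_executor.GraphModule(rlib["default"](dev))
module.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
out = module.get_output(0)
```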
I also think it might be good if the RPC server had an option to load a model on the edge server side, like TensorFlow Serving.
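For context, the RPC server is currently launched bare, e.g. with `python -m tvm.exec.rpc_server --host 0.0.0.0 --port 9090`, and modules are pushed from the client afterwards; preloading a model at startup would be new behavior.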
Offload inference partially
When an input model has branches, offloading some of them to the edge server may reduce the total inference time.
compute
on server
+---+
+-->| |---+
+-----+ +---+ | +---+ | +------+
|input|-->| |---+ +-->|output|
+-----+ +---+ | +---+ | +------+
+-->| |---+
+---+
compute
on device
I guess we could support this with the heterogeneous runtime feature. IIUC, the current TVM runtime doesn't allow mixing RPC and normal contexts, so it looks like we need a change for that.
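As a strawman, the existing device-annotation API could mark the branch to offload. A minimal sketch (names and shapes are illustrative; today on_device can only name a local context, and letting it name an RPC context is the change I have in mind):

```python
import tvm
from tvm import relay

# Two parallel branches over the same input, as in the diagram above.
x = relay.var("x", shape=(1, 64))
w1 = relay.var("w1", shape=(64, 64))
w2 = relay.var("w2", shape=(64, 64))

# Branch we would like to run on the edge server. tvm.cpu(0) is a
# stand-in; under this proposal it would be an RPC context instead.
server_branch = relay.annotation.on_device(relay.nn.dense(x, w1), tvm.cpu(0))

# Branch that stays on the device.
device_branch = relay.nn.dense(x, w2)

out = relay.add(server_branch, device_branch)
func = relay.Function([x, w1, w2], out)
```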
Turn on offloading dynamically
Whether we need edge offloading depends on the situation: it's better to offload when the device is busy, but not when the network is unstable. I guess we can turn offloading on dynamically by adding another input to the model.
compute
on server
+-----+ +---+ X==1 +---+
|input|-->| |------->| |---+
+-----+ | | +---+ | +------+
| | +-->|output|
+-----+ | | X==0 +---+ | +------+
| X |-->| |------->| |---+
+-----+ +---+ +---+
compute
on device
The control input X determines whether the computation is offloaded to the edge server. I wonder if it is possible to prevent network transfer when X==0 in the above example.
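In Relay this switch could be expressed with a conditional. A minimal sketch, where X is a scalar boolean input and the dense ops stand in for the subgraphs in the diagram (note that relay.If needs the VM executor, since the graph executor doesn't support control flow):

```python
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 64))
w = relay.var("w", shape=(64, 64))
x = relay.var("x", shape=(), dtype="bool")  # the control input X

# Stand-ins for the two subgraphs in the diagram above.
server_branch = relay.nn.dense(data, w)  # would be placed on the edge server
device_branch = relay.nn.dense(data, w)  # stays on the device

# relay.If evaluates only the taken branch, so if the server branch
# lives behind the RPC boundary, no transfer should be needed when
# x is False; whether the runtime can guarantee that with a remote
# context is exactly my question.
out = relay.If(x, server_branch, device_branch)
func = relay.Function([data, w, x], out)
```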