[VM] Performance degradation of the VM runtime and dynamic shape support compared to Graph Runtime

To see the performance difference between Graph Runtime and VM Runtime, we construct a simple network of three dense+bias layers. The dimensions are 1024-512-256-128.

We construct three cases (a minimal sketch of the two input declarations follows the list):

  1. “Graph Runtime”: using Graph Runtime, with the input batch size fixed at compilation time.
  2. “VM Static”: using VM Runtime, with the input batch size fixed at compilation time.
  3. “VM Dynamic”: using VM Runtime, with the input batch size declared via “relay.Any()”, so that a single compilation supports different batch sizes.
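
For clarity, here is a minimal sketch of how the static and dynamic input declarations differ (the variable names are just illustrative):

    from tvm import relay

    # Static batch: every dimension is a concrete integer, so the
    # compiled code is specialized to batch size 8.
    static_data = relay.var("data", shape=(8, 1024), dtype="float32")

    # Dynamic batch: relay.Any() leaves the batch dimension symbolic,
    # so one compiled executable can serve any batch size.
    dynamic_data = relay.var("data", shape=(relay.Any(), 1024), dtype="float32")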

We measure the inference run time in milliseconds. The key findings are as follows:

We found that for the fixed batch size case, the VM runtime is slower than Graph Runtime, by up to 2x. We suspect this comes from the additional AllocStorage and AllocTensor instructions executed by the VM runtime.

For the dynamic batch size case, the VM runtime is up to 4.5x slower than the VM runtime with a fixed batch size. We found that there are many additional instructions for calculating tensor shapes in the dynamic input case. For example, the number of VM instructions with a static batch size is only 22, while the number of VM instructions needed to support a dynamic batch size is 85!
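
For anyone who wants to reproduce these counts, the VM executable exposes its instructions as text through its bytecode property; a rough sketch, reusing the vm_rt.compile call from the benchmark code shared later in this thread (the line-based count is only approximate, since the printed bytecode can contain more than bare instructions):

    exe = vm_rt.compile(mod, target="llvm", params=params)
    # Each printed instruction is something like alloc_storage,
    # alloc_tensor, invoke_packed, ret, ...
    print(exe.bytecode)
    print("approx. instruction count:", len(exe.bytecode.strip().splitlines()))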

So it seems that, compared with Graph Runtime, the VM runtime has some performance degradation, and with dynamic input sizes from “relay.Any()” the degradation is even larger due to the tensor shape calculations. How should we think about such performance degradation in the VM runtime and the dynamic shape case? Is there any plan to further improve the efficiency of the VM runtime, especially for dynamic shape support?

Thank you so much!


Are you running this on GPU or CPU? The performance degradation is expected on GPU, as we need heterogeneous runtime support to avoid redundant memory copies between CPU and GPU. @zhiics is currently working on this.

Besides, @jroesch is working on memory planning for dynamic shape cases to reduce the total number of memory allocations and reuse buffers as much as possible.

Hi there, on top of answering Haichen’s question, can you share the example programs you used to generate this data? We are actively working on the VM to close the parity gap with the graph runtime.

Thank you for the response! I tried it on the CPU backend with the “llvm” target.

Thank you for the response! Yeah, I can share the code. The following is the code for evaluating the Graph Runtime performance:

import datetime

import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# Globals used throughout the benchmark script.
target = "llvm"
ctx = tvm.cpu(0)
dtype = "float32"

def evaluate_graph_runtime(batch_size):
    mod, params, data_shape, out_shape = get_net(batch_size)
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(mod, target=target, target_host=target, params=params)
    m = graph_runtime.create(graph, lib, ctx)
    m.set_input(**params)
    input_shape = (batch_size, 1024)
    data_tvm = np.random.uniform(size=input_shape).astype(dtype)
    # Warm up once before timing.
    m.set_input('data', data_tvm)
    m.run()
    start_time = datetime.datetime.now()
    for i in range(100):
        m.set_input('data', data_tvm)
        m.run()
    end_time = datetime.datetime.now()
    tvm_time = end_time - start_time
    return tvm_time
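
As a side note on methodology, this datetime loop also times the Python-side set_input calls; TVM’s built-in time_evaluator, which times the packed “run” function from the C++ side, usually gives more stable numbers. A minimal sketch:

    # Average the graph runtime's "run" function over 100 invocations.
    ftimer = m.module.time_evaluator("run", ctx, number=100)
    print("Graph Runtime: %.3f ms" % (ftimer().mean * 1000))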

The following is the code for evaluating the VM Runtime performance, with “isany” specifying whether the batch size is dynamic:

# Assumes the older VM API in which VirtualMachine.init(ctx) exists;
# vm_rt.compile lowers a Relay module into a VM executable.
from tvm.relay.backend import vm as vm_rt

def evaluate_vm_runtime(batch_size, isany=False):
    batch_size_ = batch_size
    if isany:
        # Symbolic batch dimension: one compilation covers all batch sizes.
        batch_size_ = relay.Any()
    mod, params, data_shape, out_shape = get_net(batch_size_)
    exe = vm_rt.compile(mod, target="llvm", params=params)
    vm = vm_rt.VirtualMachine(exe)
    vm.init(ctx)
    input_shape = (batch_size, 1024)
    data_tvm = tvm.nd.array(np.random.uniform(size=input_shape).astype(dtype))
    input_list = [data_tvm]
    # Warm up once before timing.
    vm.run(input_list)
    start_time = datetime.datetime.now()
    for i in range(100):
        vm.run(input_list)
    end_time = datetime.datetime.now()
    tvm_time = end_time - start_time
    return tvm_time
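
With both helpers returning the elapsed time, the three cases can be driven like this (the batch sizes are just illustrative):

    for bs in (1, 16, 64):
        print("batch", bs, "graph runtime:", evaluate_graph_runtime(bs))
        print("batch", bs, "vm static:   ", evaluate_vm_runtime(bs))
        print("batch", bs, "vm dynamic:  ", evaluate_vm_runtime(bs, isany=True))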

The network is constructed with the following code:

from tvm.relay import testing  # makes relay.testing.layers available
from tvm.relay.testing import create_workload

def get_net(batch_size):
    # batch_size may be an int (static) or relay.Any() (dynamic).
    input_shape = (batch_size, 1024)
    output_shape = (batch_size, 128)
    data = relay.var("data", shape=input_shape, dtype=dtype)
    dense0 = relay.testing.layers.dense_add_bias(data=data, units=512, name='fc0')
    dense1 = relay.testing.layers.dense_add_bias(data=dense0, units=256, name='fc1')
    dense2 = relay.testing.layers.dense_add_bias(data=dense1, units=128, name='fc2')
    func = relay.Function(relay.analysis.free_vars(dense2), dense2)
    mod, params = create_workload(func)
    return mod, params, input_shape, output_shape
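
As a quick sanity check that the dynamic case really carries a symbolic batch dimension, the Relay function can be printed after construction (the commented output is only indicative):

    mod, params, _, _ = get_net(relay.Any())
    # The signature should show a symbolic batch dimension, e.g.
    #   fn (%data: Tensor[(?, 1024), float32], ...)
    print(mod["main"])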

Thank you!

@lfengad Have you figured out the reason?

Have you tried a pre-trained model with dynamic shape input instead of a model created from scratch in Relay? How is its performance with the VM runtime?