Performance regression (?): relay.build with LLVM/CPU

As part of addressing the warning "DeprecationWarning: legacy graph runtime behaviour of producing json / lib / params will be removed in the next release", I've noticed what appears to be a performance regression introduced by the new relay.build API.

Given the following trivial example:

import os
import numpy as np
import tvm
from PIL import Image
from tvm import te
from tvm.contrib import graph_runtime
from tvm import relay
from tvm.runtime import container
from tvm.runtime import vm as vm_rt
from tvm.relay import testing
from tvm.relay import vm
from tvm.contrib.download import download_testdata
from util import load_test_image

model_dir = "./mnasnet_1.3_224/"
tflite_model_file = os.path.join(model_dir, "mnasnet_1.3_224.tflite")
tflite_model_buf = open(tflite_model_file, "rb").read()

# Get TFLite model from buffer
try:
    import tflite
    tflite_model = tflite.Model.GetRootAsModel(tflite_model_buf, 0)
except AttributeError:
    import tflite.Model
    tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)

dtype="float32"
width=224
height=224
image_data = load_test_image(dtype, width, height)

input_tensor = "input"
input_shape = (1, 224, 224, 3)
input_dtype = "float32"

mod, params = relay.frontend.from_tflite(tflite_model,
                                         shape_dict={input_tensor: input_shape},
                                         dtype_dict={input_tensor: input_dtype})

target = "llvm -mattr=+neon"
tvm_targets = tvm.target.create(target)
cpu_target = "llvm"
target_host=cpu_target

cpudevice = tvm.runtime.cpu()

ctx = tvm.runtime.context("cpu")
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target, params=params)

module = graph_runtime.create(graph, lib, tvm.cpu())
module.set_input(input_tensor, tvm.nd.array(image_data))
module.set_input(**params)

ftimer = module.module.time_evaluator("run", ctx, number=1, repeat=10)
prof_res = np.array(ftimer().results) * 1000  # convert to milliseconds
print("%-20s %-19s (%s)" % ("mnasnet_1.3_224.tflite", "%.2f ms" % np.mean(prof_res), "%.2f ms" % np.std(prof_res)))

This yields ~1200ms with a 15ms std deviation on certain arm64 hardware.

If we change just the relay.build call (immediately before the graph_runtime.create call) to the following:

with tvm.transform.PassContext(opt_level=3):
    graph_mod = relay.build(mod, tvm_targets, params=params, target_host=target_host)

lib = graph_mod.get_lib()
params = graph_mod.get_params()
graph = graph_mod.get_json()

module = graph_runtime.create(graph, lib, tvm.cpu())

the time increases to ~3200ms with a 17ms std deviation.

Is PassContext properly constructed? Is this the right way to call relay.build?

If it is, then it seems like some deeper digging is in order.

Thanks.

I am also dealing with this deprecation warning, but I did not find any documentation about the new API. Can anyone please show me how to do the migration?

cc @FrozenGene about the API migration guide.

@tgall_foo can you explore a bit in terms of variants? In particular, it would be nice if you could try out other ways of passing arguments to relay.build without changing other parts of the code.
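As a rough sketch of the kind of variants meant here (reusing the mod and params from the TFLite frontend in the first post; which argument combination is the intended usage is exactly the open question):

import tvm
from tvm import relay

with tvm.transform.PassContext(opt_level=3):
    # Variant A: plain target string, legacy tuple-style return
    graph, lib, built_params = relay.build(mod, "llvm -mattr=+neon", params=params)

with tvm.transform.PassContext(opt_level=3):
    # Variant B: Target object plus explicit target_host, factory-module return
    graph_mod = relay.build(mod, tvm.target.create("llvm -mattr=+neon"),
                            target_host="llvm", params=params)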

https://github.com/apache/incubator-tvm/pull/6482 is a PR that adds an example of the migration.

@roastduck

As a rough example, given something like:

target = "llvm -device=arm_cpu -mcpu=thunderxt88 -mtriple=aarch64-unknown-linux-gnu mattr=+neon,+crc,+lse"
tvm_targets = tvm.target.create(target)
cpu_target = "llvm"
target_host=cpu_target

then the relay.build call changes from:

with tvm.transform.PassContext(opt_level=3):
    graph, lib, params = relay.build(mod, target, params=params)

to

with tvm.transform.PassContext(opt_level=3):
    graph_mod = relay.build(mod, tvm_targets, params=params, target_host=target_host)

lib = graph_mod.get_lib()
params = graph_mod.get_params()
graph = graph_mod.get_json()
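The rest of the original script can stay as-is, since graph_runtime.create(graph, lib, tvm.cpu()) accepts these components unchanged. As an alternative sketch (not necessarily the intended pattern), the factory module can also be wrapped directly, without pulling the components out:

from tvm.contrib import graph_runtime

# Sketch: wrap the factory module returned by the new relay.build directly,
# reusing graph_mod, input_tensor and image_data from the snippets above.
gmod = graph_runtime.GraphModule(graph_mod["default"](tvm.cpu()))
gmod.set_input(input_tensor, tvm.nd.array(image_data))
gmod.run()
out = gmod.get_output(0)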

@tqchen - yes, I'll do so. I've noticed it's more pronounced with mnasnet, at least on arm64. Something with more substance and data to follow.


I also want to note that the new module-based runtime will encapsulate weights and graph, so we can do

import tvm
from tvm import relay
from tvm.contrib import graph_runtime

def compile_and_run():
    # build the library using graph runtime
    lib = relay.build(...)
    lib.export_library("compiled_lib.so")
    # load it back as a runtime
    lib: tvm.runtime.Module = tvm.runtime.load_module("compiled_lib.so")
    # Call the library factory function for default and create
    # a new runtime.Module, wrap with graph module.
    gmod = graph_runtime.GraphModule(lib["default"](ctx))
    # use the gmod
    gmod.set_input("x", data)
    gmod.run()

For cases like uTVM, we might still need to get components out, where the member functions that @tgall_foo pointed out would be useful.
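For that case, a minimal sketch of pulling the individual artifacts back out of the factory module and saving them separately (assuming the graph_mod returned by the new-style relay.build above; the file names are just placeholders):

from tvm import relay

# Sketch: extract graph json, params and the compiled lib from the factory module
graph_json = graph_mod.get_json()
param_bytes = relay.save_param_dict(graph_mod.get_params())
graph_mod.get_lib().export_library("deploy_lib.so")

with open("deploy_graph.json", "w") as f:
    f.write(graph_json)
with open("deploy_params.params", "wb") as f:
    f.write(param_bytes)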

Thanks @tqchen and @tgall_foo for the explanation. The PR diff is very helpful for me, and maybe we can keep it as part of a migration guide.

I've started to dive into this today. I've run through a number of image-classification models from Google's model zoo (https://www.tensorflow.org/lite/guide/hosted_models), both quantized and fp32, to try to trip across other places where there might be obvious performance regressions.

- MobileNet*: no performance regressions observed (on arm64).
- mnasnet-1.3 (CPU with LLVM 10): on Intel it does not show the regression noted on arm64; on arm64 the regression is still observed.

I'll do a little more digging tomorrow across SqueezeNet and Inception, but given I am not able to reproduce this on Intel, I'm less concerned than I was. I'll also check 32-bit Arm.
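For reference, a rough sketch of how such a sweep can be scripted (not the exact script used here; the models list is a placeholder for the model-zoo files, and the timing code matches the first post):

import numpy as np
import tflite
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# `models` is a placeholder: a list of (name, path, input_name, input_shape) tuples.
for name, path, input_name, input_shape in models:
    with open(path, "rb") as f:
        tfl_model = tflite.Model.GetRootAsModel(f.read(), 0)
    mod, params = relay.frontend.from_tflite(
        tfl_model,
        shape_dict={input_name: input_shape},
        dtype_dict={input_name: "float32"})
    with tvm.transform.PassContext(opt_level=3):
        graph_mod = relay.build(mod, "llvm -mattr=+neon", params=params)
    gmod = graph_runtime.GraphModule(graph_mod["default"](tvm.cpu()))
    data = np.random.uniform(size=input_shape).astype("float32")
    gmod.set_input(input_name, tvm.nd.array(data))
    ftimer = gmod.module.time_evaluator("run", tvm.cpu(), number=1, repeat=10)
    res = np.array(ftimer().results) * 1000  # convert to milliseconds
    print("%-30s %.2f ms (std %.2f ms)" % (name, np.mean(res), np.std(res)))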

Stay tuned.
