How can we use tvm.relay.transform.ToMixedPrecision?

Hi,

I need to run inference on my model in fp16 mode, but I could not find a tutorial on this. I only found the function tvm.relay.transform.ToMixedPrecision. Is this relevant to fp16 inference, and how can we use it?

This is a very new experimental feature so there’s no tutorial yet. You can refer to the unit tests in this PR for use cases: https://github.com/apache/tvm/pull/8069
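
In short, the basic usage looks roughly like this (a minimal sketch adapted from those tests; as it turns out later in this thread, you currently also need to explicitly import tvm.relay.transform.mixed_precision so the op attributes get registered; mod, params, and target below are placeholders from whatever frontend you use):

import tvm
from tvm import relay
from tvm.relay.transform import mixed_precision  # registers the mixed-precision op attributes

# mod and params can come from any frontend, e.g. relay.frontend.from_onnx(...);
# target is e.g. "cuda" or "llvm".
mod = relay.transform.InferType()(mod)
mod = relay.transform.ToMixedPrecision(mixed_precision_type="float16")(mod)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)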

cc @AndrewZhaoLuo

Hi,

Thanks for replying! I came here from that PR on GitHub.

My code looks like this:

def compile_model(mod, params, target, logfile, save_path):
    tvm.relay.backend.compile_engine.get().clear()
    mod = tvm.relay.transform.ToMixedPrecision(
            mixed_precision_type='float16')(mod)
    with tvm.autotvm.apply_history_best(logfile):
        with tvm.transform.PassContext(opt_level=3):
            lib = tvm.relay.build(mod, target=target, params=params)
    lib.export_library(save_path)  # Save the compiled model; the path must end in .so, otherwise the C++ runtime cannot load it

And I have the error message:

Traceback (most recent call last):
  File "main.py", line 207, in <module>
    args.save_path)
  File "main.py", line 122, in compile_model
    mixed_precision_type='float16')(mod)
  File "/root/build/tvm/python/tvm/ir/transform.py", line 161, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "/root/build/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  23: TVMFuncCall
  22: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::IRModule (tvm::transform::Pass, tvm::IRModule)>::AssignTypedLambda<tvm::transform::{lambda(tvm::transform::Pass, tvm::IRModule)#7}>(tvm::transform::{lambda(tvm::transform::Pass, tvm::IRModule)#7}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  21: tvm::transform::Pass::operator()(tvm::IRModule) const
  20: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  19: tvm::relay::transform::FunctionPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  18: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::relay::Function (tvm::relay::Function, tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::relay::transform::ToMixedPrecision(tvm::runtime::DataType, int)::{lambda(tvm::relay::Function, tvm::IRModule, tvm::transform::PassContext)#1}>(tvm::relay::transform::ToMixedPrecision(tvm::runtime::DataType, int)::{lambda(tvm::relay::Function, tvm::IRModule, tvm::transform::PassContext)#1})::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  17: tvm::relay::ToMixedPrecision(tvm::RelayExpr const&, tvm::runtime::DataType const&, int)
  16: tvm::relay::MixedModeMutator::VisitExpr(tvm::RelayExpr const&)
  15: tvm::relay::MixedModeMutator::VisitLeaf(tvm::RelayExpr const&)
  14: _ZN3tvm5relay16MixedModeMutator17DispatchVisitExprERKNS_9Re
  13: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  12: tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&)
  11: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlR
  10: tvm::relay::MixedPrecisionPass::VisitExpr_(tvm::relay::FunctionNode const*)
  9: tvm::relay::ExprMutator::VisitExpr_(tvm::relay::FunctionNode const*)
  8: tvm::relay::MixedModeMutator::VisitExpr(tvm::RelayExpr const&)
  7: tvm::relay::MixedModeMutator::VisitLeaf(tvm::RelayExpr const&)
  6: _ZN3tvm5relay16MixedModeMutator17DispatchVisitExprERKNS_9Re
  5: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  4: tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&)
  3: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlR
  2: tvm::relay::MixedModeMutator::VisitExpr_(tvm::relay::CallNode const*)
  1: tvm::relay::MixedPrecisionPass::Rewrite_(tvm::relay::CallNode const*, tvm::RelayExpr const&)
  0: tvm::Op::GetAttrMapContainer(tvm::runtime::String const&)
  File "/root/build/tvm/src/ir/../node/attr_registry.h", line 146
TVMError: Attribute 'FTVMMixedPrecisionConversionType' is not registered

I built TVM from source; the commit id is 972d7b52d2dbdd7cc1db98c3af04b04c4fc31b75.

How can I make this work, please? @comaniac

As I mentioned, 99% of errors like TVMError: Attribute 'XXX' is not registered are due to a mismatch between the TVM Python code and the C++ shared library. You need to re-run cmake and rebuild TVM to make sure everything is up to date.

I still cannot make it work. This time I used the conda approach to install: first conda build --output-folder=conda/pkg conda/recipe, then conda install tvm -c ./conda/pkg. This takes longer than building with cmake, but I still get the same error message. Could you please tell me whether there might be another reason, and why other frequently-used classes and functions (such as relay.build) work fine?

Sorry, it turns out this is an issue in the implementation. You need to explicitly import the mixed_precision module to register the op attributes:

import tvm
from tvm import relay
from tvm.relay.transform import mixed_precision # Register op attribute
...
mod = tvm.relay.transform.ToMixedPrecision(mixed_precision_type='float16')(mod)

@AndrewZhaoLuo we should import this file in __init__.py so that it can be imported along with import tvm.relay.

@coincheung Yes, sorry about this. You need a bit of a magic line to register the op attributes that are missing in the error you show:

from tvm.relay.transform import mixed_precision

Sorry about that. I will make the changes @comaniac suggests.

Thanks a lot, it works now.

Hi,

I tested fp16 mode on my platform, but the performance was not what I expected.

I am using a T4 GPU. I exported my model from PyTorch to ONNX and used the TVM ONNX frontend to load it. Then I tuned the model on the T4 GPU with the default configuration (XGBoost tuner, 2000 trials).

After tuning, I compiled the model into two libs: one with mod = ToMixedPrecision('float16')(mod) as mentioned above, and the other without that line. The saved fp16 lib is about 27 MB, while the fp32 lib is 182 MB. However, when I ran inference 1000 times with each lib on my GPU, the fp32 lib took 38 s, but the fp16 lib took 110 s.

Could you tell me whether that is normal for TVM? Shouldn't fp16 mode run faster than fp32?

Or do I need to add some options when tuning my model to make fp16 faster than fp32?

Yeah, an increase in performance should be expected, especially on a tensor core GPU. Debugging why this isn't the case is a bit harder. If you send me the model and script you are using, I might have time to take a look.

Here are the models I have benchmarked so far: FP16 Improvements - Google Sheets

Furthermore, you should be running other optimizations like constant folding to remove the many cast operations on weights: TVM-Sandbox/benchmark_fp16.py at 3e949e86a8c2cd2a7af10d412243ef4f47da1109 · AndrewZhaoLuo/TVM-Sandbox · GitHub

I’m guessing you might not be doing that.
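
Roughly, what I mean is something like the following (a simplified sketch of the kind of thing the linked benchmark script does, not the exact code; the helper name convert_to_fp16 is just for illustration):

from tvm import relay
from tvm.relay.transform import mixed_precision  # registers the mixed-precision op attributes

def convert_to_fp16(mod, params):
    # Bind the weights into the module so the casts inserted on them become constant expressions.
    mod["main"] = relay.build_module.bind_params_by_name(mod["main"], params)
    mod = relay.transform.InferType()(mod)
    mod = relay.transform.ToMixedPrecision(mixed_precision_type="float16")(mod)
    # Fold the fp32 -> fp16 casts on the weights away at compile time.
    mod = relay.transform.FoldConstant()(mod)
    return mod

After that you build and tune the converted module as usual.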

Thanks for pointing this out; I will add it and give it a try. Could you please tell me what run_other_opts means?

Hi,

After adding the graph_optimization function, I found that I can make it work for cuda, but not for llvm in fp16 mode.

I load the onnx model, compile it, and run inference with the C++ API. A simplified piece of code is uploaded here: tvm_debug.zip - Google Drive

Just run the script with sh run.sh and you will see the error message. Would you please spend some time helping me figure out the problem? Thanks a lot.

It’s just shorthand for “run other optimizations”. When benchmarking or tuning I always run these. It’s free performance.

I didn't run your script, but I believe the main problem isn't with the mixed precision pass.

Even the FP32 model cannot run. This is the error I get:

E                   at /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/type_traits:3545
E             File "/Users/andrewzhaoluo/Desktop/dev_tvm/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
E                   at /Users/andrewzhaoluo/Desktop/dev_tvm/tvm/src/runtime/c_runtime_api.cc:525
E               rv = local_pyfunc(*pyargs)
E             File "/Users/andrewzhaoluo/Desktop/dev_tvm/tvm/python/tvm/relay/op/strategy/generic.py", line 243, in _compute_conv2d
E               return [topi_compute(*args)]
E             File "/Users/andrewzhaoluo/Desktop/dev_tvm/tvm/python/tvm/topi/x86/conv2d.py", line 129, in conv2d_nchw
E               packed_out = conv2d_NCHWc(data, kernel, strides, padding, dilation, layout, layout, out_dtype)
E             File "/Users/andrewzhaoluo/Desktop/dev_tvm/tvm/python/tvm/autotvm/task/topi_integration.py", line 165, in wrapper
E               node = topi_compute(cfg, *args)
E             File "/Users/andrewzhaoluo/Desktop/dev_tvm/tvm/python/tvm/topi/x86/conv2d.py", line 196, in conv2d_NCHWc
E               cfg.define_split(
E             File "/Users/andrewzhaoluo/Desktop/dev_tvm/tvm/python/tvm/autotvm/task/space.py", line 730, in define_split
E               return self._add_new_transform(SplitSpace, name, axes, policy, **kwargs)
E             File "/Users/andrewzhaoluo/Desktop/dev_tvm/tvm/python/tvm/autotvm/task/space.py", line 829, in _add_new_transform
E               axes = [x if isinstance(x, (VirtualAxis, Axis)) else self.axis(x) for x in axes]
E             File "/Users/andrewzhaoluo/Desktop/dev_tvm/tvm/python/tvm/autotvm/task/space.py", line 829, in <listcomp>
E               axes = [x if isinstance(x, (VirtualAxis, Axis)) else self.axis(x) for x in axes]
E             File "/Users/andrewzhaoluo/Desktop/dev_tvm/tvm/python/tvm/autotvm/task/space.py", line 687, in axis
E               return VirtualAxis(var)
E             File "/Users/andrewzhaoluo/Desktop/dev_tvm/tvm/python/tvm/autotvm/task/space.py", line 141, in __init__
E               raise RuntimeError("Invalid type of axis: " + str(type(var)))
E           RuntimeError: Invalid type of axis: <class 'tvm.tir.expr.Add'>

../tvm/python/tvm/_ffi/_ctypes/packed_func.py:237: TVMError

I believe this is because the conv2d compute definition for llvm does not support dynamic shaped inputs. You can make a github issue for this.
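
If your model does have dynamic input shapes, one workaround is to pin them to static shapes when importing from ONNX (a rough sketch; "input" and the dimensions below are placeholders for your model's actual input name and shape):

import onnx
from tvm import relay

onnx_model = onnx.load("model.onnx")  # placeholder path
shape_dict = {"input": (1, 3, 512, 512)}  # replace with your model's real input name and a fixed shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)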

Hi,

Sorry that I did not describe my problem in detail.

I did not hit problems during auto-tuning (maybe because I used opt_level=3; I am not sure).

I got error messages like this:

./segment: symbol lookup error: ./saved_model_v1_llvm_fp16.so: undefined symbol: __gnu_f2h_ieee
./segment: symbol lookup error: ./saved_model_v1_llvm_fp16.so: undefined symbol: __gnu_f2h_ieee
./segment: symbol lookup error: ./saved_model_v1_llvm_fp16.so: undefined symbol: __gnu_f2h_ieee
./segment: symbol lookup error: ./saved_model_v1_llvm_fp16.so: undefined symbol: __gnu_f2h_ieee
./segment: symbol lookup error: ./saved_model_v1_llvm_fp16.so: undefined symbol: __gnu_f2h_ieee
./segment: symbol lookup error: ./saved_model_v1_llvm_fp16.so: undefined symbol: __gnu_f2h_ieee
./segment: symbol lookup error: ./saved_model_v1_llvm_fp16.so: undefined symbol: __gnu_f2h_ieee
./segment: symbol lookup error: ./saved_model_v1_llvm_fp16.so: undefined symbol: __gnu_f2h_ieee
./segment: symbol lookup error: ./saved_model_v1_llvm_fp16.so: undefined symbol: __gnu_f2h_ieee

It happens when I load the llvm_fp16.so file with the C++ API. I did not hit this problem when loading the llvm_fp32.so, cuda_fp16.so, or cuda_fp32.so files. What could be the reason for this?

Try upgrading your version of llvm to the latest? Which version are you using?

It seems like this diff for llvm should have implemented things: https://reviews.llvm.org/rGf1ac850e7721a79238e213f2e6624bf8a2f3a7bd

Unfortunately I don’t have a lot of ideas beyond google searching.

Hi,

I am using LLVM installed from apt on Ubuntu 18.04; the version is LLVM 6.0.

By the way, I am running my program in a docker container: nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04. I think version 6.0 is higher than the minimum required in the docs, which is 4.0.

Would you please tell me which version you were using when you benchmarked the models?

I updated my llvm with the following command:

 bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"

But the problem still exists.

Did you also rebuild TVM with the new LLVM? It sounds like a linking problem of some sort now. 6.0 should be enough, though, so I am not sure what happened.

Yes, I rebuilt TVM from source and also rebuilt my C++ code, though from the output of ldd I see that my C++ binary does not link against LLVM.

The problem is still there, but the number of error message lines has dropped from 5 to 3. Maybe I am getting closer to the solution :slight_smile:
