How to run fp32 model in fp16 mode?

I want to read a local fp32 .tflite model and run it in fp16 mode. How can I achieve that?

You need to apply the ToMixedPrecision transformation before compiling the model.

For example:

    import tvm
    from tvm.relay.transform import ToMixedPrecision

    # Keep the network outputs in their original dtype (fp32) after conversion.
    with tvm.transform.PassContext(
        config={"relay.ToMixedPrecision.keep_orig_output_dtype": True}
    ):
        mod = ToMixedPrecision("float16")(mod)

And the target hardware definitely has to support fp16 operations. That is the case for GPUs and recent ARM CPUs, for example, but not yet for x86 processors.

One more note: the transformation algorithm classifies operations into three groups - ops that are always converted, ops that are converted only if their neighbor ops are converted, and ops that are never converted. The list of ops can be seen here; I mention it just so you are not surprised why some ops might not be converted.

You can modify the list, but then you will additionally have to verify the accuracy of the network. That is a requirement in any case, but with the default lists of ops the accuracy drop is usually negligible.
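If you do experiment with the lists, a minimal sketch of overriding the conversion category for one op is shown below. It assumes the register_mixed_precision_conversion hook and the MIXED_PRECISION_* constants exposed by recent TVM versions (see python/tvm/relay/transform/mixed_precision.py for the defaults); nn.softmax is only an arbitrary example op.

    # Sketch: force one op to stay in fp32 during ToMixedPrecision.
    # Assumes the register_mixed_precision_conversion hook and the
    # MIXED_PRECISION_NEVER constant available in recent TVM versions.
    from tvm.relay.op import register_mixed_precision_conversion
    from tvm.relay.transform.mixed_precision import MIXED_PRECISION_NEVER

    # level=11 overrides the default registration (level 10) for this op.
    @register_mixed_precision_conversion("nn.softmax", level=11)
    def _softmax_stays_fp32(call_node, mixed_precision_type):
        # Return (conversion category, accumulation dtype, output dtype).
        return [MIXED_PRECISION_NEVER, "float32", "float32"]

After registering something like this, ToMixedPrecision leaves that op in fp32; accuracy still has to be checked as described above.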

Thanks for the suggestion. I have tried this transformation, but the benchmark results show that performance got worse.

What is the target hardware?

Have you tuned the network at all, for fp32 and for fp16?

I tried it on a Dimensity 9000, both on its CPU and on its GPU, which is a Mali.
I have tuned the network in fp32, and that tuning cannot be transferred to fp16 this way.

Yes, you have to re-tune the model after it has been converted to fp16. The old tuning statistics will be ignored. Only after that can we compare performance and make claims about problems.

Thanks, I will give tuning fp16 a try.
Before that, I want to run the network in both fp32 and fp16 without tuning and make a comparison.

Yes, untuned vs. untuned is also a proper comparison. At the same time, the default kernel configurations are the result of the schedule author's analysis on a certain device, and most likely they will not fit another device model, even one with the same architecture. I.e. it can be compared, but I would not draw many conclusions from these results.

And I recommend tuning with more trials rather than fewer, especially in the case of AutoTVM, where we cannot estimate the tuning efficiency. In the case of AutoScheduler you can always take a look at total_latency.tsv and see whether performance continues to improve or has reached a flat state.
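For reference, a rough AutoScheduler re-tuning flow could look like the sketch below. It is only a sketch, not a drop-in script: mobilenet_fp16.json and the trial count are placeholders, and for an on-device board you would also configure an RPC runner in TuningOptions.

    import tvm
    from tvm import auto_scheduler, relay

    # Sketch: re-tune the fp16 module from scratch; old fp32 logs are not reused.
    log_file = "mobilenet_fp16.json"  # placeholder name
    tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=20000,  # more trials rather than fewer
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
    # While this runs, total_latency.tsv records the estimated end-to-end
    # latency, so you can see whether it is still improving or has flattened.
    tuner.tune(tune_option)

    # Compile with the tuned schedules applied.
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(
            opt_level=3, config={"relay.backend.use_auto_scheduler": True}
        ):
            lib = relay.build(mod, target=target, params=params)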

I followed your instructions to tune a MobileNet on the GPU, which made a big improvement in the benchmark. Thanks a lot.
But I can't reproduce such a result on the mobile CPU. After I apply ToMixedPrecision to my model, it gets slower compared with fp32, both during and after the tuning process.

What target do you use for compilation and tuning?

llvm -mtriple=aarch64-linux-gnu -mattr=+neon

Augh, I have never verified what happens on ARM CPUs for fp16. I was mostly concentrating on GPUs. I will try to verify.

Adding +fullfp16 might help (but also might not). There was a study done on FP16 on CPUs, and it seemed like some networks saw improved performance while some became slower (the slowness was due to ToMixedPrecision introducing extra casts; you can also try running FoldConstant after ToMixedPrecision to get rid of some of them). Also, I think some of the slowness was due to some schedules not being particularly vectorization friendly… That was all without any tuning, though. cc @ashutosh-arm who knows more :slight_smile:
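For example, something along these lines (just a sketch, using the same PassContext config as in the snippet above):

    import tvm
    from tvm import relay

    with tvm.transform.PassContext(
        config={"relay.ToMixedPrecision.keep_orig_output_dtype": True}
    ):
        mod = relay.transform.ToMixedPrecision("float16")(mod)
        # FoldConstant can fold away some of the extra casts introduced above.
        mod = relay.transform.FoldConstant()(mod)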

Sorry, what is fullfp16 and where do I add it? Thanks.

Thanks a lot. Now I have another error when applying ToMixedPrecision for the Mali GPU.
The error log looks like this:


    File "/home/tvm/tvm/python/tvm/auto_scheduler/measure.py", line 1150, in _rpc_run
      func.entry_func(*loc_args)
    File "/home/tvm/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
      rai...
    ...
    Check failed: ret == 0 (-1 vs. 0) :
    TVMError: Cannot handle float16 as device function argument , all_cost:2.11, Tstamp:1669419374.61)


Besides that, the layer named vm_mod_fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1 has no performance statistics. In the schedule table it looks like this:


| ID | Task Description | Latency (ms) | Speed (GFLOPS) | Trials |
| --- | --- | --- | --- | --- |
| 4 | vm_mod_fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1 | - | - | 5120 |
| 5 | vm_mod_fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_2 | - | - | 8640 |
| 6 | vm_mod_fused_nn_conv2d_add_add | 0.876 | 213.75 | 64 |


As you can see, the 6th layer (a conv2d) has its Latency and Speed data, but the 4th and 5th layers do not.

Sorry, it should be added to -mattr. It enables FP16 support on the processor (in case it wasn't already enabled).
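With the target you posted above, it would become something like:

    import tvm

    # +fullfp16 appended to the existing -mattr list
    target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mattr=+neon,+fullfp16")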

Thanks, I will try and reply soon.

I tried fullfp16 + ToMixedPrecision on ResNet-50, but it gets slower on the Neon CPU after tuning.

Just for reference, adding a link to the mattr used in a sample test: https://github.com/apache/tvm/blob/36d89a28fb984caa83082b034c46180a82dcd1ea/tests/python/integration/test_arm_aprofile.py#L28