Can TVM use FP16 for inference?

wangbingjia · December 1, 2021, 10:13am

My target platform is cuda.

I would like to ask: does TVM support converting FP32 models to FP16 and then using FP16 for auto-tuning or auto-scheduling? Also, does TVM currently support using Tensor Cores for FP16 model inference and auto-tuning?

Thanks for your reply!

comaniac · December 1, 2021, 5:48pm

TVM should be able to do so. At the very least, auto-tuning, backend lowering, and compilation are supported for FP16. You can try converting an FP16 model directly to Relay to see if that works.

Meanwhile, @AndrewZhaoLuo introduced a pass that automatically casts the ops that can benefit from FP16 execution. You can check out this PR:

github.com/apache/tvm

[Relay] [Pass] Add mixed precision (e.g. FP16) model conversion pass

apache:main ← AndrewZhaoLuo:andrewluo-add-fp16-conversion-pass

opened 05:41PM - 18 May 21 UTC

AndrewZhaoLuo

+1189 -21

This implements a pass to convert an fp32 relay graph into an fp16 version. The …RFC is described [here](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994/).

### Changes:
- Add pass and unittests
- Make DataType hashable via std::hash (it is kind of slow though). (include/tvm/runtime/data_type.h)
- Fix a bug with conv2d not working with accumulation_dtypes different than inputs (python/tvm/topi/nn/conv2d.py)

### Testing
Some unittests.

**Models tested (onnx):**

Image Classification
- Densenet (https://github.com/onnx/models)
- Resnet18 (https://github.com/onnx/models)
- EfficientNet-4 (https://github.com/onnx/models)
- InceptionV3 (Converted from torchvision.models)

Object Detection
- Tinyyolov2 (https://github.com/onnx/models)
- Yolov2 (https://github.com/onnx/models)
- Yolov4 (https://github.com/onnx/models)
- SSD-resnet34 (https://github.com/onnx/models)

Embedding Models
- Arcface (https://github.com/onnx/models)
- RFB-face (https://github.com/onnx/models)

Super resolution:
- Some random super resolution model (https://github.com/onnx/models)

NLP:
- GPT-2 (https://github.com/onnx/models)
- DistillBert (Converted from HuggingFace's transformers lib)

**Models tested (relay native):**
- ResNet18
- ResNet18-3D
- Densenet
- LSTM (unrolled)
- Squeezenet
- Mobilenet

By tested I mean I confirm it did some transformation on the graph and a forward pass could be run on CPU and matches the fp32 output somewhat. I have nothing on performance metrics or other devices yet.

### Future PRs (in order of priority)
1. Show this actually leads to speedups!
2. Make the coloring function and output_dtype/accumulation_dtype functions extensible via Python
3. An extensive audit of existing relay ops into the coloring lists.
4. Write a pass to fold unnecessary casts e.g. cast(fp16) --> cast(fp32) --> cast(fp16) can probably just be one cast(fp16)
5. Rename the colors into something less generic and easily confused for something else
6. Rewrite the signature of functions automatically (right now everything is kept in fp32 and internal to the function things are cast)

### Known issues
- Right now the pass will mutate nodes in the original relay graph

cc @mbrookhart , @csullivan please take a look and add relevant reviewers

### Speedups (add as I go along)
BERT w/ input shape [1, 128] on M1 Mac (based on https://github.com/octoml/Apple-M1-BERT) and 10000 tuning trials:
- FP32 version - Mean inference time (std dev): 107.82 ms (3.39 ms)
- FP16 version - Mean inference time (std dev): 80.04 ms (6.19 ms)
- ~25% speedup!

Yolov2 (https://github.com/onnx/models) w/ 10000 tuning trials on M1 Mac
- FP32 version - Mean inference time (std dev): 112.21 (3.75 ms)
- FP16 version - Mean inference time (std dev): 71.05 ms (4.04 ms)
- ~36% speedup!

wangbingjia · December 1, 2021, 6:29pm

Also, it looks like AndrewZhaoLuo's work doesn't use GPU Tensor Cores. Is that right?

wangbingjia · December 1, 2021, 6:30pm

Thanks for your reply, it works for me. I also want to ask: can TVM convert an fp32 model to an fp16 model? If so, does it use quantization to do the conversion?

AndrewZhaoLuo · December 1, 2021, 7:51pm

Hey wangbingjia, TVM can convert the model from fp32 → fp16, as you have seen. I am not sure what you mean by “use quantize to convert”, but the pass also quantizes the weights and such of the model (or it should).

In general you need to apply some more optimizations to clean up the graph after FP16 quantization: https://github.com/AndrewZhaoLuo/TVM-Sandbox/blob/f1f9f698be2b7a8cc5bcf1167d892cd915eb7ce7/fp16_pass/benchmark_fp16.py#L19
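A minimal sketch of the conversion followed by clean-up passes, loosely modelled on the linked benchmark script. The helper name `convert_to_fp16` is hypothetical, and the exact choice of clean-up passes is an assumption; adjust it for your model and TVM version.

```python
def convert_to_fp16(mod, params):
    """Convert an fp32 Relay module to mixed fp16 precision, then tidy the graph.

    `mod` is assumed to be a tvm.IRModule and `params` its weight dict, e.g.
    as returned by a Relay frontend importer.
    """
    # Imported lazily so the helper can be defined without TVM installed.
    import tvm
    from tvm import relay

    # Bind weights as constants so FoldConstant can pre-compute their fp16 casts.
    mod["main"] = relay.build_module.bind_params_by_name(mod["main"], params)

    seq = tvm.transform.Sequential(
        [
            relay.transform.InferType(),  # the conversion pass needs type info
            relay.transform.ToMixedPrecision("float16"),
            # Clean-up: merge duplicated casts, then fold the constant ones.
            relay.transform.EliminateCommonSubexpr(),
            relay.transform.FoldConstant(),
        ]
    )
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)
```

Because the weights are bound into the module here, a later `relay.build` call would not need `params` again.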

As for tensorcore support, I believe autoscheduler does not support it right now, and support in autotvm is inconsistent depending on the workload. @masahi and @junrushao might know more about this though.

I will say I have run some matrix-heavy workloads in autotvm and gotten a 5x speedup, so it's probably hitting the tensorcores there.

hjiang · December 1, 2021, 9:31pm

Thanks @wangbingjia, @comaniac, and @AndrewZhaoLuo for the post. I am also interested in this model conversion topic and have a question similar to the one @wangbingjia raised: what is the main difference between “relay.quantize” and “tvm.relay.transform.ToMixedPrecision”, besides the fact that “relay.quantize” converts float32 to int8 while “ToMixedPrecision” converts float32 to float16 by default?

My second question is about a unified interface. Some other frameworks, such as TFLite, provide a unified model conversion interface like the following:

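# Assumes `converter` is an existing tf.lite.TFLiteConverter; the remaining
# converter settings (optimizations, representative dataset) are omitted here.

# To int8
converter.inference_input_type = tf.int8  # or tf.uint8
converter.inference_output_type = tf.int8  # or tf.uint8
tflite_quant_model = converter.convert()

# To fp16
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()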

TVM currently uses the separate interfaces mentioned above for model conversion. Would it be helpful to wrap these interfaces/APIs behind a single one, for example a function called “convert” or “quantize”?

wangbingjia · December 2, 2021, 2:06am

Yep, thanks for your reply. My graduation thesis is about TVM, and I am very interested in it. As for the Tensor Core question, I will try what you suggested.

Q1: As far as I know, there are two ways to convert an fp32 model to fp16 in TVM: one is “tvm.relay.transform.ToMixedPrecision”, the other is “relay.quantize.qconfig”. I am not sure whether this is correct.

Q2: After I use the TVM interface to reduce the model precision to int8, inference becomes more than 50 times slower. Is this normal? I am trying to locate the problem. Alternatively, I will later try to optimize starting from an already-quantized model.

Q3: As @hjiang said, that is my question too.

AndrewZhaoLuo · December 2, 2021, 2:41am

Q1. relay.quantize.qconfig has to do with int8 quantization, I believe. Integers in this space map back to the real numbers via an affine transform stored in the qconfigs. This is not relevant to FP16. FP16 still uses a floating point format; it just has fewer bits in the mantissa and exponent. If you want to convert to fp16, you should use ToMixedPrecision only. FP16 quantization and integer quantization are quite different.
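To make the distinction concrete, here is a small standalone illustration in plain Python (no TVM involved). The helper names and the `scale`/`zero_point` values are made up for illustration; in real int8 quantization they come from calibration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_int8(x: float, scale: float, zero_point: int) -> int:
    """Affine int8 quantization: q = clamp(round(x / scale) + zero_point)."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))


def dequantize_int8(q: int, scale: float, zero_point: int) -> float:
    """Map an int8 code back to a real number: x ~= scale * (q - zero_point)."""
    return scale * (q - zero_point)


# Illustrative parameters covering roughly [-1.0, 1.0].
scale, zero_point = 1.0 / 127, 0

x = 0.1
fp16_x = to_fp16(x)                      # still a float, just stored in fewer bits
q = quantize_int8(x, scale, zero_point)  # an integer code (13 here)
int8_x = dequantize_int8(q, scale, zero_point)
```

The fp16 value needs no extra parameters to interpret, whereas the int8 code is meaningless without its `scale` and `zero_point`.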

Q2. There are a lot of variables here. In general, if you are using the autoscheduler to schedule, you might see slowdowns, since I believe it lacks support for using hardware intrinsics like vectorization with integers. It’s also hard to add, which is why no one has done it yet. With autotvm, some operators support int8 quantization well (e.g. they use hardware intrinsics well) and others do not.

In short, the answer is complicated and depends on the model and how you tuned/ran it. I have seen 5x slowdowns in the past due to gaps in integer support at the scheduling level. FP16 quantization works very well if you have hardware that supports it well (e.g. a new enough ARM CPU (ISA v8.2+), a GPU, or something OpenCL supports with FP16 intrinsics).

Q3. @hjiang, we do not have a unified interface. Instead, use ToMixedPrecision for lower-bit floating point quantization (e.g. fp16 or bfloat16). A unified interface might be an interesting idea to add, though our int8 automatic quantization needs a bit of work first. IMO though, fp16 and integer quantization are quite different and have different needs (e.g. for integer quantization you should have a calibration dataset, while for fp16 you don’t necessarily need one).
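For illustration only, a thin wrapper along the lines @hjiang suggested could look like the sketch below. The `convert` function is hypothetical and is not part of TVM; it simply dispatches to the two existing mechanisms, and the `global_scale` calibration settings are an assumed simple default, not a recommendation.

```python
def convert(mod, params, dtype="float16"):
    """Hypothetical unified entry point; TVM itself does not provide this."""
    if dtype not in ("float16", "bfloat16", "int8"):
        raise ValueError(f"unsupported dtype: {dtype!r}")

    # Imported lazily so the argument check above works without TVM installed.
    from tvm import relay

    if dtype == "int8":
        # Integer quantization. A fixed global scale is the simplest mode;
        # calibrating on a dataset generally gives better accuracy.
        with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
            return relay.quantize.quantize(mod, params)

    # Lower-bit floating point: no calibration data is needed.
    mod = relay.transform.InferType()(mod)
    return relay.transform.ToMixedPrecision(dtype)(mod)
```

The two branches take different inputs in practice (int8 benefits from calibration data, fp16 does not), which is part of why a single interface is harder than it looks.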

wangbingjia · December 2, 2021, 2:46am

Thank you very much for your reply! It has cleared up my confusion, and I will try the approach you described.

wangbingjia · December 2, 2021, 3:52am

Hi @AndrewZhaoLuo, when I use “ToMixedPrecision”, I get this error. Could you tell me why?

AndrewZhaoLuo · December 2, 2021, 7:14pm

If you share your model, I can help debug. This appears to be an operator which did not get converted correctly.

wangbingjia · December 3, 2021, 1:12am

Yes, of course. I think it might be because of “InstanceNorm2d”; when I remove “InstanceNorm2d”, it works. With fp16 precision, the model is much faster than when using “relay.quantize” to convert to int8, although neither is as fast as the original fp32.

import torch
import torch.nn as nn
from collections import namedtuple
import math
import torch.utils.model_zoo as model_zoo

__all__ = ['ResNet_IBN', 'resnet50_ibn_a']

model_urls = {
    'resnet50': 'https://download.pytorch.org/models/resnet50-19c8e357.pth',
    'resnet101': 'https://download.pytorch.org/models/resnet101-5d3b4d8f.pth',
    'resnet152': 'https://download.pytorch.org/models/resnet152-b121ed2d.pth',
}

def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)

def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)

class IBN(nn.Module):
    def __init__(self, planes):
        super(IBN, self).__init__()
        half1 = int(planes / 2)
        self.half = half1
        half2 = planes - half1
        self.IN = nn.InstanceNorm2d(half1, affine=True)
        self.BN = nn.BatchNorm2d(half2)

    def forward(self, x):
        split = torch.split(x, self.half, 1)
        out1 = self.IN(split[0].contiguous())
        out2 = self.BN(split[1].contiguous())
        out = torch.cat((out1, out2), 1)
        return out


class Bottleneck_IBN(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, ibn=False, stride=1, downsample=None):
        super(Bottleneck_IBN, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        if ibn:
            self.bn1 = IBN(planes)
        else:
            self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class ResNet_IBN(nn.Module):

    def __init__(self, last_stride, block, layers, frozen_stages=-1, num_classes=1000):
        scale = 64
        self.inplanes = scale
        super(ResNet_IBN, self).__init__()
        self.conv1 = nn.Conv2d(3, scale, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(scale)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.frozen_stages = frozen_stages
        self.layer1 = self._make_layer(block, scale, layers[0])
        self.layer2 = self._make_layer(block, scale * 2, layers[1], stride=2)
        self.layer3 = self._make_layer(block, scale * 4, layers[2], stride=2)
        self.layer4 = self._make_layer(block, scale * 8, layers[3], stride=last_stride)
        self.avgpool = nn.AvgPool2d(7)
        self.fc = nn.Linear(scale * 8 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.InstanceNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(nn.Conv2d(self.inplanes, planes * block.expansion, kernel_size=1, stride=stride, bias=False), nn.BatchNorm2d(planes * block.expansion),)

        layers = []
        ibn = True
        if planes == 512:
            ibn = False
        layers.append(block(self.inplanes, planes, ibn, stride, downsample))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes, ibn))

        return nn.Sequential(*layers)

    def _freeze_stages(self):
        if self.frozen_stages >= 0:
            self.bn1.eval()
            for m in [self.conv1, self.bn1]:
                for param in m.parameters():
                    param.requires_grad = False

        for i in range(1, self.frozen_stages + 1):
            m = getattr(self, 'layer{}'.format(i))
            print('layer{}'.format(i))
            m.eval()
            for param in m.parameters():
                param.requires_grad = False

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x3 = x
        x = self.layer4(x)

        return x, x3
        # return x

    # def load_param(self, model_path):
    def load_param(self, model_path='E:/model/resnet50_ibn_a.pth'):
        param_dict = torch.load(model_path)
        print(param_dict)
        print('*'*60)
        if 'state_dict' in param_dict:
            param_dict = param_dict['state_dict']
        for i in param_dict:
            if 'fc' in i:
                continue
            self.state_dict()[i.replace('module.', '')].copy_(param_dict[i])


ArchCfg = namedtuple('ArchCfg', ['block', 'layers'])
arch_dict = {
    #'resnet18': ArchCfg(BasicBlock, [2, 2, 2, 2]),
    #'resnet34': ArchCfg(BasicBlock, [3, 4, 6, 3]),
    'resnet50': ArchCfg(Bottleneck_IBN, [3, 4, 6, 3]),
    'resnet101': ArchCfg(Bottleneck_IBN, [3, 4, 23, 3]),
    'resnet152': ArchCfg(Bottleneck_IBN, [3, 8, 36, 3]),}

def resnet50_ibn_a(last_stride=1, pretrained=False, **kwargs):
    """Constructs a ResNet-50 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet_IBN(last_stride, Bottleneck_IBN, [3, 4, 6, 3], **kwargs)
    block_dict = dict()
    if pretrained:
        state_dict = torch.load('E:/model/resnet50_ibn_a.pth')
        print('Load pretrained model from ===> E:/model/resnet50_ibn_a.pth')
        model.load_param('E:/model/resnet50_ibn_a.pth')
        # print(state_dict.items())
        for k, v in state_dict.items():
            # print(k, v)
            if 'layer4.' in k:
                block_dict.update({k: v})
    return model

#def get_resnet50_org():
 #   model = ResNet_IBN(last_stride=1, arch_dict['resnet50'].block, arch_dict['resnet50'].layers)
  #  return model

# if __name__ == '__main__':
#     import torch
#     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # PyTorch v0.4.0
#
#     model = resnet50_ibn_a(pretrained=False)
#     input = torch.randn(1, 3, 384, 128)
#     out1, out2 = model(input)
#     print(out1.shape)
#     print(out2.shape)
#     print('&'*80)
#     # print(y.shape)
#     # print(x3.shape)

AndrewZhaoLuo · December 6, 2021, 8:26pm

I’ll take a closer look this week. As for speed, which type of device are you running this on? Not all targets have good support for fp16, most notably x86 CPUs.

lizexu · May 16, 2022, 11:57am

Do you have a concrete demo of fp16 inference, ideally an example on CUDA, that I could learn from?

AndrewZhaoLuo · May 16, 2022, 5:22pm

This entire repo should be helpful though does not use CUDA. You should be able to adapt this to run on CUDA easily: https://github.com/AndrewZhaoLuo/CenterFaceTVMDemo.

An issue IIRC with CUDA is our tensorcore support is a bit fuzzy at this time so you might not see as high speedups as with tensorrt for example.
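As a rough sketch of what adapting to CUDA might involve, the helper below builds an already-converted Relay module for the `cuda` target and runs it once with the graph executor. The function name `run_on_cuda` and its arguments are hypothetical, and no tuning is performed, so performance would likely be far from optimal.

```python
def run_on_cuda(mod, params, input_name, input_data):
    """Build a (possibly fp16-converted) Relay module for CUDA and run it once.

    `input_name` is the name of the model's input tensor and `input_data`
    a NumPy array of matching shape and dtype (both assumed by this sketch).
    """
    # Imported lazily so the helper can be defined without TVM installed.
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor

    target = tvm.target.Target("cuda")
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)

    dev = tvm.cuda(0)
    module = graph_executor.GraphModule(lib["default"](dev))
    module.set_input(input_name, input_data)
    module.run()
    return module.get_output(0).numpy()
```

According to the PR description quoted earlier in this thread, the converted function keeps its fp32 signature and casts internally, so the input array would normally stay float32.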

lizexu · May 17, 2022, 1:48am

Thanks, I’ll take a look

twmht · July 24, 2023, 2:25am

@AndrewZhaoLuo

Does TVM support TensorCore with AutoScheduler now?

yzh119 · July 24, 2023, 8:13am

MetaSchedule supports Tensor Cores.

twmht · July 24, 2023, 8:39am

@yzh119

Thank you. I came across MetaSchedule today. Is there any documentation on how to use it?

yzh119 · July 24, 2023, 9:01am

Unfortunately, the documentation for metaschedule is poor.

There is an API reference doc: tvm.meta_schedule — tvm 0.14.dev0 documentation but I don’t think it is informative enough for new users.

The unit tests might help you find some examples of using meta-schedule, for example:

github.com

apache/tvm/blob/684689e9247630c5cce66363b9f877284d94523a/tests/python/unittest/test_meta_schedule_tune_tir.py


Besides, this paper might help you understand the design of Metaschedule.
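For orientation, the tuning flow exercised by unit tests like the one linked above looks roughly like the sketch below. It tunes a small matmul on CPU; the helper name `tune_matmul`, the workload, and the trial counts are illustrative assumptions, and the `tune_tir`/`compile_tir` calls reflect the API around the linked commit, so they may differ in other TVM versions.

```python
def tune_matmul(work_dir, n=128, target_str="llvm --num-cores=4"):
    """Tune a small float32 matmul with MetaSchedule and return the best schedule."""
    # Imported lazily so the helper can be defined without TVM installed.
    import tvm
    from tvm import te
    from tvm import meta_schedule as ms

    # Describe C = A @ B with tensor expressions, then lower to a TIR PrimFunc.
    a = te.placeholder((n, n), name="A")
    b = te.placeholder((n, n), name="B")
    k = te.reduce_axis((0, n), name="k")
    c = te.compute((n, n), lambda i, j: te.sum(a[i, k] * b[k, j], axis=k), name="C")
    func = te.create_prim_func([a, b, c])

    target = tvm.target.Target(target_str)
    database = ms.tir_integration.tune_tir(
        mod=func,
        target=target,
        work_dir=work_dir,  # tuning records are written here
        max_trials_global=32,
        num_trials_per_iter=16,
    )
    # compile_tir returns None if no valid schedule was found.
    return ms.tir_integration.compile_tir(database, func, target)
```

This sketch only covers a CPU target. Tuning for Tensor Cores would additionally need a fully specified CUDA target and fp16 workloads, which are not shown here.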