[Pytorch] Inferencing Bert Dense model for Question Answering

I first converted a DistilBERT model from transformers, fine-tuned for question answering, into a JIT-compiled version. Running inference with that JIT-compiled model (.pt format) without TVM worked fine.

Now, to see the speed gain with TVM, I tried:

import tvm
from tvm import relay
import numpy as np
import torch
import torchvision
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    '/Users/arjun/datasets/distilbert-base-cased-distilled-squad')

model = torch.jit.load('/Users/arjun/datasets/distilbert-base-cased-distilled-squad.pt')

model.eval()

text = "The Apache Software Foundation is an American nonprofit corporation to support Apache software projects, including the Apache HTTP Server. The ASF was formed from the Apache Group and incorporated on March 25, 1999. The Apache Software Foundation is a decentralized open source community of developers."
question = "When was ASF group formed ?"
encoding = tokenizer.encode_plus(question, text, return_tensors="pt", truncation='only_second', padding='max_length')
print(encoding)
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]
shape_list = [("input_ids", input_ids.shape), ("attention_mask", attention_mask.shape)]

mod, params = relay.frontend.from_pytorch(model, [input_ids, attention_mask])
print(mod, params)

I am getting this error:

ANTLR runtime and generated code versions disagree: 4.8!=4.7.2
Traceback (most recent call last):
  File "/Users/arjun/tvm/tvm_test.py", line 36, in <module>
    mod, params = relay.frontend.from_pytorch(model, [input_ids, attention_mask])
  File "/Users/arjun/environments/venv/lib/python3.7/site-packages/tvm-0.7.dev1-py3.7-macosx-10.9-x86_64.egg/tvm/relay/frontend/pytorch.py", line 2641, in from_pytorch
    _report_missing_conversion(op_names, convert_map)
  File "/Users/arjun/environments/venv/lib/python3.7/site-packages/tvm-0.7.dev1-py3.7-macosx-10.9-x86_64.egg/tvm/relay/frontend/pytorch.py", line 2127, in _report_missing_conversion
    raise NotImplementedError(msg)
NotImplementedError: The following operators are not implemented: ['aten::masked_fill_']

So is it that some operations inside the model are not yet implemented on the TVM side? Does that mean that, as of now, only pruned models from transformers can be used? In that case, would a fine-tuned question answering model work at all? Even though the weights are pruned, the language model still learned by predicting masked words, so if pruned models work, there should be a way to make normal models such as distilbert-base-cased-distilled-squad work too.

Correct me if I am wrong. Any help would be appreciated :)

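This happens because PyTorch has two variants of this operator: masked_fill (out-of-place) and masked_fill_ (in-place). The TVM PyTorch frontend apparently only registers the first one, so maybe an entry like the following should be added to its conversion map, mapping the in-place variant to the existing converter: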

"aten::masked_fill_": self.masked_fill,
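
To make the idea concrete, here is a minimal, self-contained sketch of how an operator lookup table of this kind behaves. It is not TVM's actual code: `convert_masked_fill`, `build_convert_map` and `report_missing_conversion` are hypothetical names that only loosely mirror the frontend, and the converter works on plain Python lists rather than Relay expressions. It shows why the in-place name fails the lookup and how aliasing it to the existing converter would get past that check.

```python
# Simplified, hypothetical model of an "op name -> converter" table.
# None of these functions are TVM APIs; they only illustrate the lookup.

def convert_masked_fill(values, mask, fill):
    """Out-of-place masked fill on plain lists: replace values where mask is True."""
    return [fill if m else v for v, m in zip(values, mask)]


def build_convert_map(alias_inplace):
    """Build the table; optionally register the in-place name as an alias."""
    convert_map = {"aten::masked_fill": convert_masked_fill}
    if alias_inplace:
        # Assumption: the in-place variant can reuse the same converter.
        convert_map["aten::masked_fill_"] = convert_map["aten::masked_fill"]
    return convert_map


def report_missing_conversion(op_names, convert_map):
    """Raise if any operator used by the graph has no registered converter."""
    missing = sorted({name for name in op_names if name not in convert_map})
    if missing:
        raise NotImplementedError(
            "The following operators are not implemented: {}".format(missing)
        )


# The traced model uses the in-place variant.
graph_ops = ["aten::masked_fill_"]

try:
    report_missing_conversion(graph_ops, build_convert_map(alias_inplace=False))
except NotImplementedError as err:
    print(err)

# With the alias registered, the same check passes silently.
report_missing_conversion(graph_ops, build_convert_map(alias_inplace=True))
```

Whether reusing the out-of-place converter is fully correct for an in-place op inside TVM depends on how the frontend tracks the mutated tensor, so treat the alias as a starting point to test rather than a confirmed fix.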