[Question] How can TVM run a text generation model like GPT-2?

If we run text generation using GPT-2 from Hugging Face, the seq_len increases at each step because it is an autoregressive model. In other words, the model input is dynamic. Can TVM support such a text generation case?

Relax supports dynamic shapes by design, so it’s highly recommended to check it out.

Hey, thanks for your reply. Where can we find the official docs and demos for Relax, and when will we be able to use Relax on the TVM main branch?

As far as I know, Relax has not been released yet. How can I use this feature?

The script below converts GPT-2 to a TorchScript model and verifies it with both PyTorch inference and TVM inference. PyTorch inference works fine, while TVM inference raises an error.

The error KeyError: 'default_token.1' occurred when running relay.frontend.from_pytorch. I printed out the PyTorch graph below and found default_token in the graph. Do I need to add some custom op conversion? @masahi

Thanks~~

graph(%self.1 : __torch__.FinishMySentence,
      %x.1 : Tensor):
  %34 : bool = prim::Constant[value=1]()
  %31 : int = prim::Constant[value=1]()
  %29 : NoneType = prim::Constant()
  %12 : int = prim::Constant[value=9223372036854775807]() 
  %25 : int = prim::Constant[value=0]() 
  %26 : int = prim::Constant[value=-1]() 
  %default_token.1 : Tensor = prim::GetAttr[name="default_token"](%self.1)
  %eos.1 : Tensor = prim::GetAttr[name="eos"](%self.1)
  %9 : Tensor = aten::ne(%default_token.1, %eos.1)
  %11 : bool = aten::Bool(%9) 
  %sentence : Tensor = prim::Loop(%12, %11, %x.1)
    block0(%15 : int, %sentence.9 : Tensor):
      %next_token_predictor.1 : __torch__.transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel = prim::GetAttr[name="next_token_predictor"](%self.1)
      %20 : (Tensor, ((Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor))) = prim::CallMethod[name="forward"](%next_token_predictor.1, %sentence.9) # demo2.py:23:29
      %predictions.1 : Tensor, %23 : ((Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor), (Tensor, Tensor)) = prim::TupleUnpack(%20)
      %27 : Tensor = aten::select(%predictions.1, %25, %26) 
      %32 : Tensor = aten::slice(%27, %25, %29, %29, %31) 
      %token.1 : Tensor = aten::argmax(%32, %25, %34) 
      %38 : Tensor[] = prim::ListConstruct(%sentence.9, %token.1)
      %sentence0.1 : Tensor = aten::cat(%38, %25) 
      %eos : Tensor = prim::GetAttr[name="eos"](%self.1)
      %45 : Tensor = aten::ne(%token.1, %eos) 
      %47 : bool = aten::Bool(%45) 
      -> (%47, %sentence0.1)
  return (%sentence)

import torch
import numpy as np
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class FinishMySentence(torch.nn.Module):
    def __init__(self, model=None, eos=198):
        super(FinishMySentence, self).__init__()
        self.eos = torch.tensor([eos])
        self.next_token_predictor = model
        self.default_token = torch.tensor([0])

    def forward(self, x):
        sentence = x
        token = self.default_token
        while token != self.eos:
            predictions, _ = self.next_token_predictor(sentence)
            token = torch.argmax(predictions[-1, :], dim=0, keepdim=True)
            sentence = torch.cat((sentence, token), 0)

        return sentence


# Convert to scripted model
token_predictor = GPT2LMHeadModel.from_pretrained("gpt2", torchscript=True).eval()

# trace
random_tokens = torch.randint(10000, (5,))
traced_token_predictor = torch.jit.trace(token_predictor, random_tokens)
torch.jit.save(traced_token_predictor, "traced_gpt2.pt")

# script
model = FinishMySentence(model=traced_token_predictor)
scripted_model = torch.jit.script(model)
torch.jit.save(scripted_model, "scripted_gpt2.pt")


# Use PyTorch inference
sentence_fragment = "The Manhattan bridge is a major"

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
context = torch.tensor(tokenizer.encode(sentence_fragment))

# torch_out = scripted_model(context)
loaded_model = torch.jit.load("scripted_gpt2.pt").eval()
print(loaded_model.graph)
torch_out = loaded_model(context)
generated_text_torch = tokenizer.decode(torch_out)
print("Fragment: {}".format(sentence_fragment))
print("Completed: {}".format(generated_text_torch))


# Use TVM
import tvm
from tvm import relay

inputs = [("dummy_input_name", (5,))]
mod, params = relay.frontend.from_pytorch(loaded_model, inputs)
print(mod)

You should use torch.jit.trace.

Thanks for your kind reply. Sorry, I’m not that familiar with Torch, so I still can’t get the point.

Since there is a while loop in the model, I used jit.script to generate the TorchScript model named scripted_gpt2.pt.

Could you give me more hint? @masahi

OK, it seems you are trying to convert the decoding step (FinishMySentence) of text generation. This is not a learned component, so there is little point in converting it to TVM.

On the other hand, GPT2LMHeadModel should be converted to TVM without issues (what you call traced_token_predictor). If not, I can take a look.
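
To make the suggested split concrete, here is a minimal sketch, assuming a hypothetical run_lm_head callable that wraps whatever compiled forward pass you end up with (TVM or otherwise); the greedy loop itself stays in plain Python:

import torch

def generate(run_lm_head, input_ids, eos_id, max_new_tokens=20):
    """Greedy decoding kept in Python; only the LM-head forward is compiled."""
    sentence = input_ids
    for _ in range(max_new_tokens):
        logits = run_lm_head(sentence)          # (seq_len, vocab_size) for a 1-D input
        token = torch.argmax(logits[-1, :], dim=0, keepdim=True)
        sentence = torch.cat((sentence, token), 0)
        if token.item() == eos_id:
            break
    return sentence

Note that with static shapes, a TVM-compiled run_lm_head only accepts the fixed sequence length it was built for, which is exactly the dynamic-shape limitation discussed further down in this thread.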

An error occurred after changing the call to mod, params = relay.frontend.from_pytorch(traced_token_predictor, inputs, default_dtype="int64"):

The value type is tvm.relay.expr.Call, while it is expected to be a scalar or NDArray.

Traceback (most recent call last):
  File "test1.py", line 58, in <module>
    mod, params = relay.frontend.from_pytorch(traced_token_predictor, inputs)
  File "/WORK/Dev/tvm/python/tvm/relay/frontend/pytorch.py", line 4173, in from_pytorch
    outputs = converter.convert_operators(_get_operator_nodes(graph.nodes()), outputs, ret_name)
  File "/WORK/Dev/tvm/python/tvm/relay/frontend/pytorch.py", line 3547, in convert_operators
    relay_out = relay_op(
  File "/WORK/Dev/tvm/python/tvm/relay/frontend/pytorch.py", line 750, in full
    return self.full_impl(data, fill_value, dtype)
  File "/WORK/Dev/tvm/python/tvm/relay/frontend/pytorch.py", line 671, in full_impl
    out = _op.full(_expr.const(fill_value, dtype=dtype), size, dtype=dtype)
  File "/WORK/Dev/tvm/python/tvm/relay/expr.py", line 517, in const
    raise ValueError("value has to be scalar or NDArray")
ValueError: value has to be scalar or NDArray

It works for me using this script:

from tvm import relay

import torch
from transformers import GPT2LMHeadModel

token_predictor = GPT2LMHeadModel.from_pretrained("gpt2", torchscript=True).eval()

random_tokens = torch.randint(10000, (5,))
traced_token_predictor = torch.jit.trace(token_predictor, random_tokens)

inputs = [("dummy_input_name", (5,))]
mod, params = relay.frontend.from_pytorch(traced_token_predictor, inputs, default_dtype="int64")
print(mod)
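
As a follow-up usage sketch (not from the thread): once mod and params are obtained as above, the module can be built and run with the graph executor. The llvm target and the reuse of the fixed-length "dummy_input_name" input are assumptions; this runs a single forward pass for the 5-token input the model was imported with.

import tvm
from tvm.contrib import graph_executor

target = "llvm"  # CPU; any other TVM target should work the same way
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))

# Feed the same fixed-length (5-token) int64 input the model was imported with.
tokens = random_tokens.numpy().astype("int64")
module.set_input("dummy_input_name", tvm.nd.array(tokens))
module.run()
logits = module.get_output(0).numpy()  # expected shape: (5, vocab_size)
print(logits.shape)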

I updated TVM to the latest version and it works. Thanks a lot for your kind help :)

Because GPT-2 requires the input length to grow at each step, with the code above and the static shapes of current TVM (main branch) I can only run inference on a fixed sequence length. How can I solve this problem? I know Relax may solve it.

It is also possible to import the model with dynamic shape in Relay. But the performance would be extremely poor.

It is also possible to import the model with dynamic shape in Relay.

Is there some demo code for this? I’d like to give it a try.

Another question: when will the next version with Relax be released?

This is an example for ONNX; the PT frontend doesn’t support dynamic input shapes, but it wouldn’t be difficult to add such a feature.
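
For what it’s worth, a rough sketch of that ONNX route (this is not the example referred to above; the input/output names, opset version, and dynamic-axis setup are my own assumptions). Dynamic shapes in Relay go through the VM executor rather than the graph executor:

import onnx
import torch
import tvm
from tvm import relay
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2", torchscript=True).eval()
dummy = torch.randint(10000, (5,))

# Mark the sequence dimension as dynamic so the exported graph keeps it symbolic.
torch.onnx.export(
    model, dummy, "gpt2.onnx",
    input_names=["input_ids"], output_names=["logits"],
    dynamic_axes={"input_ids": {0: "seq_len"}},
    opset_version=14,
)

onnx_model = onnx.load("gpt2.onnx")
# With no fixed shape given, the symbolic seq_len is imported as a dynamic dim (Any).
mod, params = relay.frontend.from_onnx(onnx_model, freeze_params=True)

# The graph executor needs static shapes, so compile for the Relay VM instead.
with tvm.transform.PassContext(opt_level=3):
    vm_exec = relay.vm.compile(mod, target="llvm", params=params)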

Thanks for your kind help. I will give it a try. Another question: when will the next version with Relax be released?

Check out Introducing Web-LLM: Running large language model on web

Amazing! I will try it on the browser.

Hi @zhaoyang-star, how do we generate the sentence after getting mod and params? Please help with TVM.

When will TVM support dynamic shapes and dynamic-shape tuning on GPU? @tqchen