[VM] Debugging "VM encountered fatal error"

Hi, I’m looking to add support for Python lists in the Torch frontend using the Relay List ADT. However, I hit "VM encountered fatal error" whenever I use the nth function from the prelude:

It seems this error is thrown when the VM runtime encounters an unmatched case in a non-exhaustive pattern match. The nth function uses the list hd function, which fails at runtime if given an empty list:

However, I think I’m correctly supplying a sufficiently sized list from Python before execution, so I have no idea why I’m getting this error. I was able to come up with a minimal example below. This trivial program gives “VM encountered fatal error” no matter what list I give it.

fn (%states: List[Tensor[(2, 4), float32]]) -> Tensor[(2, 4), float32] {
  @nth(%states, 0 /* ty=int32 */) /* ty=Tensor[(2, 4), float32] */
}

Is there a good way to debug such issues in the VM runtime? I’m new to this part of the code base. @kevinthesun @zhiics @wweic @haichen

Could you try to create a minimal example in Relay that triggers the same bug, to determine whether it’s a bug in the frontend converter or in the VM runtime?

I can, but the minimal example above was created using the modified Torch frontend. Since the IR looks good to me, I don’t think this is a frontend issue.

Here is a Torch module that gives the above example. The type annotation below is very important.

from typing import List

import torch
from torch import Tensor, jit


class SimpleList(jit.ScriptModule):
    @jit.script_method
    def forward(self, tensor, states):
        # type: (Tensor, List[Tensor]) -> Tensor
        return states[0]

I’ll try dumping the bytecode in text format, stepping through the VM runtime main loop, etc. I have a feeling that for some reason my input list is not being correctly wired to the runtime.

Update: The following two programs work. One is the list identity function; the other creates a singleton list inside the function and accesses the list head. So the nth function by itself doesn’t seem to be the issue. The error happens when calling nth on an input list.

fn (%states: List[Tensor[(2, 4), float32]]) -> List[Tensor[(2, 4), float32]] {
  %states
}

fn (%input: Tensor[(10, 10), float32]) -> Tensor[(10, 10), float32] {
  %0 = Nil /* ty=List[Tensor[(10, 10), float32]] */;
  %1 = Cons(%input, %0) /* ty=List[Tensor[(10, 10), float32]] */;
  @nth(%1, 0 /* ty=int32 */) /* ty=Tensor[(10, 10), float32] */
}

Since the PyTorch frontend just generates normal Relay code, can you post the Relay code the frontend generated?

This is literally the Relay code the Torch frontend generates. Is this what you are asking for? Or do you want Python code, written by hand, that is equivalent to the program below?

fn (%states: List[Tensor[(2, 4), float32]]) -> Tensor[(2, 4), float32] {
  @nth(%states, 0 /* ty=int32 */) /* ty=Tensor[(2, 4), float32] */
}

Ok I did some debugging to see what’s going on.

First, the good case: If I construct a List ADT at runtime, doing nth on it is no problem.

fn (%input: Tensor[(10, 10), float32]) -> Tensor[(10, 10), float32] {
  %0 = Nil /* ty=List[Tensor[(10, 10), float32]] */;
  %1 = Cons(%input, %0) /* ty=List[Tensor[(10, 10), float32]] */;
  @nth(%1, 0 /* ty=int32 */) /* ty=Tensor[(10, 10), float32] */
}

Here is the bytecode dump and execution trace of the hd function called inside nth:

InvokeGlobal: Argument ADT with size = 2, tag = -2013265920
func.params= 1
Instr: get_tag $1 $0
Instr: load_consti $2 -2013265920
Instr: if $1 $2 1 3
Instr: get_field $3 $0[0]
Instr: goto 3
Instr: fatal
Instr: move $3 $3
Instr: ret $3
Executing(0): get_tag $1 $0
GetTag: tag = -2013265920
Executing(1): load_consti $2 -2013265920
Executing(2): if $1 $2 1 3
Executing(3): get_field $3 $0[0]
Executing(4): goto 3
Executing(7): ret $3
Executing(8): move $8 $7
Executing(9): goto 9
Executing(18): ret $8
Executing(4): ret $4

We can see that the list is correctly recognized as an ADT with 2 fields (the head and the tail of the Cons cell) and tag -2013265920. The hd function checks whether the input list has tag -2013265920 (i.e., whether it is a Cons); if not, it immediately raises the “VM encountered fatal error”. The execution trace shows that the hd function returned successfully.

And now the bad case, where I pass a list as an argument to the Relay program and try to access its 0th element.

fn (%states: List[Tensor[(2, 4), float32]]) -> Tensor[(2, 4), float32] {
  @nth(%states, 0 /* ty=int32 */) /* ty=Tensor[(2, 4), float32] */
}
InvokeGlobal: Argument ADT with size = 1, tag = 0
func.params= 1
Instr: get_tag $1 $0
Instr: load_consti $2 -2013265920
Instr: if $1 $2 1 3
Instr: get_field $3 $0[0]
Instr: goto 3
Instr: fatal
Instr: move $3 $3
Instr: ret $3
Executing(0): get_tag $1 $0
GetTag: tag = 0
Executing(1): load_consti $2 -2013265920
Executing(2): if $1 $2 1 3
Executing(5): fatal

Note that my list is wrongly recognized as an ADT with 1 field and tag 0. Inside the hd function, since tag 0 doesn’t match the expected Cons tag -2013265920, the fatal error is raised. This is why I got “VM encountered fatal error”.

This is not surprising, since the input Python list is converted to a Relay runtime ADT by the container.tuple_object(...) function, which always creates an ADT with tag 0.
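For reference, here is a tiny sketch that shows the tag tuple_object assigns (assuming a stock TVM install and that the runtime ADT exposes its tag as a property; the shape is arbitrary):

import numpy as np
import tvm
from tvm.runtime.container import tuple_object

# Wrap a Python list of NDArrays the same way the VM's argument
# conversion does; the resulting runtime ADT carries tag 0.
fields = [tvm.nd.array(np.zeros((2, 4), dtype="float32"))]
adt = tuple_object(fields)
print(adt.tag)  # 0, not the Cons tag (-2013265920) that hd expects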

Given these findings, I conclude that the Relay VM doesn’t support taking a List ADT as an argument from users, even though it can work with ADTs that are created at runtime.

Does this sound about right, and if so, is it desirable to fix this situation? I’m happy to work on it, since I need the VM to have good support for ingesting user-supplied complex ADTs for my Torch examples. Currently I’m working on supporting “Stacked LSTM”, which takes “a list of tuples of Tensors” as an argument.

@haichen @MarisaKirisame @kevinthesun @zhiics @wweic

@haichen @MarisaKirisame

I cooked up a minimal repro script that doesn’t depend on torch or frontend code. https://gist.github.com/masahi/9ac8223833d5dc1ba3e2913c34b44535
It generates the same program I’ve been talking about.

fn (%states: List[Tensor[(2, 4), float32]]) -> Tensor[(2, 4), float32] {
  @nth(%states, 0 /* ty=int32 */) /* ty=Tensor[(2, 4), float32] */
}
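Roughly, the script builds that program with the prelude and then feeds it a plain Python list of NumPy arrays, something like this (a sketch, not the exact contents of the gist):

import numpy as np
import tvm
from tvm import relay
from tvm.relay.prelude import Prelude

mod = tvm.IRModule()
p = Prelude(mod)

# main takes a Relay List of tensors and returns its 0th element via @nth
tensor_type = relay.TensorType((2, 4), "float32")
states = relay.var("states", p.l(tensor_type))
mod["main"] = relay.Function([states], p.nth(states, relay.const(0)))

ex = relay.create_executor("vm", mod=mod, ctx=tvm.cpu(0), target="llvm")
inputs = [np.random.rand(2, 4).astype("float32") for _ in range(3)]
ex.evaluate()(inputs)  # "VM encountered fatal error", even though the list is non-empty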

The ADT you feed from Python looks to me like the runtime ADT data structure that is currently used as an Array (with tag 0) at runtime. Should you feed it a prelude list instead of a Python list? For example:

import tvm
from tvm import relay
from tvm.relay.prelude import Prelude

mod = tvm.IRModule()
p = Prelude(mod)

nil = p.nil
cons = p.cons
l = p.l
a = cons(relay.const(100), nil())
f = relay.Function([], a)

Yeah, I also realized that since I’m feeding a Python list and there is no association between a Python list and the Relay List ADT, I shouldn’t expect the VM to figure out that the Python list I’m feeding is supposed to be a List ADT (it is not, even though the type says it is).

Can we send an arbitrary Relay ADT, created in Python, to the VM? It seems we can send Object https://github.com/apache/incubator-tvm/blob/9816efc2df63cf6a14a6de46dc2adfafde58acc1/python/tvm/runtime/vm.py#L32 but it’s so opaque that I don’t know what is supposed to be sent via Object.

I need to send NumPy arrays too, so I need to send a prelude list of NDArrays. But since NDArrays are a runtime thing, what should the element type of the List ADT be, and how can I stuff NumPy arrays into it? Is NDArray a TensorType?

I think sending a Python list of NumPy arrays is a reasonable thing for users to want to do, so having automatic conversion would be a good addition. Users shouldn’t have to care about prelude internals.

I may be able to wrap NDArrays in relay.const and stuff them into a List ADT. I’ll try that ASAP.

UPDATE: I tried converting a Python list of NumPy arrays into a prelude List of relay.Constant, using the function below.

from tvm import relay


def convert_to_list_adt(py_lst, prelude):
    # Build the prelude List expression Cons(e0, Cons(e1, ... Nil())) from a Python list
    adt_lst = prelude.nil()
    for arr in reversed(py_lst):
        adt_lst = prelude.cons(relay.const(arr), adt_lst)
    return adt_lst

But I got a "Downcast from relay.Call to vm.ADT failed" error from the VM RunLoop. The Call should correspond to the prelude cons call. Since Call is compile-time stuff, it shouldn’t come up in the VM runtime, right? How can I evaluate the Call and get a List ADT (the result of cons)?

Ok, I got it working by evaluating the relay.Constant list with another VM executor to get a vm.ADT object and sending it to the main function.
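Roughly like this (a sketch; the helper name is mine and the details may differ from what I actually ran):

import numpy as np
import tvm
from tvm import relay
from tvm.relay.prelude import Prelude


def eval_list_adt(py_lst):
    # Build a prelude List expression of relay.Constant, then evaluate it
    # with a separate VM executor; the result is a runtime vm.ADT that can
    # then be passed as an argument to the main function.
    mod = tvm.IRModule()
    p = Prelude(mod)
    expr = p.nil()
    for arr in reversed(py_lst):
        expr = p.cons(relay.const(arr), expr)
    mod["main"] = relay.Function([], expr)
    ex = relay.create_executor("vm", mod=mod, ctx=tvm.cpu(0), target="llvm")
    return ex.evaluate()()


states = eval_list_adt([np.zeros((2, 4), dtype="float32")] * 3)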

Is there some convert() function lying around? There should be, as calling the executor multiple times doesn’t seem right…

The _convert function is all we have that converts arguments from Python to something Relay understands.

Agreed that requiring the user to create another executor to get parameters that can be sent to the VM is not great (although I found it very cool when I got it working for the first time; at least I found the overall design of how ADTs work in the VM runtime consistent). Is there a way to create a VM ADT object “by hand”, rather than going the “prelude ADT (compile-time world) -> evaluate -> VM ADT (runtime world)” route?

You can use the runtime ADT to create an ADT object directly. It’s not necessary to use the prelude and run the VM to get it.
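Something along these lines (a sketch; it assumes the tvm.runtime.container.ADT constructor and that the prelude constructors expose their tags, and the helper name is illustrative):

import numpy as np
import tvm
from tvm.relay.prelude import Prelude
from tvm.runtime.container import ADT


def python_list_to_vm_adt(py_lst, prelude):
    # Build the runtime List directly: no compilation or extra VM run needed.
    # Each Cons cell carries two fields (head, tail) and the Cons constructor tag.
    adt = ADT(prelude.nil.tag, [])
    for arr in reversed(py_lst):
        adt = ADT(prelude.cons.tag, [tvm.nd.array(arr), adt])
    return adt


mod = tvm.IRModule()
p = Prelude(mod)
states = python_list_to_vm_adt([np.zeros((2, 4), dtype="float32")] * 3, p)
# states can now be passed straight to the VM's main function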

I made some changes to your script, and now it passes the test.

The reason we don’t convert a Python list to an ADT List is that it’s more common to use a tuple as input than the recursively defined ADT List. In Python, people usually mix tuples and lists. If we converted lists to ADT Lists, I’m afraid it might cause confusion for more people. Therefore, it’s better to require people to explicitly create the ADT object themselves, imho. But I’m open to discussion on this.

Great! It is much better. Thank you very much.

Yeah, I agree that as a way of passing parameters to the runtime, a tuple is sufficient for most uses and straightforward. But as we add support for more dynamism, we may want better support for variable-length inputs. For now I’m happy with the approach you showed of manually creating a VM ADT.
