[VM] Debugging "VM encountered fatal error"

Hi, I’m looking to add support for Python lists in the Torch frontend using the Relay List ADT. However, I hit "VM encountered fatal error" whenever I use the nth function from the prelude:

It seems this error is thrown when the VM runtime encounters an unmatched case in a non-exhaustive pattern match. The nth function uses the list hd function, which fails at runtime if given an empty list:

However, I think I’m correctly supplying a sufficiently sized list from Python before execution, so I have no idea why I’m getting this error. I was able to come up with a minimal example below. This trivial program gives “VM encountered fatal error” no matter what list I give it.

fn (%states: List[Tensor[(2, 4), float32]]) -> Tensor[(2, 4), float32] {
  @nth(%states, 0 /* ty=int32 */) /* ty=Tensor[(2, 4), float32] */
}

Is there a good way to debug such issues in the VM runtime? I’m new to this part of the code base. @kevinthesun @zhiics @wweic @haichen

Could you try to create a minimal example in Relay that triggers the same bug, to determine whether it’s a bug in the frontend converter or in the VM runtime?

I can, but the minimal example above was created using the modified Torch frontend. Since the IR looks good to me, I don’t think this is a frontend issue.

Here is a Torch module that gives the above example. The type annotation below is very important.

from typing import List

import torch
from torch import Tensor, jit


class SimpleList(jit.ScriptModule):
    @jit.script_method
    def forward(self, tensor, states):
        # type: (Tensor, List[Tensor]) -> Tensor
        return states[0]

I’ll try dumping the bytecode in text format, stepping through the VM runtime main loop, etc. I have a feeling that for some reason my input list is not being correctly wired to the runtime.

Update: The following two programs work. One is the list identity function; the other creates a singleton list inside the function and accesses the list head. So the nth function by itself doesn’t seem to be the issue. The error happens when calling nth on an input list.

fn (%states: List[Tensor[(2, 4), float32]]) -> List[Tensor[(2, 4), float32]] {
  %states
}

fn (%input: Tensor[(10, 10), float32]) -> Tensor[(10, 10), float32] {
  %0 = Nil /* ty=List[Tensor[(10, 10), float32]] */;
  %1 = Cons(%input, %0) /* ty=List[Tensor[(10, 10), float32]] */;
  @nth(%1, 0 /* ty=int32 */) /* ty=Tensor[(10, 10), float32] */
}

Since the PyTorch frontend just generates normal Relay code, can you post the Relay code the frontend generated?

This is literally the Relay code the Torch frontend generates. Is this what you are asking for? Or do you want Python code, written by hand, that is equivalent to the program below?

fn (%states: List[Tensor[(2, 4), float32]]) -> Tensor[(2, 4), float32] {
  @nth(%states, 0 /* ty=int32 */) /* ty=Tensor[(2, 4), float32] */
}

Ok I did some debugging to see what’s going on.

First, the good case: If I construct a List ADT at runtime, doing nth on it is no problem.

fn (%input: Tensor[(10, 10), float32]) -> Tensor[(10, 10), float32] {
  %0 = Nil /* ty=List[Tensor[(10, 10), float32]] */;
  %1 = Cons(%input, %0) /* ty=List[Tensor[(10, 10), float32]] */;
  @nth(%1, 0 /* ty=int32 */) /* ty=Tensor[(10, 10), float32] */
}

Here is the bytecode dump and execution trace of the hd function called inside nth:

InvokeGlobal: Argument ADT with size = 2, tag = -2013265920
func.params= 1
Instr: get_tag $1 $0
Instr: load_consti $2 -2013265920
Instr: if $1 $2 1 3
Instr: get_field $3 $0[0]
Instr: goto 3
Instr: fatal
Instr: move $3 $3
Instr: ret $3
Executing(0): get_tag $1 $0
GetTag: tag = -2013265920
Executing(1): load_consti $2 -2013265920
Executing(2): if $1 $2 1 3
Executing(3): get_field $3 $0[0]
Executing(4): goto 3
Executing(7): ret $3
Executing(8): move $8 $7
Executing(9): goto 9
Executing(18): ret $8
Executing(4): ret $4

We can see that the list is correctly recognized as an ADT with 2 fields (the head and the tail of the Cons cell) and tag -2013265920. The hd function checks whether the input list has tag -2013265920 (i.e., whether it is a Cons); if not, it immediately raises the “VM encountered fatal error”. The execution trace shows that the hd function returned successfully.

And now the bad case, where I pass a list as an argument to the Relay program and try to access its 0th element.

fn (%states: List[Tensor[(2, 4), float32]]) -> Tensor[(2, 4), float32] {
  @nth(%states, 0 /* ty=int32 */) /* ty=Tensor[(2, 4), float32] */
}
InvokeGlobal: Argument ADT with size = 1, tag = 0
func.params= 1
Instr: get_tag $1 $0
Instr: load_consti $2 -2013265920
Instr: if $1 $2 1 3
Instr: get_field $3 $0[0]
Instr: goto 3
Instr: fatal
Instr: move $3 $3
Instr: ret $3
Executing(0): get_tag $1 $0
GetTag: tag = 0
Executing(1): load_consti $2 -2013265920
Executing(2): if $1 $2 1 3
Executing(5): fatal

Note that my list is wrongly recognized as an ADT with 1 field and tag 0. Inside the hd function, since tag 0 doesn’t match the expected Cons tag -2013265920, the fatal error is raised. This is why I got “VM encountered fatal error”.

This is not surprising, since the input Python list is converted to a Relay runtime ADT by the container.tuple_object(...) function, which always creates an ADT with tag 0.
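For reference, here is a tiny sketch that shows the tag tuple_object assigns (assuming a stock TVM install and that the runtime ADT exposes its tag as a property; the shape is arbitrary):

import numpy as np
import tvm
from tvm.runtime.container import tuple_object

# Wrap a Python list of NDArrays the same way the VM's argument
# conversion does; the resulting runtime ADT carries tag 0.
fields = [tvm.nd.array(np.zeros((2, 4), dtype="float32"))]
adt = tuple_object(fields)
print(adt.tag)  # 0, not the Cons tag (-2013265920) that hd expects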

Given these findings, I conclude that the Relay VM doesn’t support taking a List ADT as an argument from users, even though it can work with ADTs that are created at runtime.

Does this sound about right, and if so, is it desirable to fix this situation? I’m happy to work on it, since I need the VM to have good support for ingesting user-supplied complex ADTs for my Torch examples. Currently I’m working on supporting “Stacked LSTM”, which takes “a list of tuples of Tensors” as an argument.

@haichen @MarisaKirisame @kevinthesun @zhiics @wweic

@haichen @MarisaKirisame

I cooked up a minimal repro script that doesn’t depend on torch or frontend code. https://gist.github.com/masahi/9ac8223833d5dc1ba3e2913c34b44535
It generates the same program I’ve been talking about.

fn (%states: List[Tensor[(2, 4), float32]]) -> Tensor[(2, 4), float32] {
  @nth(%states, 0 /* ty=int32 */) /* ty=Tensor[(2, 4), float32] */
}
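Roughly, the script builds that program with the prelude and then feeds it a plain Python list of NumPy arrays, something like this (a sketch, not the exact contents of the gist):

import numpy as np
import tvm
from tvm import relay
from tvm.relay.prelude import Prelude

mod = tvm.IRModule()
p = Prelude(mod)

# main takes a Relay List of tensors and returns its 0th element via @nth
tensor_type = relay.TensorType((2, 4), "float32")
states = relay.var("states", p.l(tensor_type))
mod["main"] = relay.Function([states], p.nth(states, relay.const(0)))

ex = relay.create_executor("vm", mod=mod, ctx=tvm.cpu(0), target="llvm")
inputs = [np.random.rand(2, 4).astype("float32") for _ in range(3)]
ex.evaluate()(inputs)  # "VM encountered fatal error", even though the list is non-empty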

The ADT you feed from Python looks to me like the runtime ADT data structure that is currently used as an Array (with tag 0) at runtime. Should you feed it a prelude list instead of a Python list? For example:

import tvm
from tvm import relay
from tvm.relay.prelude import Prelude

mod = tvm.IRModule()
p = Prelude(mod)

nil = p.nil
cons = p.cons
l = p.l
a = cons(relay.const(100), nil())
f = relay.Function([], a)

Yeah, I also realized that since I’m feeding a Python list and there is no association between a Python list and the Relay List ADT, I shouldn’t expect the VM to figure out that the Python list I’m feeding is supposed to be a List ADT (it is not, even though the type says it is).

Can we send an arbitrary Relay ADT, created in Python, to the VM? It seems we can send Object https://github.com/apache/incubator-tvm/blob/9816efc2df63cf6a14a6de46dc2adfafde58acc1/python/tvm/runtime/vm.py#L32 but it’s so opaque that I don’t know what is supposed to be sent via Object.

I need to send NumPy arrays too, so I need to send a prelude list of NDArrays. But since NDArrays are a runtime thing, what should the element type of the List ADT be, and how can I stuff NumPy arrays into it? Is NDArray a TensorType?

I think sending a Python list of NumPy arrays is a reasonable thing for users to want to do, so having automatic conversion would be a good addition. Users shouldn’t have to care about prelude internals.

I may be able to wrap NDArrays in relay.const and stuff them into a List ADT. I’ll try that ASAP.

UPDATE: I tried converting a Python list of NumPy arrays into a prelude List of relay.Constant, using the function below.

from tvm import relay


def convert_to_list_adt(py_lst, prelude):
    # Build the prelude List expression Cons(e0, Cons(e1, ... Nil())) from a Python list
    adt_lst = prelude.nil()
    for arr in reversed(py_lst):
        adt_lst = prelude.cons(relay.const(arr), adt_lst)
    return adt_lst

But I got a "Downcast from relay.Call to vm.ADT failed" error from the VM RunLoop. The Call should correspond to the prelude cons call. Since Call is compile-time stuff, it shouldn’t come up in the VM runtime, right? How can I evaluate the Call and get a List ADT (the result of cons)?

Ok, I got it working by evaluating the relay.Constant list with another VM executor to get a vm.ADT object and sending it to the main function.
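Roughly like this (a sketch; the helper name is mine and the details may differ from what I actually ran):

import numpy as np
import tvm
from tvm import relay
from tvm.relay.prelude import Prelude


def eval_list_adt(py_lst):
    # Build a prelude List expression of relay.Constant, then evaluate it
    # with a separate VM executor; the result is a runtime vm.ADT that can
    # then be passed as an argument to the main function.
    mod = tvm.IRModule()
    p = Prelude(mod)
    expr = p.nil()
    for arr in reversed(py_lst):
        expr = p.cons(relay.const(arr), expr)
    mod["main"] = relay.Function([], expr)
    ex = relay.create_executor("vm", mod=mod, ctx=tvm.cpu(0), target="llvm")
    return ex.evaluate()()


states = eval_list_adt([np.zeros((2, 4), dtype="float32")] * 3)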

Is there some convert() function lying around? There should be, as calling the executor multiple times doesn’t seem right…

The _convert function is all we have that converts arguments from Python to something Relay understands.

Agreed that requiring the user to create another executor to get parameters that can be sent to the VM is not great (although I found it very cool when I got it working for the first time; at least I found the overall design of how ADTs work in the VM runtime consistent). Is there a way to create a VM ADT object “by hand”, rather than going the “prelude ADT (compile-time world) -> evaluate -> VM ADT (runtime world)” route?

You can use the runtime ADT to create an ADT object directly. It’s not necessary to use the prelude and run the VM to get it.
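Something along these lines (a sketch; it assumes the tvm.runtime.container.ADT constructor and that the prelude constructors expose their tags, and the helper name is illustrative):

import numpy as np
import tvm
from tvm.relay.prelude import Prelude
from tvm.runtime.container import ADT


def python_list_to_vm_adt(py_lst, prelude):
    # Build the runtime List directly: no compilation or extra VM run needed.
    # Each Cons cell carries two fields (head, tail) and the Cons constructor tag.
    adt = ADT(prelude.nil.tag, [])
    for arr in reversed(py_lst):
        adt = ADT(prelude.cons.tag, [tvm.nd.array(arr), adt])
    return adt


mod = tvm.IRModule()
p = Prelude(mod)
states = python_list_to_vm_adt([np.zeros((2, 4), dtype="float32")] * 3, p)
# states can now be passed straight to the VM's main function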

I made some changes to your script, and now it passes the test.

The reason we don’t convert a Python list to an ADT List is that it’s more common to use a tuple as input than the recursively defined ADT List. In Python, people usually mix tuples and lists. If we converted lists to ADT Lists, I’m afraid it might cause confusion for more people. Therefore, it’s better to require people to explicitly create the ADT object themselves, imho. But I’m open to discussion on this.

Great! It is much better. Thank you very much.

Yeah, I agree that as a way of passing parameters to the runtime, a tuple is sufficient for most uses and straightforward. But as we add support for more dynamism, we may want better support for variable-length inputs. For now I’m happy with the approach you showed of manually creating a VM ADT.
