[RFC] ICHECK and CHECK conventions and error messages

Attached an actionable version per our offline discussion. Co-authored with @electriclilies.

[RFC] Improvements to Error Reporting

Error handling is generally important in software development. Without proper error reporting, the software stack is less approachable to community new-comers and maintainers.

Good and Bad Error Messages

Errors are designed to help developers understand what went wrong. There are three parts of every good error message:

A1. The immediate issue: index out of bound, type error, etc.

A2. The context of the error: wrong inputs, operator not registered, internal compiler error like a pass is wrong.

A3. How to reproduce it: the input IR and the name of the pass which errors out.

Examples. Less-informative error messages.

  1. Deep inside a pass, it throws IndexError: Indexing 3 but the array length is 2. It does reveal an IndexError (A1), but it gives zero evidence of A2 or A3 , because the place is nested too deeply to get a concrete context of what TVM is doing.
  2. When calling a packed function, the type checker throws a type error (A1). However, we have no idea what the function is, or why this function is called. (A2, A3)
  3. In a recursive visitor, we received a LOG(FATAL) in a stack of length 5000 (A1). However, we have no idea what is being visited (A3), and which span of the IR causes the error (A2).

Examples. Informative error messages.

  1. An IndexError is encountered: indexing 3 but the array length is 2 (A1) in the TIR pass BufferFlatten (A2). The input TIR is: . Please use to reproduce the error (A3).
  2. In the compile engine (A2), when lowering Relay to TE, a ScheduleError is encountered (A1). The Relay IR is: . Please use to reproduce the error (A3).
  3. An unexpected internal compiler (A1, A2) error is encountered. The input TIR causing the error is: (A3).

Proposed Changes

Right now, most errors in TVM do not provide context and instructions to reproduce the error (A2 and A3). There are a few reasons for this. Firstly, TVM is an open source project, and so enforcing coding conventions is difficult. Currently, we do not have a policy requiring developers to write error messages. Second, a lot of errors occur far away from the source of the actual error, which prevents developers from providing the appropriate context and and reproducibility information.

To remedy this, we propose two changes:

P1. Policy: Requiring error messages for all CHECKs, and suggesting error messages for ICHECKs.

As a reminder, CHECKs are meant for user-facing errors, while ICHECKs check internal compiler invariants.

We can require developers to submit more detailed error messages. However, requiring developers to write error messages for every trivial ICHECK could deter them from checking things that they should (e.g., nullptr checks for things that are unlikely to happen). To find a balance, we suggest

  • ICHECK: We suggest but do not require every ICHECK to have error messages. Adding error messages are definitely beneficial.
  • CHECK: We require every CHECK to have error messages
  • Error type: We require every error message to have an error type. For example, the error type of CHECK(cond) << "IndexError: Array index out of bound"... will be parsed into IndexError
  • If a check is inside a boundary that provides a minimal reproducible example, then we can directly use ICHECK and not to provide error message.
  • In general, nullability checks can be ICHECK and not to provide error message.

P2. Error infrastructure: boundary checking

We cannot expect developers to always write error messages containing A2 and A3, because errors often occur locally, and they most likely cannot provide the pass context and instructions to reproduce. Additionally, if the pass context were to change, we would have to modify many error messages to keep them correct.

To remedy this problem, we propose adding infrastructure between important boundaries like passes that reports errors occurring inside and providing context and instructions to reproduce:

  • Outside each pass, we need a clear boundary that catches all the exceptions happens inside the pass. If it happens, we should add the pass’s name and input in the error message.
  • In compile engine, when lowering to TE compute/schedule, we need a clear boundary that prints out which subgraph of Relay and which operator’s schedule causes the problem, so that it could ease our debugging.
  • Other examples include: importers/parsers to Relay; TOPI operators; code generation modules; etc.

The boundary will help provide information about how to reproduce, (A3) and potentially provides a useful context (A2).

2 Likes