[RFC] ICHECK and CHECK conventions and error messages

tkonolige · December 17, 2020, 8:03pm

The goal of this RFC is to codify how we would like to write error messages and when we want to use them. The results of this discussion will be included into https://tvm.apache.org/docs/contribute/error_handling.html. Below I’ve included what I think the document should look like. Please give feedback on the content, the wording can be improved when committing.

My Proposal

ICHECK vs CHECK

TVM’s C++ codebase contains two different types of assert statements: ICHECK and CHECK. ICHECK should be used in places where an internal compiler invariant was violated. ICHECK will print a stack trace along with an error message asking users to report this internal compiler error to the TVM github or discuss. CHECK is used in places where the error is the users fault. For example, when the input shapes were incorrect or when data type do not match. These errors are displayed in a friendly manner to the user.

Writing good error messages

ICHECK and CHECK must always contain a useful error message. Explicit error messages make it easier for both developers and users to understand what went wrong. This error message should be actionable. Whoever encounters an error should be able to take steps to fix the error based off of the error message. For example, if a function requires an integer parameter, instead of saying “unexpected float” when a floating point value is passed, try saying “MyFunc requires an integer argument, but a floating point one was provided”. Error messages should contain all useful context that is available. For example, “Expected float, got int”, say “MyFunc expects an int argument, but a float was provided” (the new error message now include the function name). If working with TIR/TE/Relay nodes, you should include the span information from these nodes in the error message to help the user locate the error in their code.

Here are my reasons for requiring messages on all assert statements:

Error messages serve as both internal (for developers) and external (for users) documentation. Error messages are like an improved comment.
The current codebase lacks error messages in a lot of places. Requiring them on all incoming commits helps build a culture of documenting the code base.
Adding an error message requires little time commitment vs the amount of benefit it produces.

Please give feedback on what other guidance we should add for writing error messages.

@tqchen @junrushao @jroesch @ziheng @hogepodge

tqchen · December 17, 2020, 8:34pm

First of all, I agree in the general that it is important to write clear error messages for user facing cases.

Here are the things that i think we all agree on

For user facing errors (e.g. input dimension of matmul should be 2), CHECK should be used, with a clear error message.
For internal error that may not be obvious from the context, some level of error message and comment would be helpful.

One thing that I think worth discussing further is what to do with invariant checks that are less likely to hit to the user side. In many cases, we encourage developers to write these ICHECKs when they see fit, most as a habit to improve code robustness(akin to assert), but not necessarily want to force them to think about a clear error message for each of the instance.

Not all checks are equal to each other and it is important to take the context into consideration. In the particular case of ICHECK, where the invariance is less likely going to hit the user side, I think it is OK to give developers the flexibility to write short ICHECK if they prefer, while enforcing more clear error messages on places that matters.

We can find many similar examples(like the ones below) in other large code base like TensorFlow and PyTorch, doing so doesn’t make pytorch or tensorflow not product ready.

We can make a similar analogy to adding code comments to the codebase. While we could have add code comment for every function and every code we wrote, we prioritize on the interface functions and those that requires more clarifications during the code review.

Of course this does not mean that we should not add error messages to the checks that could cause user facing errors. We should actually shift our attention to address these checks quickly, while giving flexibility to the developers in other cases that have minimum gain.

In summary, I am fully supportive of adding informative error messages especially for those that are obviously user facing, but would argue we should enforce an error message for every internal invariance checks to strike a balance.

In short, the context matters. We need to strike a balance between the areas we can prioitize on. The reviewers who shouldread should through the PRs, and provide these suggestions holistically, based on the specific context, the readability, and how close the check is to the user facing. For user facing checks for sure an informative error messages are needed.

giuseros · December 18, 2020, 10:26am

Hi @tkonolige, @tqchen,

My 2c. When developing in TIR, being able to read a description of the error is always beneficial: it is like giving context to the lines around before error.

It doesn’t have to be an extensive explanation, but also a very simple message might save 2/3 minutes of reading around to understand the nature of the error.

Consider that during a typical development day, I can hit ICHECKs tens of times, so at the end of the day, it might have an impact on productivity.

Also, since I am talking about simple messages to describe the errors, I don’t think this would be too onerous on the developer, as it is like writing a short comment to explain what’s going on.

junrushao · December 22, 2020, 9:15pm

Attached an actionable version per our offline discussion. Co-authored with @electriclilies.

[RFC] Improvements to Error Reporting

Error handling is generally important in software development. Without proper error reporting, the software stack is less approachable to community new-comers and maintainers.

Good and Bad Error Messages

Errors are designed to help developers understand what went wrong. There are three parts of every good error message:

A1. The immediate issue: index out of bound, type error, etc.

A2. The context of the error: wrong inputs, operator not registered, internal compiler error like a pass is wrong.

A3. How to reproduce it: the input IR and the name of the pass which errors out.

Examples. Less-informative error messages.

Deep inside a pass, it throws IndexError: Indexing 3 but the array length is 2. It does reveal an IndexError (A1), but it gives zero evidence of A2 or A3 , because the place is nested too deeply to get a concrete context of what TVM is doing.
When calling a packed function, the type checker throws a type error (A1). However, we have no idea what the function is, or why this function is called. (A2, A3)
In a recursive visitor, we received a LOG(FATAL) in a stack of length 5000 (A1). However, we have no idea what is being visited (A3), and which span of the IR causes the error (A2).

Examples. Informative error messages.

An IndexError is encountered: indexing 3 but the array length is 2 (A1) in the TIR pass BufferFlatten (A2). The input TIR is: . Please use to reproduce the error (A3).
In the compile engine (A2), when lowering Relay to TE, a ScheduleError is encountered (A1). The Relay IR is: . Please use to reproduce the error (A3).
An unexpected internal compiler (A1, A2) error is encountered. The input TIR causing the error is: (A3).

Proposed Changes

Right now, most errors in TVM do not provide context and instructions to reproduce the error (A2 and A3). There are a few reasons for this. Firstly, TVM is an open source project, and so enforcing coding conventions is difficult. Currently, we do not have a policy requiring developers to write error messages. Second, a lot of errors occur far away from the source of the actual error, which prevents developers from providing the appropriate context and and reproducibility information.

To remedy this, we propose two changes:

P1. Policy: Requiring error messages for all CHECKs, and suggesting error messages for ICHECKs.

As a reminder, CHECKs are meant for user-facing errors, while ICHECKs check internal compiler invariants.

We can require developers to submit more detailed error messages. However, requiring developers to write error messages for every trivial ICHECK could deter them from checking things that they should (e.g., nullptr checks for things that are unlikely to happen). To find a balance, we suggest

ICHECK: We suggest but do not require every ICHECK to have error messages. Adding error messages are definitely beneficial.
CHECK: We require every CHECK to have error messages
Error type: We require every error message to have an error type. For example, the error type of CHECK(cond) << "IndexError: Array index out of bound"... will be parsed into IndexError
If a check is inside a boundary that provides a minimal reproducible example, then we can directly use ICHECK and not to provide error message.
In general, nullability checks can be ICHECK and not to provide error message.

P2. Error infrastructure: boundary checking

We cannot expect developers to always write error messages containing A2 and A3, because errors often occur locally, and they most likely cannot provide the pass context and instructions to reproduce. Additionally, if the pass context were to change, we would have to modify many error messages to keep them correct.

To remedy this problem, we propose adding infrastructure between important boundaries like passes that reports errors occurring inside and providing context and instructions to reproduce:

Outside each pass, we need a clear boundary that catches all the exceptions happens inside the pass. If it happens, we should add the pass’s name and input in the error message.
In compile engine, when lowering to TE compute/schedule, we need a clear boundary that prints out which subgraph of Relay and which operator’s schedule causes the problem, so that it could ease our debugging.
Other examples include: importers/parsers to Relay; TOPI operators; code generation modules; etc.

The boundary will help provide information about how to reproduce, (A3) and potentially provides a useful context (A2).