[RFC] Meta-RFC: 3-pronged plan for improving error messages in TVM

jroesch · July 7, 2020, 12:13am

Meta-RFC

This RFC proposes that we introduce 3 new RFCs for handling errors in TVM. After talking with many stakeholders in the community and much of the issues we’ve had onboarding new people to TVM at OctoML I have identified three critical areas in which I believe we can improve TVM error reporting. I want to solicit help from the community in driving these three directions forward and coming up with a design that makes current and new contributors and users happy.

User visible errors vs. internal errors.

Currently the TVM codebase makes heavy use of the CHECK family of macros in order to enforce and report errors. Although easy to use these provide a subpar user experience for most users showing large stack traces and often inscrutable errors. The first direction is introducing a separation between user visible errors, and internal errors.

It is my belief that most of our current use of check today should be migrated over to a family of macros which I will call ICHECK for internal check. These macros (like other compilers) should provide a user friendly error message that an internal error has occurred in the compiler and they should report it to the discussion forum or issue tracker. We should leave “true” errors alone, and migrate them to the second or third part of my proposal, and over time remove the use of raw checks as our default error strategy.

For example see a Rust program which triggers and internal error.

   Compiling playground v0.0.1 (/playground)
error[E0425]: cannot find value `rust` in this scope
 --> src/main.rs:2:11
  |
2 |     break rust;
  |           ^^^^ not found in this scope

error[E0268]: `break` outside of a loop
 --> src/main.rs:2:5
  |
2 |     break rust;
  |     ^^^^^^^^^^ cannot `break` outside of a loop

error: internal compiler error: It looks like you're trying to break rust; would you like some ICE?

note: the compiler expectedly panicked. this is a feature.

note: we would appreciate a joke overview: https://github.com/rust-lang/rust/issues/43162#issuecomment-320764675

note: rustc 1.44.1 (c7087fe00 2020-06-17) running on x86_64-unknown-linux-gnu

error: aborting due to 3 previous errors

Some errors have detailed explanations: E0268, E0425.
For more information about an error, try `rustc --explain E0268`.
error: could not compile `playground`.

To learn more, run the command again with --verbose.

I believe we should simply change the family of check macros to raise a new exception type which can be later caught and rendered appropriately.

Explicit and precise error reporting on program text

My second suggestion is we revive my previous efforts to provide precise program error reporting. I have begun work on this and will open a longer form RFC focused on this effort next. The goal of this effort is to bring traditional compiler error reporting to TVM, enabling line by line errors to be rendered with potentially more diagnostic information.

example

We will support two modes, one where source information is explicitly attached so that we may report errors back to Python or another input language, for use in frameworks such as MxNet or future tools. The other will use both our pretty printing and parsing infrastructure which I have been recently working on in order to produce a backing piece of source code to render errors against.

To make this useful we will need to convert much of the current error handling to instead use the new diagnostic machinery, thus replacing most of the remaining check calls after first moving to ICHECK.

My plan is to first do this for Relay, then TIR thus enabling this functionality for all of the unified IR.

Error handling callbacks in generated code.

My final suggestion is that we emit special symbols in the generated code which allows the functions to trap into more user friendly error handling when an in input invariant is not met, instead of inscrutable arg1.ndim does not equal n messages we can produce structured messages, or possible embed compile time information. This feature needs help being designed an implemented and if anyone is interested this would be a great starting point in TVM.

Next Steps

In order to drive this effort forward I would like to first collect a large variety of error messages across TVM, and draft some people for driving the RFCs for step 1 and 3 as this is a lot of work. My plan is to try and upstream machinery for doing 2 based on our discussion here, and then get help converting existing passes and errors to use this machinery.

Look forward to hearing everyone’s thoughts!

Jared

tqchen · July 7, 2020, 2:32am

I like the idea to introduce ICHECK, would be interesting to hear about others’ thoughts able the naming convention, e.g. shall we go with ICHECK/ILOG(FATAL) for internal?

xqdan · July 7, 2020, 5:10am

One big issue is we often lost when calling between C++ and python via call packed function, do we have any idea to improve it?

ANSHUMAN.TRIPATHY · July 8, 2020, 3:49pm

@jroesch: Thanks a lot for the RFC! This is a huge and ground breaking work indeed. Been expecting such feature in TVM quite some time from a compiler perspective.

Below are some thoughts if those are helpful:

In comparison to Debug or Error logs, Info logs also stands to help user. So may be we encourage members to provide more functional logging rather than simple check messages. It helps user to understand the internals of compilation and control it accordingly.

From Relay programming perspective, we need to have standard error codes and messaging. Similar to what you have shared in your rust example.

We should support verbose mode with file logging as well.

Thanks!

comaniac · July 9, 2020, 12:02am

Thanks for the RFC! I like the idea of ICHECK and I believe this will definitely help improve the user experience.

In addition, as others have already mentioned, would we also consider a more powerful logging system as well? Currently we only have LOG(INFO), LOG(WARNING), and LOG(FATAL). However, LOG(WARNING) and LOG(INFO) are at the save level, meaning that we cannot turn off LOG(INFO) and only keep LOG(WARNING). Shall we introduce at least 4 logging levels (i.e., DEBUG, INFO, WARNING, FATAL) so that we can better manage the output messages?

Along with the above concern, it seems like ICHECK could be a macro of “throwing LOG(FATAL) if check failed”.

kparzysz · July 9, 2020, 4:08pm

Internal errors and user errors differ fundamentally in that user errors can be corrected by the user, while internal errors cannot. There are still ways to make TVM crash by giving it invalid input, and all of those failed CHECKs should be converted to user-friendly messages that help the user fix the problem.

On the other hand, internal errors are consequences of a problem in the TVM code itself, and there is not much that the user can do about it. From a compiler developer’s point of view, an ideal bug report is one that shows how to reproduce the problem. In the absence of that, the more information the better. This is why showing stack traces or other internal information is highly relevant and useful, since the user can attach this information to the bug report.

junrushao · July 9, 2020, 6:25pm

Thank you for the RFC, Jared! I like the idea of having ICHECK as internal checking, to make sure there is clear separation between users’ error and TVM’s internal error.

I also like the point brought up by @kparzysz that “an ideal bug report is one that shows how to reproduce the problem”.

What I did in target string parsing is that we have a big try/catch, if a parsing error happens, whatever it is, we add the target string inputs to the error message and re-throw. This way we make sure the wrong user’s inputs are visible in the error message so that it is easier to understand and fix.

I am thinking about if we can generalize the idea and make sure

there is clear boundary between users’ error and TVM’s internal error: imagine we have a check pass for user’s error, after this pass, all errors are internal ones;
once it detects a user’s error, print the exact place and reason where and why it is wrong, and potentially offer some suggestions for correction;
if it is an internal error, it will be great if we can get back to user’s input and print it out as a reproducible example (it is not always possible because sometimes it is mutated), or ask users to provide a minimal reproducible example instead.

nuyensm · July 9, 2020, 10:05pm

@jroesch Thanks for this RFC. This will be a great usability improvement for TVM. One other feature that I would like to be included in this overhaul is the ability to not print out the whole Relay program when printing out errors on specific lines. It is not always desirable to expose that information, and it can get in the way of finding the errors.

Thanks

mbaret · July 9, 2020, 10:38pm

Are there any thoughts on what we could do to standardise the messages themselves, maybe with error codes? This would help if we’re interested in localisation.

jroesch · July 22, 2020, 10:18pm

Yeah great point, the original Relay error reporting was meant to be a temporary solution but I have mostly been working alone on error reporting the last few years and I have found some time and energy to work on it again

jroesch · July 22, 2020, 10:18pm

I was going to evolve the diagnostic context to take an error code and map this back to a table of messages. I would love some ideas on how to best design this mechanism.

jroesch · July 22, 2020, 10:20pm

Yeah I would love to also improve logging, for example using a modern structured logging approach would be a huge improvement from what we have today. It was unclear from my research what is the right way to do this in C++. Python has decent structured logging support, but I’m not sure how to best do it in C++.