Improving logging on Hexagon with proposed changes to LogMessageImpl

We occasionally run into strange crashes that are hard to debug. Unfortunately, the verbosity of logs on Hexagon makes them difficult to wade through, and sometimes causes log data to be dropped. For example, here’s a trace from a recent crash: Example TVM on Hexagon crash · GitHub

There’s several problems here. First, there is a lot of RPC debugging messages. The Hexagon RPC mechanism has been in use for quite a while now, so we probably don’t need in-depth logging most of the time. It also drowns out finding useful messages, and might even cause the logging buffers to fill up, causing data to be dropped. For example, in the last line of that gist, ./libhexagon_rpc.so is truncated as ./libhexagon_r and overwritten with dashes.

Another issue is that Hexagon’s FARF (logging) API automatically adds line numbers. When forwarding TVM logs to FARF, this results in overly verbose messages, with FARF showing the line number where custom Hexagon logging is implemented (in hexagon_common.cc), while the actual log gets split into multiple FARF log lines – one for the actual line number, and one for the actual message.

You can see this a lot in the gist, but I’ll repeat a couple of lines here:

10-10 23:07:27.397 10835 10839 V adsprpc : CDSP:/home/csullivan/projects/tvm/src/runtime/hexagon/hexagon_common.cc:73:0x32167:25605: INFO: /home/csullivan/projects/tvm/src/runtime/hexagon/rpc/hexagon/rpc_server.cc:77:
10-10 23:07:27.397 10835 10839 V adsprpc : CDSP:/home/csullivan/projects/tvm/src/runtime/hexagon/hexagon_common.cc:76:0x32167:25605: INFO: HexagonIOHandler PosixWrite called, write_len_bytes(4)

Finally, the current Hexagon logging implementation uses HEXAGON_PRINT(ALWAYS, ... regardless of the TVM LOG() level, making it impossible to filter out lower-level messages. This also presumably has a big impact on performance.

Unfortunately, LogMessageImpl doesn’t currently take the TVM log level, making it impossible to forward to the Hexagon logging framework. I have a draft PR that extends LogMessageImpl to take the TVM log level: https://github.com/apache/tvm/pull/13072 However, this seems like a big enough change to get some feedback on.

While working through this PR, several things came to mind:

  • Does it still make sense to have a separate LogMessage and LogFatal? LogFatal usually ends up doing the same thing LogMessage does, followed by something like an abort(). Now that LogMessage takes a level, it seems like these functions can be merged.

  • TVM doesn’t really have log levels per seINFO, WARNING, ERROR, and FATAL are just used to construct the string of the macro to call (e.g., (#define LOG(level) LOG_##level, followed by a bunch of different LOG_* macros that mostly do the same thing. These redundant macros could be removed if these levels actually had values to pass into LogMessage.

  • Meanwhile, metaschedule does have explicit log levels for PyLogMessage, which includes the extra DEBUG level, which is standard in Python but does not exist elsewhere in TVM at the moment. (There’s also the CLEAR level, which seems to be a hack, so I’m ignoring it for now.)

  • Each platform has its own logging levels. For example, Hexagon has LOW, MEDIUM, HIGH, ERROR, FATAL, and ALWAYS. Android has VERBOSE, DEBUG, INFO, INFO, WARN, ERROR, and FATAL. How do we reconcile these? It seems like mapping Hexagon LOW to DEBUG mostly keeps everything within a reasonable common range between LOW and FATAL.

The PR also uses the Hexagon HAP_debug_runtime API, and while the headers say you should use the FARF macros instead, it seems appropriate to use this lower-level API when combining integrating Hexagon logging with another logging framework. This enables you to adjust the desired logging level at runtime. According to the Hexagon SDK, runtime log filtering has negligible overhead for discarded messages. With this change, I made Hexagon RPC log at the DEBUG level and disabled collection of those logs by adjusting the flags in tvm_rpc_android.farf. The logs are now MUCH nicer. For example:

04-22 19:04:16.232 20684 20688 V adsprpc : rpc_server.cc:199:0xd379:8709: ERROR: ProcessOnePacket failed
04-22 19:04:16.233 20684 20688 V adsprpc : rpc_server.cc:290:0xd379:8709: ERROR: hexagon_rpc_send failed, written_size (-1) != dataLen (12)

I also took this opportunity to pipe the TVM log level to Android. Previously, TVM on Android used to log everything in logcat at the DEBUG level. Now the TVM log level is mapped to the corresponding Android log level and sent to logcat.

Thoughts on this approach?

You might be interested in VLOG too. Seems like it might fit your use case better. Debugging TVM — tvm 0.11.dev0 documentation

Hexagon doesn’t have the concept of environment variables, so VLOG won’t work as-is. Perhaps we could set the VLOG spec through an RPC call. I was just surprised that the LOG level doesn’t seem to do much of anything unless it’s FATAL. Fixing the the file name/line number issue is what caused me to pursue this route, but that’s an orthogonal problem.

Would LogMessage call abort() given the FATAL log level? If so there may be some concern with unintended consequences when calling LogMessage() which I would never expect to call abort() (However, I could say the same thing for LogFatal so…)

I think mapping these together accomplishes the goal and would be a good solution. I can’t think of any reason that this would cause issues.

Yes, the idea would be that LogMessage would call abort() if passed the FATAL level.

I guess one advantage the current implementation has is that it allows LogFatal to be marked as [[noreturn]], presumably letting the compiler further optimize code based on that fact and/or not throwing compilation warnings/errors.