How to debug crashes

Hi!

How do you recommend debugging crashes?

In particular, how do you recommend debugging crashes that happen to our customers and that we cannot reproduce?


Here is what I have tried so far:

I know I can use arm-none-eabi-addr2line to get the source line of the crash. But that usually leads to some generic assert in Zephyr's internals.

  • That's not sufficient to know how Zephyr got there (i.e., which of my calls led up to the crash).
  • There is usually no comment that would explain why the assert is in place and what may have caused it to fail.

I have been trying to get Zephyr to print a stack trace, without success. According to other DevZone threads, there is no way to do that.

As for attaching a debugger: my attempts so far haven't been successful. VS Code behaves as if the debugger were attached and the code running, but the device appears dead. In any case, a debugger doesn't help with debugging customers' problems, and it is very inconvenient even for my local development, so it hasn't been a high priority for me.


So far the least dysfunctional approach I have found is tracing the execution path by placing print statements into the code, which is a painfully slow process as well as an unreliable one (since print statements add delays that often make the problems disappear).

#frustration #crashes

  • An assert is a good start. Not everyone agrees with me, but I think it is a good idea to keep asserts enabled in released applications as well.

    If you also use asserts in your own application (for example, to ensure pointers are not NULL), you will nail down problems more quickly, and you may catch them before Zephyr or the other frameworks you use do. That said, even a generic assert can usually give some hint.
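
    As a rough illustration of application-level asserts, here is a minimal, host-compilable sketch. The names APP_ASSERT, app_assert_failed, last_assert and sensor_read are hypothetical and made up for this example; on a Zephyr target you would more likely use __ASSERT() or __ASSERT_NO_MSG() from <zephyr/sys/__assert.h> with CONFIG_ASSERT enabled. The point is only that the failure records where and what failed before anything else goes wrong.

    ```c
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical record of the most recent assert failure. */
    struct assert_info {
        const char *file;
        int line;
        const char *expr;
    };

    static struct assert_info last_assert;
    static int assert_failures;

    /* Called when a check fails. On a real device this is where you
     * would log the failure and then halt or reboot (assumption). */
    static void app_assert_failed(const char *file, int line, const char *expr)
    {
        last_assert.file = file;
        last_assert.line = line;
        last_assert.expr = expr;
        assert_failures++;
        printf("ASSERT FAILED: %s at %s:%d\n", expr, file, line);
    }

    /* Hypothetical application assert that stays enabled in release builds. */
    #define APP_ASSERT(cond)                                        \
        do {                                                        \
            if (!(cond)) {                                          \
                app_assert_failed(__FILE__, __LINE__, #cond);       \
            }                                                       \
        } while (0)

    /* Example API boundary: check the pointers before using them. */
    static int sensor_read(const int *raw, int *out)
    {
        APP_ASSERT(raw != NULL);
        APP_ASSERT(out != NULL);
        if (raw == NULL || out == NULL) {
            return -1; /* fail safely instead of dereferencing NULL */
        }
        *out = *raw * 2;
        return 0;
    }
    ```

    With checks like these at your own API boundaries, the reported file and line point into your code rather than into a framework's internals.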

    A good practice for embedded applications is to have logging in place and to log what is going on. Use log levels so that everyone can easily distinguish between informational messages, warnings and errors.

    For applications running in the field it is usually not possible to watch every logged message in real time. What can be done instead is to write log messages to a ring buffer stored in a section of RAM that is not cleared by the bootloader or the application. If the application detects at startup that it did not shut down in a controlled way, it reports the reason it rebooted together with the content of the ring buffer. Hopefully there is some way for the development team to get access to at least these crash reports.
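
    The idea can be sketched roughly as follows. This is a host-compilable illustration, not a tested implementation for any particular board: all names (crash_log, crash_log_boot, and so on) and the magic value are hypothetical, and the static variable only stands in for retained RAM. On a real target the structure would have to be placed in a section that survives a reset (in Zephyr, for example, by marking it __noinit), and how the report leaves the device is up to you.

    ```c
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define CRASH_LOG_MAGIC 0xC0FFEE42u /* arbitrary marker, assumption */
    #define CRASH_LOG_SIZE  64u

    struct crash_log {
        uint32_t magic;          /* marks the content as valid */
        uint32_t clean_shutdown; /* set to 1 only on a controlled shutdown */
        uint32_t head;           /* next write position */
        uint32_t count;          /* bytes stored, at most CRASH_LOG_SIZE */
        char data[CRASH_LOG_SIZE];
    };

    /* Stand-in for retained RAM; must not be zeroed at boot on target. */
    static struct crash_log crash_log;

    /* Call once at startup. Returns 1 if the previous run ended without
     * a controlled shutdown and the buffer content is still valid. */
    static int crash_log_boot(void)
    {
        int crashed = 0;

        if (crash_log.magic != CRASH_LOG_MAGIC ||
            crash_log.head >= CRASH_LOG_SIZE ||
            crash_log.count > CRASH_LOG_SIZE) {
            /* Cold boot or corrupted content: start from scratch. */
            memset(&crash_log, 0, sizeof(crash_log));
            crash_log.magic = CRASH_LOG_MAGIC;
        } else if (!crash_log.clean_shutdown) {
            crashed = 1;
        }
        crash_log.clean_shutdown = 0; /* re-armed for the current run */
        return crashed;
    }

    /* Append a message, overwriting the oldest bytes when full. */
    static void crash_log_write(const char *msg)
    {
        for (; *msg != '\0'; msg++) {
            crash_log.data[crash_log.head] = *msg;
            crash_log.head = (crash_log.head + 1u) % CRASH_LOG_SIZE;
            if (crash_log.count < CRASH_LOG_SIZE) {
                crash_log.count++;
            }
        }
    }

    /* Copy the stored bytes, oldest first, into out as a C string.
     * Returns the number of bytes copied (excluding the terminator). */
    static size_t crash_log_read(char *out, size_t out_len)
    {
        size_t n = crash_log.count;
        size_t start = (crash_log.head + CRASH_LOG_SIZE - crash_log.count)
                       % CRASH_LOG_SIZE;

        if (out_len == 0) {
            return 0;
        }
        if (n > out_len - 1) {
            n = out_len - 1;
        }
        for (size_t i = 0; i < n; i++) {
            out[i] = crash_log.data[(start + i) % CRASH_LOG_SIZE];
        }
        out[n] = '\0';
        return n;
    }

    /* Call right before an intentional reboot or power-down. */
    static void crash_log_mark_clean(void)
    {
        crash_log.clean_shutdown = 1;
    }
    ```

    At startup the application would call crash_log_boot() and, if it returns 1, read the buffer with crash_log_read() and send or store the report before normal logging resumes.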

    The tracing framework that is part of Zephyr could also be worth looking into, but I have not used it myself.
