How can one debug a Zephyr RTOS application which crashes only with the debugger detached?

This is a cross-post from my question on StackOverflow here. This might be a more appropriate channel for this question.

I have a custom board using the u-blox BMD-350 module (Nordic nRF52832). The problem occurs on both my custom board and the u-blox development board. I'm using Nordic's nRF Connect SDK version 1.9.0, which uses Zephyr RTOS version v2.7.99-ncs1. The development environment and debugger are Visual Studio Code with the nRF Connect extensions.

I can connect with a debugger via SWD, and I can monitor things using either Segger RTT logging output or the Bluetooth telemetry the device is transmitting. I'm running into a particularly difficult problem to debug: at startup and for the first couple of seconds things run smoothly (meaning I'm seeing Bluetooth telemetry come in at the expected rate), but then the device seems to crash and stops outputting Bluetooth telemetry. However, this does not happen if I'm connected with the debugger, and it also typically doesn't happen (or at least not very quickly) when I'm connected via RTT. When I connect to RTT after a crash, I don't really see any helpful feedback, just that the logging has stopped (often midway through a message). Of note, it also doesn't seem to crash unless there is a Bluetooth client connected. The Nordic NUS service is being used almost exactly as it is in the demo, except that the device is configured as a peripheral.

Since I'm not able to get a crash to happen in a debug session, I'm at a bit of a loss how to debug this issue. It doesn't seem to be generating a core dump or a usable error as far as I can see (I have the appropriate flags specified), so I wonder if there is a weird locking condition that's hosing the system (but only outside of the debugger or an active RTT session).

Any thoughts as to how one might go about debugging a bug like this?

  • Hello Anthony,

    Any thoughts as to how one might go about debugging a bug like this?

    You could try outputting something on the UART in order to get an understanding of the problem.

    But in this case I have a theory already: BMD-350 doesn't have a crystal on it. You need to configure the nRF52832 to use the internal RC.

    Regards,

    Elfving

  • I was really hoping this would work, easy fix. However, both the development kit and my custom board have crystal oscillators onboard. I did try other clock sources anyway, all yield the same behavior.

    The UART has not been helpful either: its output also stops in this failure mode. I can still connect to and disconnect from the device over Bluetooth, but there is no longer any Bluetooth activity. It almost seems like the device is being put to sleep or something, but attempting to connect via RTT doesn't correct the issue. And again, with RTT or the debugger (SWD) connected from the start, the issue does not arise.

    There doesn't seem to be a way (at least in VSCode or through west) to attach a debugger to a chipset that is already running without restarting the MCU and losing that error state. Although if it is just putting itself to sleep, I might not see the error state anyway.
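
    One way to keep a little state across that restart is a "breadcrumb" record that the firmware updates at checkpoints and inspects at the next boot. Below is a minimal, host-testable sketch of the idea in plain C. All names (`crumb`, `crumb_set`, `crumb_take`, `CRUMB_MAGIC`) are hypothetical and not part of Zephyr or the nRF Connect SDK. It also assumes the record is placed in RAM that is not re-initialised on a warm reset, which is not shown here and would need to be verified on the target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker value; any constant unlikely to appear by chance works. */
#define CRUMB_MAGIC 0xC0FFEE5Au

/* Breadcrumb record. On target this is assumed to live in RAM that
 * survives a warm reset (placement is not shown in this sketch). */
struct crumb {
    uint32_t magic;      /* marks the record as written by crumb_set() */
    uint32_t checkpoint; /* last checkpoint the firmware reached */
    uint32_t check;      /* magic XOR checkpoint, rejects random RAM contents */
};

/* Record that execution reached the given checkpoint. */
void crumb_set(struct crumb *c, uint32_t checkpoint)
{
    c->magic = CRUMB_MAGIC;
    c->checkpoint = checkpoint;
    c->check = CRUMB_MAGIC ^ checkpoint;
}

/* True only if the record was written by crumb_set() and is intact. */
bool crumb_valid(const struct crumb *c)
{
    return c->magic == CRUMB_MAGIC &&
           c->check == (c->magic ^ c->checkpoint);
}

/* Read the last checkpoint, then clear the record so a stale value
 * is not reported again on the following boot. Returns false if no
 * valid record was found. */
bool crumb_take(struct crumb *c, uint32_t *checkpoint_out)
{
    if (!crumb_valid(c)) {
        return false;
    }
    *checkpoint_out = c->checkpoint;
    c->magic = 0;
    c->checkpoint = 0;
    c->check = 0;
    return true;
}
```

    In use, the firmware would call `crumb_set()` with a different number before each suspicious step and call `crumb_take()` early at boot, reporting the result over whatever channel still works (here, Bluetooth telemetry). This only narrows down where execution stopped; it does not explain why.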

  • Hi Anthony, sorry about the wait.

    Anthony W said:
    However, both the development kit and my custom board have crystal oscillators onboard. I did try other clock sources anyway, all yield the same behavior.

    Did you try both high and low frequency clocks? It seems to me that the BMD-350 doesn't have an external low-frequency one. If your project isn't configured to use the internal RC, it might be an issue. You can do this by adding CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC=y to your configuration file.
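
    As a rough sketch, the relevant lines in the application's prj.conf might look like the fragment below. Only the first option is the one named above; the accuracy option and the commented-out crystal alternative are my additions and should be checked against the Kconfig reference for the SDK version in use.

```
# prj.conf (fragment)

# Use the internal RC oscillator as the 32.768 kHz low-frequency clock source.
CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC=y

# Assumption: declare an accuracy that matches the RC oscillator.
CONFIG_CLOCK_CONTROL_NRF_K32SRC_500PPM=y

# Assumption: on a board that does have an external 32.768 kHz crystal,
# the crystal source would be selected instead of the RC option above.
# CONFIG_CLOCK_CONTROL_NRF_K32SRC_XTAL=y
```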

    If that doesn't help, have you tried our other examples as well? For instance, does the problem also occur when you run the peripheral UART sample?

    If it occurs there as well, could you provide me a sniffer trace?

    Regards,

    Elfving

  • Actually, I finally found the issue yesterday, so I can (somewhat) fill in here. You're correct, the BMD-350 module does not have an external LF clock source, but both the dev board and the custom board do, accurate to 20ppm (on mine, not sure about the dev board). But that wasn't the issue. I based my code off of the peripheral uart sample actually, but I wasn't using UART at all. I removed all of the UART-specific implementation code (basically, just initialized BT and NUS) and the problem went away. I didn't debug exactly why that code was causing a failure (since it was unused in my application anyway and I don't care to keep it around) but I suspect it might be because the UART was not initialized (or even present on the custom board) and somehow this non-existent UART was otherwise interfering with BT operation. It seems like a stretch, and none of the code I removed should have been reached in operation, but removing it fixed the issue. So I guess it will remain a partial mystery for now.

    Thank you for your advice though, in particular pointing out the clock source. While it wasn't necessarily applicable to this issue I do have a greater appreciation for the importance of the clocks on BT operation.

  • Great!

    I do not have a good explanation for why the presence of the UART code would be a problem, but I have actually seen this being the case before. A quick search gave me, for instance, this case here, which looks very similar.

    Anthony W said:
    but I suspect it might be because the UART was not initialized (or even present on the custom board) and somehow this non-existent UART was otherwise interfering with BT operation.

    That might be. But in any case, I am glad you figured it out.

    Regards,

    Elfving

  • In fact, before removing the peripheral UART code, that "lfclk_spinwait" is exactly where I found it getting stuck, so something similar was going on I bet. Thanks for your help!
