
Inconsistent behavior when debugging

I'm not even sure exactly how to describe this problem, so please bear with me.  I have a setup with two nRF52840s.  One of them (call it B) functions as a peripheral server with two available connections, one for the phone/host and one for the other unit.  The other (call it A) functions as a central device with respect to the other unit and as a peripheral device to the phone/host.  I am having an issue where, during a coordinated activity (the A unit and B unit play some audio via I2S after some signaling from the A unit), the B unit somewhat mysteriously seems to reset, but I never see it in the debugger.  I'm using SES, and even though it's clear that the unit resets, I don't catch anything with the debugger.

So I decided to turn to Ozone since it is more powerful.  I am pretty sure that the problem is somewhere in my I2S code.  I have two buffers, each 2176 bytes (the size of a page of the flash IC), one in RAM6 and one in RAM7 in order to prevent bus contention.  I fill both buffers before starting I2S, then I fill the released buffer using SPI from an external flash IC while the other one is playing.  I decided to cordon off what I think is the offending code with a degenerate counter:

if (!p_released->p_tx_buffer)
{
    // Code to assign next buffers
}
else
{
    // The driver has just finished accessing the buffers pointed by
    // 'p_released'. They can be used for the next part of the transfer
    // that will be scheduled now.

    if (++count != SOME_VALUE)
    {
        // Set next buffers and start filling released buffer
    }
    else
    {
        // Set next buffers and start filling released buffer
        // Identical to above
    }
}
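
For readers following along, the two-buffer handoff described above can be modelled on a host machine roughly like this. It is only a sketch under assumptions: `pingpong_t`, `fill_from_flash`, `pingpong_init` and `pingpong_on_released` are hypothetical names, the flash read is faked with `memset`, and none of this is the nrfx I2S driver API.

```c
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 2176u /* one page of the external flash IC, as in the post */

/* Hypothetical model of the two playback buffers. */
typedef struct {
    uint8_t  data[2][BUF_SIZE];
    int      playing;   /* index of the buffer the (simulated) I2S is reading */
    uint32_t next_page; /* next flash page to fetch */
} pingpong_t;

/* Stand-in for the SPI flash read: fills the buffer with the page number. */
static void fill_from_flash(uint8_t *dst, uint32_t page)
{
    memset(dst, (int)(page & 0xFFu), BUF_SIZE);
}

/* Fill both buffers before "starting" playback from buffer 0. */
void pingpong_init(pingpong_t *pp)
{
    pp->next_page = 0;
    fill_from_flash(pp->data[0], pp->next_page++);
    fill_from_flash(pp->data[1], pp->next_page++);
    pp->playing = 0;
}

/* Called when the playing buffer has been consumed: playback moves to the
 * other buffer and the released one is refilled with the next page.
 * Returns the index of the buffer that was released and refilled. */
int pingpong_on_released(pingpong_t *pp)
{
    int released = pp->playing;
    pp->playing ^= 1;
    fill_from_flash(pp->data[released], pp->next_page++);
    return released;
}
```

The point of the model is the ordering: the refill always targets the buffer that was just released, never the one currently playing, so the refill has to finish within one buffer period.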

Here is the behavior I am seeing:

  • If I run without any breakpoints, I get about a second of audio (something like 60 buffers' worth), then a reset of the B unit.  I hit no breakpoints in any fault or error handlers (including vector catch on reset), and Ozone shows an error message telling me that the debugger was disconnected.  In SES I see nothing at all even though the unit has clearly reset.
  • If I put a breakpoint in the "else" block above, the audio plays fine until I hit the counter, which can be many seconds of audio.  If I use SES it will play all the way to the end of the audio clip.
  • If I put a conditional breakpoint in the "if" block above and set a condition like "count == 200", the application enters the error handler (the default SDK app_error_fault_handler) with case NRF_FAULT_ID_SDK_ERROR (id = 0x4001), but there is no file name in the memory pointed to by p_info->p_file_name and the line number is clearly invalid (values in the billions).

I'm not really sure what to do with this behavior.  Clearly something is making the application work when I debug it a certain way, but I'm not sure how to figure out what that is.  I am almost certain that at some point I have some sort of memory contention, but without being able to recreate it reliably and see what the values are at given breakpoints, it is very challenging to troubleshoot.  Has anyone else seen something similar?  What did you do to get past it?

**EDIT**

I still am interested in how one might debug a situation like this, but I am starting to have a gut feeling.  Is it possible that the I2S hasn't actually released the buffer pointed to by p_released at the time of the interrupt handler?  I vaguely recall having seen someone mention this somewhere, but now I can't find it.  If this is the case, it could be the source of our problem since I kick off filling the next buffer pretty quickly in the callback.

  • Hello,

    I'm afraid I can't think of any good explanation as to why the devices would behave differently depending on where you placed the breakpoints. It's also uncommon to lose the debugger connection to the nRF device unless there is a HW problem (drop in supply voltage, bad connection, etc). For memory corruption bugs I'd usually expect to see either a hardfault exception or, in the worst case, a CPU lockup reset (you should have gotten the vector catch in that case).

    If you haven't already, I would suggest that you try to monitor the RESETREAS register on boot to figure out what the reset source is.

    Best regards,

    Vidar

  • Thank you, this ended up pretty much getting me there.  We definitely ran into a voltage drop when the hardware was playing audio (amplifier IC draws around 2.5W) that must have been causing the debugger to drop the connection and the system to reset.  I will say though that I was never able to check RESETREAS for this, all the registers always came up as 0xAAAAAAAA and 0xDEADBEEF for the CPU regs when I would try to connect after reset.  I suspect that the J-Link is having trouble reconnecting to the unit after the voltage drop event for some reason.

    Which brings me to a follow-up question that I am happy to put in its own thread if you like.  We have a handful of devices (I think 6 or 8 of them now) that acted funny, and now function perfectly well, except that they only connect and pair to other devices if they are literally less than about 2cm apart, otherwise they do not connect at all.  Everything else functions as expected and erasing them doesn't seem to have any effect.  Have you ever seen this before?  My guess is that they were subject to this voltage drop condition at some point and it somehow affected only that part of the device that supplies power to the antenna?  Is this possible?  It seems weird because I would be okay with the idea if it was a voltage spike (up), but a voltage sag/drop seems like it shouldn't have such an effect.  Yet it seems to be somewhat repeatable for us.  I think with a little work, I could maybe adapt a firmware that breaks nRF52840s in this manner.

  • Thanks for the update. Good to hear that you found the problem. To read the RESETREAS register after the reset you must first tell the debugger to re-connect because it doesn't know that the connection has been lost. You can use nrfjprog for this.

    Reading RESETREAS register after reset

    nrfjprog --memrd 0x40000400 // Should return 0x0 if you encountered a brownout reset
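
    A value read this way can be decoded with a small helper along the following lines. This is a sketch, not SDK code: `resetreas_describe` is a hypothetical name, and the bit positions are the ones I understand the nRF52840 product specification to list for RESETREAS, so please check them against the datasheet for your part.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* RESETREAS bit positions as assumed from the nRF52840 product
 * specification; verify against the datasheet before relying on them. */
static const struct {
    uint32_t    mask;
    const char *name;
} k_reasons[] = {
    { UINT32_C(1) << 0,  "RESETPIN" }, /* pin reset */
    { UINT32_C(1) << 1,  "DOG"      }, /* watchdog */
    { UINT32_C(1) << 2,  "SREQ"     }, /* soft reset request */
    { UINT32_C(1) << 3,  "LOCKUP"   }, /* CPU lockup */
    { UINT32_C(1) << 16, "OFF"      }, /* wake from System OFF via GPIO */
    { UINT32_C(1) << 17, "LPCOMP"   },
    { UINT32_C(1) << 18, "DIF"      }, /* debug interface */
    { UINT32_C(1) << 19, "NFC"      },
    { UINT32_C(1) << 20, "VBUS"     },
};

/* Hypothetical helper: writes a '|'-separated list of the set flags to out.
 * A value of 0 means no flag was latched, which is what a power-on or
 * brownout reset leaves behind. */
void resetreas_describe(uint32_t value, char *out, size_t out_len)
{
    size_t used = 0;

    if (out_len == 0) {
        return;
    }
    out[0] = '\0';
    if (value == 0) {
        snprintf(out, out_len, "POWER_ON_OR_BROWNOUT");
        return;
    }
    for (size_t i = 0; i < sizeof k_reasons / sizeof k_reasons[0]; ++i) {
        if (value & k_reasons[i].mask) {
            int n = snprintf(out + used, out_len - used, "%s%s",
                             used ? "|" : "", k_reasons[i].name);
            if (n < 0 || (size_t)n >= out_len - used) {
                return; /* output truncated, stop here */
            }
            used += (size_t)n;
        }
    }
    if (used == 0) {
        snprintf(out, out_len, "UNKNOWN");
    }
}
```

    In firmware the same value could be read from `NRF_POWER->RESETREAS` early in `main` (or through the SoftDevice API once the SoftDevice is enabled). The register is cumulative, so it is worth clearing it after reading so that the next boot only shows the latest reset.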

    With regards to your follow up question, do I understand it correctly that you achieve a more normal range as soon as you have managed to establish the connection and completed the pairing, or do you still have to keep them 2cm apart? I can't say I have seen that before. Are there any additional noise sources during this phase like high speed bus communication (SPI, UART, PWM, etc) that could possibly interfere with the RADIO?

  • Thanks Vidar!  The 2cm behavior seems to persist forever with the device, regardless of what we do to it.  Even if I erase the device and just run ble_app_blinky, we get the same behavior.  And yes they only work within 2cm even after pairing - if I move them apart they lose connection and start advertising again.

  • Thanks for confirming. Because this sounds more like a HW issue, I think it's better if you create a new ticket. My more HW-oriented colleagues can give better advice when it comes to troubleshooting poor radio performance.

  • I never had a need to make a new ticket because we (I think) found the problem.  For the sake of posterity:

    We have an external 3V3 supply feeding an audio amp, which draws enough power that it sometimes pulls the battery voltage down.  This was causing the reset, but it was also causing the 3V3 regulator to go into boost mode.  Then the battery would quickly rebound and the regulator would take a moment to switch back to buck mode.  In the meantime it would produce a voltage spike above the nRF52840 VDDH spec, which in turn must have done something to sometimes irrevocably latch the antenna into low power mode.  Since it only happens rarely, it was very difficult to find.

    Thanks again for your help!
