Random bad instruction faults at 0x16274 with softdevice s132 version 7.0.1. SDK 16.0.0

Hello,

We are attempting to track down a rare bug in our code that causes the device to occasionally hard fault.

We have been able to catch it three times, twice where it attempted to execute an instruction at 0x80000000 and once at 0x60000000.
The state of the device is almost the same for all cases, with the working registers having the same values except R12 and R7.
Even the link register is the same for both with a return destination of 0x00016275.
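
For readers decoding a similar dump, here is a minimal host-side sketch of how the stacked exception frame can be sanity-checked. The struct follows the basic Cortex-M frame layout. The flash and RAM bounds are assumptions for an nRF52832 with 512 kB flash and 64 kB RAM, and the helper names are hypothetical rather than part of any SDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Basic Cortex-M exception frame as it sits in memory, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} stacked_frame_t;

/* Assumed memory map (nRF52832, 512 kB flash starting at 0, 64 kB RAM). */
#define FLASH_END  UINT32_C(0x00080000)
#define RAM_START  UINT32_C(0x20000000)
#define RAM_END    UINT32_C(0x20010000)

/* True if addr falls inside on-chip flash or RAM, where code can live. */
static bool addr_is_executable(uint32_t addr)
{
    bool in_flash = addr < FLASH_END;
    bool in_ram = (addr >= RAM_START) && (addr < RAM_END);
    return in_flash || in_ram;
}

/* A stacked LR carries the Thumb bit in bit 0; clear it to get the address. */
static uint32_t return_address(uint32_t lr)
{
    return lr & ~UINT32_C(1);
}

/* Hypothetical check: flags a stacked PC that points outside flash and RAM. */
static bool frame_pc_is_corrupt(const stacked_frame_t *frame)
{
    assert(frame != NULL);
    return !addr_is_executable(frame->pc);
}
```

With the values from this report, a stacked PC of 0x80000000 or 0x60000000 is flagged as corrupt, and the link register value 0x00016275 corresponds to a return address of 0x16274 once the Thumb bit is cleared.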

Since the program is executing in the SoftDevice region, could someone look at what is happening in and around that address?
I am hoping it might give us more insight into why these hard faults happen.
I checked, and our application stack has not overflowed.
Otherwise, any insights or pointers would be appreciated.

Thanks,

James


[Screenshots: instruction fault at 0x80000000; instruction fault at 0x60000000]

  • Hello James,

    This looks very similar to errata 220: "CPU: RAM is not ready when written". Have you considered it already? If that is the cause, the stacked register values you posted above are from the idle loop inside the SoftDevice (invoked when you call sd_app_evt_wait()), except for the corrupted PC, which is a result of this HW issue. The other registers are valid because they are always pushed onto the stack after the PC register.

    This is a very rare bug that only occurs in implementations that manage to meet the exact timings outlined in the errata document I linked to above.

    It should not be possible to trigger the issue while the chip is in debug interface mode either. I assume you connected the debugger after the crash when you read the stack?

    Best regards,

    Vidar

  • That errata definitely fits.
    It is exactly as you say: we have not been able to catch it with the debugger connected.
    Further, we are using the TWI peripheral with DMA, so our specific implementation fits.

    We used a service which saves a coredump to flash on a hardfault, then transmits it over BLE.
    That is how we were able to catch the bug in action.

    Please correct me if I am wrong.
    We will have to update the SDK to at least version 17.0.2 so that we can use SoftDevice v8, mainly because we do not have access to the loop within the SoftDevice to add the fix, and we need to use the FPU in our production code.

  • I think you can stay on the current SDK and Softdevice version if you do one of the following changes to the application code:

    1. Replace the sd_app_evt_wait() call in your app with the errata workaround. If you do this, the application will continue to enter System ON as before. The main difference is that sd_app_evt_wait() only returns on application interrupts, whereas with the workaround you will also get wakeups on Softdevice interrupts (RADIO, TIMER0, RTC0, etc).

    Another difference is that sd_app_evt_wait() implements the workaround for [75] MWU: Increased current consumption, so you would have to do the same to avoid a power penalty.

    The combined workaround for errata 220 and 75 may look something like this:

    /* Disable MWU during sleep as a workaround for errata #75:
     * https://infocenter.nordicsemi.com/topic/errata_nRF52832_Rev3/ERR/nRF52832/Rev3/latest/anomaly_832_75.html
     */

    /* Re-enable write-access watching for RAM region 0 and peripheral region 0. */
    #define MWU_ENABLE()                                                                            \
    do                                                                                              \
    {                                                                                               \
        NRF_MWU->REGIONENSET = ((MWU_REGIONENSET_RGN0WA_Set << MWU_REGIONENSET_RGN0WA_Pos) |        \
        (MWU_REGIONENSET_PRGN0WA_Set << MWU_REGIONENSET_PRGN0WA_Pos));                              \
    }                                                                                               \
    while (0)

    /* Stop watching the same two regions, using the REGIONENCLR bitfield names. */
    #define MWU_DISABLE()                                                                           \
    do                                                                                              \
    {                                                                                               \
        NRF_MWU->REGIONENCLR = ((MWU_REGIONENCLR_RGN0WA_Clear << MWU_REGIONENCLR_RGN0WA_Pos) |      \
        (MWU_REGIONENCLR_PRGN0WA_Clear << MWU_REGIONENCLR_PRGN0WA_Pos));                            \
    }                                                                                               \
    while (0)

    /* SEVONPEND must be enabled when calling this function */
    __STATIC_INLINE void evt_wait(void)
    {
        /* Errata 220: sleep with interrupts masked so the wakeup is not
         * immediately followed by exception stacking.
         */
        __disable_irq();
        MWU_DISABLE();
        __WFE();
        __NOP();__NOP();__NOP();__NOP();
        __enable_irq();
        /* Note: MWU is enabled internally in the SoftDevice during
         * forwarding of application interrupts
         */
        MWU_ENABLE();
    }
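
The comment above requires SEVONPEND to be set. With interrupts masked by __disable_irq(), a newly pended interrupt only wakes the __WFE() if SEVONPEND is set in the System Control Register. The following is a minimal sketch of setting that bit. The helper name and the pointer parameter are hypothetical, chosen so the function can also be exercised off-target; on the device you would pass &SCB->SCR and would normally use the CMSIS mask SCB_SCR_SEVONPEND_Msk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SEVONPEND is bit 4 of the Cortex-M System Control Register (SCR). */
#define SCR_SEVONPEND_BIT (UINT32_C(1) << 4)

/* Hypothetical helper: sets SEVONPEND and leaves the other SCR bits alone.
 * On target: sevonpend_enable(&SCB->SCR);
 */
static void sevonpend_enable(volatile uint32_t *scr)
{
    assert(scr != NULL);
    *scr |= SCR_SEVONPEND_BIT;
}
```

Calling this once during initialization, before the main loop first calls evt_wait(), should be enough, assuming nothing else clears the bit afterwards.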
    
    

    2. Disable stacking of FPU registers and only do float operations from a single task running in the main context. There is an app note from ARM that explains how to configure this: https://developer.arm.com/documentation/dai0298/a/. Be aware, however, that some compilers may in some cases use floating-point registers as general-purpose registers, as discussed in this thread: https://github.com/zephyrproject-rtos/zephyr/issues/29590 
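
As a rough illustration of option 2, the sketch below clears the ASPEN and LSPEN bits of the FPU context control register (FPCCR), which is the configuration the ARM app note describes for turning off automatic and lazy FP state stacking. The helper name and pointer parameter are hypothetical so the logic can be checked off-target; on the device you would pass &FPU->FPCCR and would probably follow the write with __DSB() and __ISB(). Whether this is sufficient for a given application is an assumption to verify, since no interrupt handler may then touch the FPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FPCCR bit positions on Cortex-M4F: ASPEN is bit 31, LSPEN is bit 30. */
#define FPCCR_ASPEN_BIT (UINT32_C(1) << 31)
#define FPCCR_LSPEN_BIT (UINT32_C(1) << 30)

/* Hypothetical helper: disables automatic and lazy FP context stacking
 * while preserving the remaining FPCCR bits.
 * On target: fpu_disable_context_stacking(&FPU->FPCCR);
 */
static void fpu_disable_context_stacking(volatile uint32_t *fpccr)
{
    assert(fpccr != NULL);
    *fpccr &= ~(FPCCR_ASPEN_BIT | FPCCR_LSPEN_BIT);
}
```

With both bits cleared, exception entry pushes only the basic eight-word frame, so all floating-point work has to stay in the single main-context task mentioned above.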
