
Weird hardfault

I have a bit of an odd one and would appreciate any input.  We have a fairly complex project that does various things, but some of the time I get a hardfault.  The issue appears to be an instruction access violation (MMFSR->IACCVIOL = 1).  Working backwards, the code that caused this seems to be the nrfx_coredep_delay_us function (in nrfx_coredep.h).
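
For anyone decoding the fault status by hand, here is a minimal sketch of how the bits sit inside SCB->CFSR on an ARMv7-M core. The helper names are hypothetical (they are not part of the SDK); only the bit positions come from the architecture: MMFSR is the low byte, BFSR the next byte, and UFSR the upper half-word.

```c
#include <stdbool.h>
#include <stdint.h>

/* SCB->CFSR layout on ARMv7-M:
 * MMFSR = bits [7:0], BFSR = bits [15:8], UFSR = bits [31:16]. */
#define CFSR_IACCVIOL (1UL << 0) /* MMFSR bit 0: instruction access violation */
#define CFSR_IBUSERR  (1UL << 8) /* BFSR bit 0: instruction bus error */

/* Hypothetical helpers, named for illustration only. */
static inline uint8_t cfsr_mmfsr(uint32_t cfsr)
{
    return (uint8_t)(cfsr & 0xFFu);
}

static inline uint8_t cfsr_bfsr(uint32_t cfsr)
{
    return (uint8_t)((cfsr >> 8) & 0xFFu);
}

static inline uint16_t cfsr_ufsr(uint32_t cfsr)
{
    return (uint16_t)(cfsr >> 16);
}

static inline bool cfsr_is_iaccviol(uint32_t cfsr)
{
    return (cfsr & CFSR_IACCVIOL) != 0;
}

static inline bool cfsr_is_ibuserr(uint32_t cfsr)
{
    return (cfsr & CFSR_IBUSERR) != 0;
}
```

On the target this would be called from the fault handler as cfsr_is_iaccviol(SCB->CFSR), before the register is cleared.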

We're using SDK 17.0.2.  The code in question is below, and the PC points to delay_machine_code:

    __ALIGN(16)
    static const uint16_t delay_machine_code[] = {
        0x3800 + NRFX_COREDEP_DELAY_US_LOOP_CYCLES, // SUBS r0, #loop_cycles
        0xd8fd, // BHI .-2
        0x4770  // BX LR
    };

    typedef void (* delay_func_t)(uint32_t);
    const delay_func_t delay_cycles =
        // Set LSB to 1 to execute the code in the Thumb mode.
        (delay_func_t)((((uint32_t)delay_machine_code) | 1));
    uint32_t cycles = time_us * NRFX_DELAY_CPU_FREQ_MHZ;
    delay_cycles(cycles);

I wasn't entirely sure about the way the code is hand-loaded, so I tried switching to the DWT-based delay instead (since the 52833 supports that).  This produced the same hardfault behaviour, only now it's on

while ((DWT->CYCCNT - cyccnt_initial) < time_cycles)
    {}
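
As a sanity check on the DWT variant, here is a minimal host-side sketch of the arithmetic that loop relies on. The function names are hypothetical; the point is that the unsigned subtraction stays correct when CYCCNT wraps, so the loop condition itself is unlikely to be the source of the fault. On the target, the counter also has to be enabled first (CoreDebug->DEMCR trace enable plus DWT->CTRL CYCCNTENA), which is assumed here rather than shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: convert microseconds to CPU cycles.
 * cpu_mhz would be 64 on an nRF52 (assumption for illustration). */
static inline uint32_t us_to_cycles(uint32_t time_us, uint32_t cpu_mhz)
{
    return time_us * cpu_mhz;
}

/* Hypothetical helper: true once at least `cycles` have passed since
 * `start`. Unsigned subtraction is modulo 2^32, so the result is still
 * correct when the counter wraps between `start` and `now`. */
static inline bool delay_elapsed(uint32_t now, uint32_t start, uint32_t cycles)
{
    return (uint32_t)(now - start) >= cycles;
}
```

With start = 0xFFFFFFF0 and a 64-cycle delay, the check stays false at now = 0x10 (32 cycles elapsed) and becomes true at now = 0x30 (64 cycles elapsed), even though the counter wrapped in between.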

I suspect there's something else going on here.  Our code does use nrf_delay_ms for certain things that need busy waits, which calls through to the above code, but it only hardfaults some of the time.  As above, we're using SDK 17.0.2, nrf52833 on a custom board.

I'm at a bit of a loss on this one; suggestions welcome.

  • Assuming you haven't yet tried the workarounds for the LR corruption bug, try this instead of a bare WFI, or insert it immediately before the WFI:

    #if defined(SOFTDEVICE_PRESENT) &&  (NRF_SD_BLE_API_VERSION>=7)
        // Now using sd_app_evt_wait() which has the same workaround; it is softdevice dependent when BLE is used
        sd_app_evt_wait();
    #else
        // Errata 220: CPU: RAM is not ready when written - Disable IRQ while using WFE
        // Symptoms - Memory is not written in the first cycle after wake-up
        // Consequences - The address of the next instruction is not written to the stack. In stack frame, the link register is corrupted
        // Workaround
        // ==========
        // Enable SEVONPEND to disable interrupts so the internal events that generate the interrupt cause wakeup in __WFE context and not in interrupt context
        // Before: ENABLE_WAKEUP_SOURCE -> __WFE -> WAKEUP_SOURCE_ISR -> CONTINUE_FROM_ISR  next line of __WFE
        // After:  ENABLE_WAKEUP_SOURCE -> SEVONPEND -> DISABLE_INTERRUPTS -> __WFE -> WAKEUP inside __WFE -> ENABLE_interrupts -> WAKEUP_SOURCE_ISR
        // Applications must not modify the SEVONPEND flag in the SCR register when running in priority levels higher than 6 (priority level numerical
        // values lower than 6) as this can lead to undefined behavior with SoftDevice enabled
        //
        // Errata 75: MWU: Increased current consumption
        // This has to be handled by turning off MWU but it is used in SoftDevice
        // see https://infocenter.nordicsemi.com/index.jsp?topic=%2Ferrata_nRF52832_EngB%2FERR%2FnRF52832%2FEngineeringB%2Flatest%2Fanomaly_832_75.html
        //
        // Errata 220: Enable SEVONPEND
        SCB->SCR |= SCB_SCR_SEVONPEND_Msk;
        __disable_irq();
        // Errata 75: MWU Disable
        uint32_t MWU_AccessWatchMask = NRF_MWU->REGIONEN & MWU_ACCESS_WATCH_MASK;
        // Handle MWU if any areas are enabled
        if (MWU_AccessWatchMask)
        {
            NRF_MWU->REGIONENCLR = MWU_AccessWatchMask; // Disable write access watch in region[0] and PREGION[0]
            __WFE();
            __WFE();
            __NOP(); __NOP(); __NOP(); __NOP();
            // Errata 75: MWU Enable
            NRF_MWU->REGIONENSET = MWU_AccessWatchMask;
        }
        else
        {
            __WFE();
            __WFE();
            __NOP(); __NOP(); __NOP(); __NOP();
        }
        __enable_irq();
    #endif
    

  • Hi,

    I was a bit confused too, I thought Ozone was displaying the stack trace in a reversed order, because it didn't match your observation otherwise. But I should have realized it was wrong after having seen the stack frame addresses. Anyway. It's good to have cleared that up.

    alison.lloyd said:
    Commenting it out made it obvious, as then the PC ends up pointing to the nrf_gpio_pin_* functions, which are setting the LEDs.  This also caused the resultant hardfault to change to an IBUSERR instead - the call stack included a call to __WFI(), which is slightly different.

    I think the address/fault change is a result of the project being rebuilt with slightly modified code. I don't see any other reason why adding the clearing of the FPU pending bit should make a difference. Edit: maybe you see the same effect if you change the compiler optimization settings?

    I see you have tagged the post with 'nRF52833'; can you confirm that this is the chip you are using? I'm asking because if you are using the 52832, you may be impacted by errata 220, as suggested.

    Edit:

    The Hardfault library stack frame snapshot now looks like this:

    The lowest bits of the PSR register (the Interrupt Program Status Register field, bits [8:0]) indicate the exception number if the program was running in an interrupt context. Since the field reads zero, the program does not appear to have been in an interrupt handler when the fault occurred.
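
    As a rough illustration of that check, the sketch below pulls the exception number out of a stacked PSR value. The helper names are hypothetical; the field layout (IPSR in bits [8:0], 0 meaning thread mode, 16 and above meaning external interrupt n - 16) is the standard ARMv7-M one.

    ```c
    #include <stdint.h>

    /* IPSR occupies bits [8:0] of the stacked xPSR on ARMv7-M. */
    #define PSR_IPSR_MASK 0x1FFu

    /* Hypothetical helper: exception number active when the frame was
     * stacked. 0 = thread mode (no handler active), 3 = HardFault. */
    static inline uint32_t psr_exception_number(uint32_t psr)
    {
        return psr & PSR_IPSR_MASK;
    }

    /* Hypothetical helper: external IRQ number, or -1 if the stacked
     * context was thread mode or a system exception (numbers 1..15). */
    static inline int32_t psr_irq_number(uint32_t psr)
    {
        uint32_t exc = psr_exception_number(psr);
        return (exc >= 16u) ? (int32_t)(exc - 16u) : -1;
    }
    ```

    A stacked PSR of 0x61000000 gives exception number 0, which is the thread-mode case described above.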

  • Hi Vidar and hmolesworth,

    We're definitely using the '833 - NRF52833-CJAA-A0 if it makes any difference.  https://infocenter.nordicsemi.com/topic/errata_nRF52833_Rev1/ERR/nRF52833/Rev1/latest/err_833_new.html suggests that errata 220 doesn't apply, but I'll give the code you've suggested a go when I'm next in front of the board. 

    Errata 87 does apply, however, hence the code snippet above.

    The debug seems to be pointing to the __WFI() call causing the issues here, although that may be masking something else.  That would be consistent with the issue not showing up during debug, as WFI/WFE behaviour is presumably slightly different when a debugger is attached, to not cause dropouts.

    On a related note, my understanding of how the ARM core in the nRF52 is used is that the WFI and WFE instructions are almost the same, with the difference that one normally uses WFE with SEV to ensure the core actually goes to sleep:

    __SEV();
    __WFE(); // clear SEV bit
    __WFE(); // actually go to sleep

    Further reading here: https://developer.arm.com/documentation/dui0553/a/ .

    Is this correct?  Is there a good reason to use WFE instead of WFI in the Nordic chips?

  • Ok, I've reduced hmolesworth's suggested code to:

      SCB->SCR |= SCB_SCR_SEVONPEND_Msk;
      __disable_irq();
      __WFE();
      __WFE();
      __NOP(); __NOP(); __NOP(); __NOP();
      __enable_irq();

    so we're just looking at the errata 220 stuff.  This doesn't produce a noticeable difference in behaviour - we're back to an IACCVIOL hardfault, but the behaviour is otherwise the same.  The LR and the return address at the top of the stack point to what are probably junk locations outside the normal address ranges.

  • Hi,

    Thanks for confirming. So the 52833 is not impacted by errata 220, though I expect the workaround for it could change the behaviour slightly, as the workaround for errata 87 did, if the problem is indeed timing related.

    Could you please try to put the chip in constant latency mode on startup and see if it makes any difference? Constant latency mode (see Sub-power modes) should have a similar effect on interrupt latency as putting the chip in debug mode has (both modes keep the HF clock running in sleep).

    e.g.

    int main(void)
    {
        // Request the constant latency sub-power mode at startup
        NRF_POWER->TASKS_CONSTLAT = 1;
        // ... rest of the application startup
    }

    With regards to __WFI vs __WFE, I think Anders gives a good summary of it in this post: https://devzone.nordicsemi.com/f/nordic-q-a/490/how-do-you-put-the-nrf51822-chip-to-sleep/2571#2571. The main difference is that __WFE enters sleep conditionally, based on the internal event register, while __WFI does not. Which one to use depends on whether the implementation would otherwise be subject to race conditions.
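
    To make the event register behaviour concrete, here is a small host-side model, not target code. The type and function names are hypothetical, and the model deliberately ignores pending interrupts: it only tracks the one-bit event register that __SEV sets and __WFE consumes.

    ```c
    #include <stdbool.h>

    /* Toy model of the core's one-bit event register (illustration only). */
    typedef struct {
        bool     event_reg; /* set by SEV or by a wake-up event */
        unsigned sleeps;    /* how many times the core would have slept */
    } core_model_t;

    /* SEV: set the local event register. */
    static inline void model_sev(core_model_t *core)
    {
        core->event_reg = true;
    }

    /* WFE: if the event register is set, clear it and return at once;
     * otherwise the core sleeps until the next event. */
    static inline void model_wfe(core_model_t *core)
    {
        if (core->event_reg) {
            core->event_reg = false;
        } else {
            core->sleeps++;
        }
    }

    /* WFI: does not look at the event register at all. */
    static inline void model_wfi(core_model_t *core)
    {
        core->sleeps++;
    }
    ```

    In this model the SEV; WFE; WFE sequence always sleeps exactly once, whether or not an event was already latched beforehand, while a single WFE falls straight through if an event arrived just before it. That fall-through is what lets a WFE-based loop avoid the race where the wake-up source fires between the "anything to do?" check and the sleep instruction.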

    Edit: Would you be able to share the full source code here or in a private support ticket so I can try to debug it here, or does it require your custom HW to run it?

    Edit 2: Have you tested this on a Nordic DK? I'm starting to think that this may be a HW issue. I'd suggest going over the decoupling caps used on your board to see if they match the values we recommend in our reference design here:  Reference circuitry
