Weird hardfault

I have a bit of an odd one, and would appreciate any input.  We have a fairly complex project, which does various things, but some of the time, I get a hardfault.  The issue appears to be an instruction access violation (MMFSR->IACCVIOL = 1).  Working backwards, the code that caused this seems to be the nrfx_coredep_delay_us function (in nrfx_coredep.h).
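
For reference, the MMFSR is the low byte of the Cortex-M4 Configurable Fault Status Register (SCB->CFSR in CMSIS), so the flag can be picked out of a raw register dump. The following is a minimal host-side sketch of that decoding, not SDK code: the table layout and the helper name cfsr_decode are made up for illustration, and the bit positions are the ARMv7-M CFSR ones.

```c
#include <stddef.h>
#include <stdint.h>

/* One named status bit of the ARMv7-M CFSR (MMFSR | BFSR << 8 | UFSR << 16). */
typedef struct {
    uint32_t    mask;
    const char *name;
} cfsr_flag_t;

static const cfsr_flag_t cfsr_flags[] = {
    /* MemManage Fault Status Register, bits 0..7 */
    { 1u << 0,  "IACCVIOL"    }, { 1u << 1,  "DACCVIOL"  },
    { 1u << 3,  "MUNSTKERR"   }, { 1u << 4,  "MSTKERR"   },
    { 1u << 5,  "MLSPERR"     }, { 1u << 7,  "MMARVALID" },
    /* BusFault Status Register, bits 8..15 */
    { 1u << 8,  "IBUSERR"     }, { 1u << 9,  "PRECISERR" },
    { 1u << 10, "IMPRECISERR" }, { 1u << 11, "UNSTKERR"  },
    { 1u << 12, "STKERR"      }, { 1u << 13, "LSPERR"    },
    { 1u << 15, "BFARVALID"   },
    /* UsageFault Status Register, bits 16..31 */
    { 1u << 16, "UNDEFINSTR"  }, { 1u << 17, "INVSTATE"  },
    { 1u << 18, "INVPC"       }, { 1u << 19, "NOCP"      },
    { 1u << 24, "UNALIGNED"   }, { 1u << 25, "DIVBYZERO" },
};

/* Hypothetical helper: stores the names of all set flags in names[]
   (at most max_names entries) and returns how many were stored. */
size_t cfsr_decode(uint32_t cfsr, const char *names[], size_t max_names)
{
    size_t count = 0;
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0]; i++) {
        if ((cfsr & cfsr_flags[i].mask) != 0u && count < max_names) {
            names[count++] = cfsr_flags[i].name;
        }
    }
    return count;
}
```

On the target, the value to pass in would be SCB->CFSR as read in the fault handler or in the debugger; a value of 0x00000001 decodes to IACCVIOL alone.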

We're using SDK 17.0.2. The code in question is shown here; the PC points to delay_machine_code:

    __ALIGN(16)
    static const uint16_t delay_machine_code[] = {
        0x3800 + NRFX_COREDEP_DELAY_US_LOOP_CYCLES, // SUBS r0, #loop_cycles
        0xd8fd, // BHI .-2
        0x4770  // BX LR
    };

    typedef void (* delay_func_t)(uint32_t);
    const delay_func_t delay_cycles =
        // Set LSB to 1 to execute the code in the Thumb mode.
        (delay_func_t)((((uint32_t)delay_machine_code) | 1));
    uint32_t cycles = time_us * NRFX_DELAY_CPU_FREQ_MHZ;
    delay_cycles(cycles);
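
To make the hand-assembled routine above easier to follow, here is a small host-side model of its three Thumb halfwords. It is only a sketch: simulate_delay_loop is a hypothetical name, the model tracks just r0 and the C and Z flags, and the loop cost of 3 cycles used in the examples is an assumed value for NRFX_COREDEP_DELAY_US_LOOP_CYCLES rather than something stated in this post.

```c
#include <stdint.h>

/* Hypothetical host model of the delay routine. Interprets the same three
   halfwords and returns how many times the SUBS instruction executed.
   loop_cycles must be non-zero, otherwise the loop never terminates. */
uint32_t simulate_delay_loop(uint32_t r0, uint8_t loop_cycles)
{
    const uint16_t code[] = {
        (uint16_t)(0x3800u + loop_cycles), /* SUBS r0, #loop_cycles */
        0xd8fdu,                           /* BHI  .-2              */
        0x4770u                            /* BX   LR               */
    };
    int32_t  pc = 0;          /* index into code[], in halfwords */
    uint32_t iterations = 0;
    int carry = 0;
    int zero = 0;

    for (;;) {
        uint16_t op = code[pc];
        if ((op & 0xF800u) == 0x3800u) {        /* SUBS Rdn, #imm8 with Rdn = r0 */
            uint32_t imm = op & 0xFFu;
            carry = (r0 >= imm);                /* C is set when there is no borrow */
            r0 -= imm;
            zero = (r0 == 0u);
            iterations++;
            pc++;
        } else if ((op & 0xFF00u) == 0xD800u) { /* B<cond> with cond = HI */
            int32_t off = (int32_t)(op & 0xFFu);
            if (off >= 128) {
                off -= 256;                     /* sign-extend imm8 */
            }
            if (carry && !zero) {
                pc = pc + 2 + off;              /* target = PC (this + 4 bytes) + 2 * imm8 */
            } else {
                pc++;
            }
        } else {                                /* BX LR: return to the caller */
            return iterations;
        }
    }
}
```

Under the assumed loop cost of 3, a 1000 us delay at 64 MHz passes 64000 in r0 and the SUBS executes 21334 times. The BHI offset of 0xfd (-3 halfwords relative to the PC, which reads 4 bytes ahead) lands back on the SUBS, which is what the ".-2" comment describes.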

I wasn't entirely sure about the way the code is hand-loaded in, so I tried switching to using the DWT instead (since the 52833 supports that).  This produced the same hardfault behaviour, only now it's on:

while ((DWT->CYCCNT - cyccnt_initial) < time_cycles)
    {}
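
As an aside, the comparison in that loop relies on unsigned 32-bit subtraction, so it stays correct when CYCCNT wraps past 0xFFFFFFFF. A minimal host-side sketch of the same check, with delay_elapsed as a hypothetical helper name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring the loop condition: true once at least
   `cycles` counter ticks have passed since `start`. The subtraction is
   done modulo 2^32, so a counter wrap between start and now is harmless. */
bool delay_elapsed(uint32_t now, uint32_t start, uint32_t cycles)
{
    return (uint32_t)(now - start) >= cycles;
}
```

For example, with start = 0xFFFFFFF0 and now = 0x00000010 the difference is 0x20, so a 0x20-cycle delay is reported as elapsed even though the raw counter value went down.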

I suspect there's something else going on here.  Our code does use nrf_delay_ms for certain things that need busy waits, which calls through to the above code, but it only hardfaults some of the time.  As above, we're using SDK 17.0.2, nrf52833 on a custom board.

I'm at a bit of a loss on this one, suggestions welcome?

Parents
  • Hello,

    I'm not sure I have encountered this type of fault before, but I found a blog post that suggests this error is raised when the CPU attempts to execute code from a protected memory region.

    IACCVIOL - Indicates that an attempt to execute an instruction triggered an MPU or Execute Never (XN) fault. We’ll explore an example below.

     https://interrupt.memfault.com/blog/cortex-m-fault-debug

    Could this be it, or are you not using the MPU in your project?

    Best regards,

    Vidar

  • Hi Vidar,

    Thanks for the article link, I'd actually already run across it (it's super useful for figuring out the debug). 

    We are not directly using either the Cortex MPU or the Nordic MWU.  Are either of those used by any Nordic libraries or drivers?

    Best,

      Alison

  • Assuming you haven't yet tried the workarounds for the LR corruption bug, try this instead of a bare WFI, or perhaps insert it immediately before the WFI:

    #if defined(SOFTDEVICE_PRESENT) &&  (NRF_SD_BLE_API_VERSION>=7)
        // Now using sd_app_evt_wait() which has the same workaround; it is softdevice dependent when BLE is used
        sd_app_evt_wait();
    #else
        // Errata 220: CPU: RAM is not ready when written - Disable IRQ while using WFE
        // Symptoms - Memory is not written in the first cycle after wake-up
        // Consequences - The address of the next instruction is not written to the stack. In stack frame, the link register is corrupted
        // Workaround
        // ==========
        // Enable SEVONPEND to disable interrupts so the internal events that generate the interrupt cause wakeup in __WFE context and not in interrupt context
        // Before: ENABLE_WAKEUP_SOURCE -> __WFE -> WAKEUP_SOURCE_ISR -> CONTINUE_FROM_ISR  next line of __WFE
        // After:  ENABLE_WAKEUP_SOURCE -> SEVONPEND -> DISABLE_INTERRUPTS -> __WFE -> WAKEUP inside __WFE -> ENABLE_interrupts -> WAKEUP_SOURCE_ISR
        // Applications must not modify the SEVONPEND flag in the SCR register when running in priority levels higher than 6 (priority level numerical
        // values lower than 6) as this can lead to undefined behavior with SoftDevice enabled
        //
        // Errata 75: MWU: Increased current consumption
        // This has to be handled by turning off MWU but it is used in SoftDevice
        // see https://infocenter.nordicsemi.com/index.jsp?topic=%2Ferrata_nRF52832_EngB%2FERR%2FnRF52832%2FEngineeringB%2Flatest%2Fanomaly_832_75.html
        //
        // Errata 220: Enable SEVONPEND
        SCB->SCR |= SCB_SCR_SEVONPEND_Msk;
        __disable_irq();
        // Errata 75: MWU Disable
        uint32_t MWU_AccessWatchMask = NRF_MWU->REGIONEN & MWU_ACCESS_WATCH_MASK;
        // Handle MWU if any areas are enabled
        if (MWU_AccessWatchMask)
        {
            NRF_MWU->REGIONENCLR = MWU_AccessWatchMask; // Disable write access watch in region[0] and PREGION[0]
            __WFE();
            __WFE();
            __NOP(); __NOP(); __NOP(); __NOP();
            // Errata 75: MWU Enable
            NRF_MWU->REGIONENSET = MWU_AccessWatchMask;
        }
        else
        {
            __WFE();
            __WFE();
            __NOP(); __NOP(); __NOP(); __NOP();
        }
        __enable_irq();
    #endif
    

  • Hi,

    I was a bit confused too; I thought Ozone was displaying the stack trace in reversed order, because it didn't match your observation otherwise. But I should have realized it was wrong after seeing the stack frame addresses. Anyway, it's good to have cleared that up.

    alison.lloyd said:
    Commenting it out made it obvious, as then the PC ends up pointing to the nrf_gpio_pin_* functions, which are setting the LEDs.  This also caused the resultant hardfault to change to an IBUSERR instead - the call stack included a call to __WFI(), which is slightly different.

    I think the address/fault change is a result of the project being rebuilt with slightly modified code. I don't see any other reason why clearing the FPU pending bit should make a difference. Edit: maybe you see the same effect if you change the compiler optimization settings?

    I see you have tagged the post with 'nRF52833'; can you confirm that this is the chip you are using? I'm asking because if you are using the 52832, you may be impacted by errata 220, as suggested.

    Edit:

    The Hardfault library stack frame snapshot now looks like this:

    The lowest bits of the PSR register (the Interrupt Program Status Register field, bits [8:0]) indicate the ISR number if the program was running in an interrupt context. Since the field reads zero, the program does not appear to have been in an interrupt handler before the fault occurred.
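
    The check described here can be sketched as a pair of small host-side helpers. This is an illustration only: the function names are hypothetical, and the input is assumed to be the xPSR word taken from the stacked exception frame, where the IPSR field occupies bits [8:0].

```c
#include <stdint.h>

/* Hypothetical helper: extracts the IPSR field (bits [8:0]) from a stacked
   xPSR value. 0 means thread mode, 3 is HardFault, 16 and up are
   external interrupts. */
uint32_t xpsr_exception_number(uint32_t xpsr)
{
    return xpsr & 0x1FFu;
}

/* Hypothetical helper: converts the exception number to a device IRQ number
   (exception 16 is IRQ 0). Negative results are core exceptions or,
   for -16, thread mode. */
int32_t xpsr_irq_number(uint32_t xpsr)
{
    return (int32_t)xpsr_exception_number(xpsr) - 16;
}
```

    A stacked PSR whose low bits are all zero therefore decodes to exception number 0, i.e. the faulting code was running in thread mode rather than in a handler.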

  • Hi Vidar and hmolesworth,

    We're definitely using the '833 - NRF52833-CJAA-A0 if it makes any difference.  https://infocenter.nordicsemi.com/topic/errata_nRF52833_Rev1/ERR/nRF52833/Rev1/latest/err_833_new.html suggests that errata 220 doesn't apply, but I'll give the code you've suggested a go when I'm next in front of the board. 

    Errata 87 does apply, however, hence the code snippet above.

    The debug seems to be pointing to the __WFI() call causing the issues here, although that may be masking something else.  That would be consistent with the issue not showing up during debug, as WFI/WFE behaviour is presumably slightly different when a debugger is attached, to not cause dropouts.

    On a related note, my understanding of how the ARM core in the nRF52 is used is that the WFI and WFE instructions are almost the same, with the difference that one normally uses WFE with SEV to ensure the core actually goes to sleep:

    __SEV();
    __WFE(); // clear SEV bit
    __WFE(); // actually go to sleep

    Further reading here: https://developer.arm.com/documentation/dui0553/a/ .

    Is this correct?  Is there a good reason to use WFE instead of WFI in the Nordic chips?

  • Ok, I've reduced hmolesworth's suggested code to:

      SCB->SCR |= SCB_SCR_SEVONPEND_Msk;
      __disable_irq();
      __WFE();
      __WFE();
      __NOP(); __NOP(); __NOP(); __NOP();
      __enable_irq();

    so we're just looking at the errata 220 stuff.  This doesn't produce a noticeable difference in behaviour - we're back to an IACCVIOL hardfault, but the behaviour is otherwise the same.  The LR and top of stack return point to probably-junk places outside normal values.

  • Hi,

    Thanks for confirming. So the 52833 is not impacted by errata 220, though I expect the workaround for it could change the behaviour slightly, like the workaround for errata 87 did, if the problem is indeed timing related.

    Could you please try to put the chip in constant latency mode on startup and see if it makes any difference? Constant latency mode (see Sub-power modes) should have a similar effect on interrupt latency as putting the chip in debug mode has (both modes keep the HF clock running in sleep).

    e.g.

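    int main(void)
    {
        // Request constant latency sub-power mode before anything else runs
        NRF_POWER->TASKS_CONSTLAT = 1;
        // ... rest of the application start-up
    }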

    With regard to __WFI vs __WFE, I think Anders gives a good summary of it in this post: https://devzone.nordicsemi.com/f/nordic-q-a/490/how-do-you-put-the-nrf51822-chip-to-sleep/2571#2571. The main difference is that __WFE enters sleep conditionally, based on the internal event register, while __WFI does not. Which one to use depends on whether the implementation is susceptible to race conditions.
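
    The event-register behaviour can be modelled on a host in a few lines. This is a deliberately simplified sketch: core_model_t, model_sev and model_wfe are hypothetical names, and the model covers only the single event register bit (no interrupts, no SEVONPEND).

```c
#include <stdbool.h>

/* Hypothetical model of the core's one-bit event register. */
typedef struct {
    bool event_register;
} core_model_t;

/* SEV: sets the event register. */
void model_sev(core_model_t *core)
{
    core->event_register = true;
}

/* WFE: if the event register is set, clear it and return immediately;
   otherwise the core would go to sleep. Returns true when it would sleep. */
bool model_wfe(core_model_t *core)
{
    if (core->event_register) {
        core->event_register = false;
        return false;
    }
    return true;
}
```

    In this model, after __SEV() the first __WFE() only clears the register and returns, and the second one is the call that sleeps. The same mechanism is what closes the race: if an event arrives just before __WFE(), the register is already set and the core does not sleep through it.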

    Edit: Would you be able to share the full source code here or in a private support ticket so I can try to debug it here, or does it require your custom HW to run it?

    Edit 2: Have you tested this on a Nordic DK? I'm starting to think that this may be a HW issue. I'd suggest going over the decoupling caps used on your board to see if they match the values we recommend in our reference design here: Reference circuitry

Children
  • Hi Vidar,

    Although our original design was incorrect with regard to the decoupling layout, this board has been updated to match configuration 6 WLCSP from the datasheet (and your link).  However, on careful examination, it looks like the cap fitted to DEC1 (C4 in the reference) is considerably too large - 2.2uF instead of 100nF.  I'll get our electronics chap to break out the fine soldering iron and change it, and let you know if that makes any difference.

    Also of note, we're running without any crystals - as we're not doing any BLE etc, the timing accuracy isn't critical, and fewer external components is hugely helpful in our application.  We're using the internal RCs for both HF and LF clocks.

    Putting the chip into constant latency mode makes the hardfault problem go away (or, at least, I haven't been able to reproduce it now).

    I can provide the code and schematic privately, although most of the code won't run without the custom hardware.  I will also need to excise some third-party code, as we're not authorised to share that, which will of course change the behaviour and timing somewhat.  Running on a DK has similar issues, although we can at least bolt on a lot of the necessary bits externally.  I'll have a look at porting the codebase over if changing the DEC1 cap doesn't improve things.

  • Hi Alison,

    I have actually seen reports of similar behaviour before for boards that have been fitted with too large a capacitor on DEC1, so I think we may finally have found a root cause. I hope this is it.

    alison.lloyd said:
    Putting the chip into constant latency mode makes the hardfault problem go away (or, at least, I haven't been able to reproduce it now).

    I think a possible explanation for this is that constant latency prevents the system from entering its lowest power state (min. idle current is ~500 uA with constant latency enabled).

  • Hi Vidar,

    I've done a number of different tests across several boards, and the DEC1 decoupling cap looks like our answer.  Dropping it to the 100nF it was supposed to be seems to have made all the hardfaults go away completely.

    I guess that the issue was more obvious at lower powers, possibly due to the cap not charging / discharging fast enough, and causing clock / rail skew, to the point where the core got really unhappy.  Hence always happening after a WFI, and being so sensitive to which peripherals were enabled or not. 

    Either way, excellent spot, and thank you for your help!  Beers on us next time you're in London...

  • Hi Alison,

    Thanks for the update! I'm glad to hear that it fixed the problem. I will definitely keep this in mind next time I encounter a hardfault like this. And I agree, too long a charging time for the cap seems to be the most likely explanation.
