Weird hardfault

I have a bit of an odd one, and would appreciate any input.  We have a fairly complex project, which does various things, but some of the time, I get a hardfault.  The issue appears to be an instruction access violation (MMFSR->IACCVIOL = 1).  Working backwards, the code that caused this seems to be the nrfx_coredep_delay_us function (in nrfx_coredep.h).

We're using the 17.0.2 SDK.  The code in question is below, and the PC points to delay_machine_code:

    __ALIGN(16)
    static const uint16_t delay_machine_code[] = {
        0x3800 + NRFX_COREDEP_DELAY_US_LOOP_CYCLES, // SUBS r0, #loop_cycles
        0xd8fd, // BHI .-2
        0x4770  // BX LR
    };

    typedef void (* delay_func_t)(uint32_t);
    const delay_func_t delay_cycles =
        // Set LSB to 1 to execute the code in the Thumb mode.
        (delay_func_t)((((uint32_t)delay_machine_code) | 1));
    uint32_t cycles = time_us * NRFX_DELAY_CPU_FREQ_MHZ;
    delay_cycles(cycles);

I wasn't entirely sure about the way the code is hand-loaded in, so I tried switching to using the DWT instead (since the 52833 supports that).  This produced the same hardfault behaviour, only now it's on:

    while ((DWT->CYCCNT - cyccnt_initial) < time_cycles)
    {}
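
The busy-wait above relies on unsigned 32-bit subtraction staying correct when CYCCNT wraps past 0xFFFFFFFF. A minimal host-side sketch of that arithmetic follows. The helper names `cycles_elapsed` and `delay_done` are hypothetical (they are not part of nrfx), and on target the cycle counter is assumed to have been enabled already:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: cycles between two CYCCNT samples.
 * Unsigned subtraction is modulo 2^32, so a single wrap is handled. */
uint32_t cycles_elapsed(uint32_t start, uint32_t now)
{
    return now - start;
}

/* Hypothetical helper: the inverse of the loop condition in the post. */
bool delay_done(uint32_t start, uint32_t now, uint32_t time_cycles)
{
    return cycles_elapsed(start, now) >= time_cycles;
}
```

The comparison only holds for delays shorter than one full counter period (2^32 cycles), which is not a practical limit for microsecond or millisecond busy waits.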

I suspect there's something else going on here.  Our code does use nrf_delay_ms for certain things that need busy waits, which calls through to the above code, but it only hardfaults some of the time.  As above, we're using SDK 17.0.2 and an nRF52833 on a custom board.

I'm at a bit of a loss on this one, suggestions welcome?

  • Hello,

    Not sure if I have encountered this type of fault before, but I found a blog post that suggests that this error is raised when the CPU attempts to execute code from a memory protected region.

    IACCVIOL - Indicates that an attempt to execute an instruction triggered an MPU or Execute Never (XN) fault. We’ll explore an example below.

     https://interrupt.memfault.com/blog/cortex-m-fault-debug

    Could this be it, or are you not using the MPU in your project?
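
For reference, the instruction-side fault flags that come up in this thread all live in the Configurable Fault Status Register. The sketch below decodes them on a host. The bit positions are those defined by the ARMv7-M architecture, but the helper name `cfsr_instruction_fault` is hypothetical, and on target the value would be read from SCB->CFSR:

```c
#include <stddef.h>
#include <stdint.h>

/* CFSR bit positions as defined by the ARMv7-M architecture. */
#define CFSR_IACCVIOL   (1UL << 0)   /* MMFSR: fetch from an XN or MPU-protected region */
#define CFSR_IBUSERR    (1UL << 8)   /* BFSR: bus error on an instruction fetch */
#define CFSR_UNDEFINSTR (1UL << 16)  /* UFSR: undefined instruction executed */
#define CFSR_INVSTATE   (1UL << 17)  /* UFSR: invalid execution state (Thumb bit clear) */
#define CFSR_INVPC      (1UL << 18)  /* UFSR: invalid PC load on exception return */

/* Hypothetical helper: name the first instruction-side fault bit that is set,
 * or return NULL if none of them is set. */
const char *cfsr_instruction_fault(uint32_t cfsr)
{
    if (cfsr & CFSR_IACCVIOL)   return "IACCVIOL";
    if (cfsr & CFSR_IBUSERR)    return "IBUSERR";
    if (cfsr & CFSR_UNDEFINSTR) return "UNDEFINSTR";
    if (cfsr & CFSR_INVSTATE)   return "INVSTATE";
    if (cfsr & CFSR_INVPC)      return "INVPC";
    return NULL;
}
```

All five flags are typical outcomes of the core branching to a corrupted address, so seeing the fault type change between builds does not necessarily mean the underlying cause has changed.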

    Best regards,

    Vidar

  • Hi Vidar,

    Thanks for the article link, I'd actually already run across it (it's super useful for figuring out the debug). 

    We are not directly using either the Cortex MPU or the Nordic MWU.  Are either of those used by any Nordic libraries or drivers?

    Best,

      Alison

  • Hi,

    Your understanding is correct, the pointer will be set to NULL if there is a stack overflow, but only if there is a stack overflow while the hardfault exception is raised. And a stack overrun doesn't necessarily trigger a fault immediately (if at all). To catch a stack overflow early you can use the Stack guard library or try to set a data breakpoint at the bottom of the stack.

    Setting data breakpoint in Ozone

    Did the hardfault handler print the debug messages? It would be interesting to see the CPU register values that were pushed on stack before the hardfault exception.

  • Hi Vidar,

    Unfortunately, having any sort of debugger attached while running prevents the hardfault showing up at all, which would tend to suggest some sort of timing issue.  Using a UART has the same effect.  I can attach to the running chip after the hardfault has happened, to see the state of things, but I can't watch the fault happen (which also means not getting the debug output).

    The stack frame passed to my HardFault_process function looks like this:

    And the CPU registers (according to Ozone) are:

    As the stack guard library uses the MPU, wouldn't it basically show the same symptom, i.e. an instruction access violation, if the stack guard was triggered?

    I've had a go at the old-fashioned "write DEADBEEF to the stack at start-up" method, and that suggests that this code isn't actually over-running the stack.  At the point a hardfault has occurred, connecting after the fact via Ozone shows that 0x2001E000 is all DEADBEEF still.  I've also confirmed again via MPU_CTRL that the MPU definitely doesn't think it's enabled.

    So, all in all, it doesn't look like stack overflow is the issue here.  Which brings us back to a weird hardfault that doesn't seem to have any obvious cause, but occurs repeatably.
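
The "write DEADBEEF to the stack" check described above can be sketched roughly as follows. This is a host-side illustration only: `stack_paint` and `stack_unused_words` are hypothetical names, and on target the region would be the unused part of the stack below the current stack pointer, not an ordinary array:

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_PAINT 0xDEADBEEFu

/* Fill a stack region with the marker. 'lowest' is the lowest address of the
 * region; Cortex-M stacks grow downward, so this end is consumed last. */
void stack_paint(uint32_t *lowest, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        lowest[i] = STACK_PAINT;
    }
}

/* Count marker words still intact, scanning upward from the lowest address.
 * Zero means the stack reached (or passed) the end of the region. */
size_t stack_unused_words(const uint32_t *lowest, size_t words)
{
    size_t n = 0;
    while (n < words && lowest[n] == STACK_PAINT) {
        n++;
    }
    return n;
}
```

A non-zero count after a fault, as reported in the post, suggests the stack never grew into the painted area. It cannot rule out a write that skipped over the markers.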

  • Hi,

    I agree, it doesn't sound like you have had stack overflow, but the PC value (0xFB04F85C) on stack is clearly invalid, so that must be the reason for the hardfault.

    It's interesting that you're not able to replicate this in debug mode. Have you tried to comment out the sleep/idle function in your application to see if that may have the same effect as being in debug mode?
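
One possible reading of why a junk PC such as 0xFB04F85C reports IACCVIOL even with the MPU disabled: in the ARMv7-M default memory map, the Peripheral, External device and System regions are Execute Never, so an instruction fetch from them raises a MemManage fault regardless of MPU_CTRL. A minimal host-side sketch of that classification, with `addr_is_xn_by_default` as a hypothetical helper name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the ARMv7-M default memory map marks the
 * address as Execute Never (XN), independent of any MPU configuration. */
bool addr_is_xn_by_default(uint32_t addr)
{
    if (addr >= 0xA0000000u) {
        return true;    /* External device (0xA0000000) and System (0xE0000000) regions */
    }
    if (addr >= 0x40000000u && addr < 0x60000000u) {
        return true;    /* Peripheral region */
    }
    return false;       /* Code, SRAM and external RAM regions are executable */
}
```

Under this reading, a corrupted return address that happens to land in RAM would be fetched and would fault as an undefined instruction instead. That would match the UNDEFINSTR variant reported later in the thread.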

  • Ok, the delay_ms call was a red herring, and obvious in retrospect - that's being called from the hardfault handler (to flash an LED), so it's not where the problem is occurring.  Doh.

    Commenting it out made it obvious, as then the PC ends up pointing to the nrf_gpio_pin_* functions, which are setting the LEDs.  This also caused the resultant hardfault to change to an IBUSERR instead - the call stack included a call to __WFI(), which is slightly different.

    We do use __WFI() a fair bit to cause the CPU to sleep (and consume less power) when waiting for things to happen.  Most timing is provided via an RTC, plus a few external interrupts.  As above, we don't use a softdevice, so it shouldn't be a problem to call __WFI() directly, but a bit of digging around suggested that the following is better (we do use the FPU):

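    static inline void sys_sleep (void)
    {
    #if (__FPU_USED == 1)
      // Clear the FPU exception flags (IOC, DZC, OFC, UFC, IXC, IDC) and the
      // pending FPU interrupt, which can otherwise keep the CPU from staying asleep.
      __set_FPSCR(__get_FPSCR() & ~(0x0000009F));
      (void) __get_FPSCR();
      NVIC_ClearPendingIRQ(FPU_IRQn);
    #endif
      __WFI();
    }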

    Replacing the WFI calls with that changed the hardfault to an UNDEFINSTR error instead.  Looking at the top of the call stack, I now seem to be somewhere in the heap, I think?

    So yes, that's probably not a valid instruction.  But how did we end up there in the first place?  That particular address happens to be the m_mem_pool long, from the Nordic mem_manager library (which we use to provide pseudo-dynamic memory allocation).  Looking through mem_manager.c, I don't see anything obvious that could turn into a "JMP &m_mem_pool" or similar - it's a status flags bitmap, so the usage is all setting and checking bits.

    The Hardfault library stack frame snapshot now looks like this:

    with the LR pointing to the WFI instruction in the above sys_sleep code.  So it seems to be something to do with the WFI call.

    Does anything in the above jump out at you as obviously suspicious or worth pulling at some more?

  • Assuming you haven't tried the workarounds for the LR corruption bug, try this instead of just WFI, or maybe insert it immediately before the WFI:

    #if defined(SOFTDEVICE_PRESENT) &&  (NRF_SD_BLE_API_VERSION>=7)
        // Now using sd_app_evt_wait() which has the same workaround; it is SoftDevice dependent when BLE is used
        sd_app_evt_wait();
    #else
        // Errata 220: CPU: RAM is not ready when written - Disable IRQ while using WFE
        // Symptoms - Memory is not written in the first cycle after wake-up
        // Consequences - The address of the next instruction is not written to the stack. In stack frame, the link register is corrupted
        // Workaround
        // ==========
        // Enable SEVONPEND to disable interrupts so the internal events that generate the interrupt cause wakeup in __WFE context and not in interrupt context
        // Before: ENABLE_WAKEUP_SOURCE -> __WFE -> WAKEUP_SOURCE_ISR -> CONTINUE_FROM_ISR  next line of __WFE
        // After:  ENABLE_WAKEUP_SOURCE -> SEVONPEND -> DISABLE_INTERRUPTS -> __WFE -> WAKEUP inside __WFE -> ENABLE_interrupts -> WAKEUP_SOURCE_ISR
        // Applications must not modify the SEVONPEND flag in the SCR register when running in priority levels higher than 6 (priority level numerical
        // values lower than 6) as this can lead to undefined behavior with SoftDevice enabled
        //
        // Errata 75: MWU: Increased current consumption
        // This has to be handled by turning off MWU but it is used in SoftDevice
        // see https://infocenter.nordicsemi.com/index.jsp?topic=%2Ferrata_nRF52832_EngB%2FERR%2FnRF52832%2FEngineeringB%2Flatest%2Fanomaly_832_75.html
        //
        // Errata 220: Enable SEVONPEND
        SCB->SCR |= SCB_SCR_SEVONPEND_Msk;
        __disable_irq();
        // Errata 75: MWU Disable
        uint32_t MWU_AccessWatchMask = NRF_MWU->REGIONEN & MWU_ACCESS_WATCH_MASK;
        // Handle MWU if any areas are enabled
        if (MWU_AccessWatchMask)
        {
            NRF_MWU->REGIONENCLR = MWU_AccessWatchMask; // Disable write access watch in region[0] and PREGION[0]
            __WFE();
            __WFE();
            __NOP(); __NOP(); __NOP(); __NOP();
            // Errata 75: MWU Enable
            NRF_MWU->REGIONENSET = MWU_AccessWatchMask;
        }
        else
        {
            __WFE();
            __WFE();
            __NOP(); __NOP(); __NOP(); __NOP();
        }
        __enable_irq();
    #endif
    

  • Ok, I've reduced hmolesworth's suggested code to:

      SCB->SCR |= SCB_SCR_SEVONPEND_Msk;
      __disable_irq();
      __WFE();
      __WFE();
      __NOP(); __NOP(); __NOP(); __NOP();
      __enable_irq();

    so we're just looking at the errata 220 stuff.  This doesn't produce a noticeable difference in behaviour - we're back to an IACCVIOL hardfault, but the behaviour is otherwise the same.  The LR and top of stack return point to probably-junk places outside normal values.

  • Hi,

    Thanks for confirming. So the 52833 is not impacted by errata 220. Though I expect the workaround for it could change the behaviour slightly, like the workaround for errata 87 did, if the problem is indeed timing related.

    Could you please try to put the chip in constant latency mode on startup and see if it makes any difference? Constant latency mode (see Sub-power modes) should have a similar effect on interrupt latency as putting the chip in debug mode has (both modes keep the HF clock running in sleep).

    e.g.

    void main(void)
    {
        NRF_POWER->TASKS_CONSTLAT = 1;

    With regards to __WFI vs __WFE, I think Anders gives a good summary of it in this post: https://devzone.nordicsemi.com/f/nordic-q-a/490/how-do-you-put-the-nrf51822-chip-to-sleep/2571#2571. The main difference is basically that __WFE enters sleep conditionally based on the internal event register, while __WFI does not. Which one to use depends on whether the implementation would be subject to race conditions.
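
The difference between the two instructions can be illustrated with a deliberately simplified host-side model of the one-bit event register. The `core_model_t` type and the `model_wfe` / `model_wfi` functions are hypothetical names for this sketch. Real hardware has more wake-up conditions, for example an interrupt that is already pending:

```c
#include <stdbool.h>

/* Simplified model of the core's one-bit event register. */
typedef struct {
    bool event_reg;
} core_model_t;

/* WFE: if the event register is set, clear it and continue without sleeping;
 * otherwise wait. Returns true if the core would go to sleep. */
bool model_wfe(core_model_t *core)
{
    if (core->event_reg) {
        core->event_reg = false;
        return false;
    }
    return true;
}

/* WFI: does not consult or modify the event register in this model. */
bool model_wfi(core_model_t *core)
{
    (void)core;
    return true;
}
```

In this model an event recorded just before `model_wfe` is not lost, because the first call returns immediately and only the second call sleeps. The second call sleeping is what a back-to-back `__WFE(); __WFE();` pair relies on.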

    Edit: Would you be able to share the full source code here or in a private support ticket so I can try to debug it here, or does it require your custom HW to run it?

    Edit 2: Have you tested this on a Nordic DK? I'm starting to think that this may be a HW issue. I'd suggest going over the decoupling caps used on your board to see if they match the values we recommend in our reference design here: Reference circuitry

  • Hi Vidar,

    Although our original design was incorrect with regard to the decoupling layout, this board has been updated to match configuration 6 WLCSP from the datasheet (and your link).  However, on careful examination, it looks like the cap fitted to DEC1 (C4 in the reference) is considerably too large - 2.2uF instead of 100nF.  I'll get our electronics chap to break out the fine soldering iron and change it, and let you know if that makes any difference.

    Also of note, we're running without any crystals - as we're not doing any BLE etc, the timing accuracy isn't critical, and fewer external components is hugely helpful in our application.  We're using the internal RCs for both HF and LF clocks.

    Putting the chip into constant latency mode makes the hardfault problem go away (or, at least, I haven't been able to reproduce it now).

    I can provide the code and schematic privately, although most of the code won't run without the custom hardware.  I will also need to excise some third-party code, as we're not authorised to share that, which will of course change the behaviour and timing somewhat.  Running on a DK has similar issues, although we can at least bolt on a lot of the necessary bits externally.  I'll have a look at porting the codebase over if changing the DEC1 cap doesn't improve things.

  • Hi Alison,

    I have actually seen reports of similar behaviour before for boards that have been fitted with a too large capacitor on DEC1, so I think we may finally have found a root cause. I hope this is it.

    "Putting the chip into constant latency mode makes the hardfault problem go away (or, at least, I haven't been able to reproduce it now)."

    I think a possible explanation for this is that constant latency prevents the system from entering its lowest power state (min. idle current is ~500 uA with constant latency enabled).

  • Hi Vidar,

    I've done a number of different tests across several boards, and the DEC1 decoupling cap looks like our answer.  Dropping it to the 100nF it was supposed to be seems to have made all the hardfaults go away completely.

    I guess that the issue was more obvious at lower powers, possibly due to the cap not charging / discharging fast enough, and causing clock / rail skew, to the point where the core got really unhappy.  Hence always happening after a WFI, and being so sensitive to which peripherals were enabled or not. 

    Either way, excellent spot, and thank you for your help!  Beers on us next time you're in London...
