Weird hardfault

I have a bit of an odd one, and would appreciate any input.  We have a fairly complex project, which does various things, but some of the time I get a hardfault.  The issue appears to be an instruction access violation (MMFSR->IACCVIOL = 1).  Working backwards, the code that caused this seems to be the nrfx_coredep_delay_us function (in nrfx_coredep.h).

We're using the 17.0.2 SDK, and the code in question is this, and the PC points to delay_machine_code:

   __ALIGN(16)
    static const uint16_t delay_machine_code[] = {
        0x3800 + NRFX_COREDEP_DELAY_US_LOOP_CYCLES, // SUBS r0, #loop_cycles
        0xd8fd, // BHI .-2
        0x4770  // BX LR
    };

    typedef void (* delay_func_t)(uint32_t);
    const delay_func_t delay_cycles =
        // Set LSB to 1 to execute the code in the Thumb mode.
        (delay_func_t)((((uint32_t)delay_machine_code) | 1));
    uint32_t cycles = time_us * NRFX_DELAY_CPU_FREQ_MHZ;
    delay_cycles(cycles);

I wasn't entirely sure about the way the code is hand-loaded in, so I tried switching to using the DWT instead (since the 52833 supports that).  This produced the same hardfault behaviour, only now it's on

while ((DWT->CYCCNT - cyccnt_initial) < time_cycles)
    {}
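
A side note on the DWT loop above: the comparison relies on unsigned 32-bit subtraction, so it should stay correct when CYCCNT rolls over.  The sketch below checks that on a host machine.  The helper names are made up for illustration, and the 64 MHz figure is an assumption about what NRFX_DELAY_CPU_FREQ_MHZ resolves to on the nRF52833.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed core clock in MHz (stand-in for NRFX_DELAY_CPU_FREQ_MHZ). */
#define ASSUMED_CPU_FREQ_MHZ 64u

/* Hypothetical helper: number of CPU cycles in a delay of time_us microseconds. */
static uint32_t us_to_cycles(uint32_t time_us)
{
    return time_us * ASSUMED_CPU_FREQ_MHZ;
}

/* Hypothetical helper: true while fewer than time_cycles have elapsed.
 * Unsigned subtraction wraps modulo 2^32, so the result is still the
 * elapsed count even if the counter rolled over after `start` was read. */
static bool delay_still_waiting(uint32_t now, uint32_t start, uint32_t time_cycles)
{
    return (uint32_t)(now - start) < time_cycles;
}
```

So a rollover in the counter is unlikely to be what upsets that loop, which fits with the fault following the delay code around rather than belonging to it.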

I suspect there's something else going on here.  Our code does use nrf_delay_ms for certain things that need busy waits, which calls through to the above code, but it only hardfaults some of the time.  As above, we're using SDK 17.0.2, nrf52833 on a custom board.

I'm at a bit of a loss on this one, suggestions welcome?

  • Hello,

    Not sure if I have encountered this type of fault before, but I found a blog post that suggests that this error is raised when the CPU attempts to execute code from a memory protected region.

    IACCVIOL - Indicates that an attempt to execute an instruction triggered an MPU or Execute Never (XN) fault. We’ll explore an example below.

     https://interrupt.memfault.com/blog/cortex-m-fault-debug

    Could this be it, or are you not using the MPU in your project?

    Best regards,

    Vidar
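
For reference, IACCVIOL is bit 0 of the MemManage status byte, which is the lowest byte of the Cortex-M CFSR register (0xE000ED28).  A minimal sketch of pulling the instruction-fetch-related causes out of a CFSR value read back from the debugger follows.  The enum and function names are hypothetical; only the bit positions come from the ARMv7-M register layout.

```c
#include <stdint.h>

/* CFSR bit positions (ARMv7-M): MMFSR is bits 7:0, BFSR 15:8, UFSR 31:16. */
#define CFSR_IACCVIOL   (1u << 0)   /* MMFSR: instruction access violation */
#define CFSR_IBUSERR    (1u << 8)   /* BFSR: instruction bus error         */
#define CFSR_UNDEFINSTR (1u << 16)  /* UFSR: undefined instruction         */

/* Hypothetical classification, for illustration only. */
typedef enum {
    FAULT_OTHER = 0,
    FAULT_IACCVIOL,
    FAULT_IBUSERR,
    FAULT_UNDEFINSTR
} fault_cause_t;

/* Return the first instruction-fetch-related cause flagged in cfsr. */
static fault_cause_t cfsr_classify(uint32_t cfsr)
{
    if (cfsr & CFSR_IACCVIOL)   { return FAULT_IACCVIOL; }
    if (cfsr & CFSR_IBUSERR)    { return FAULT_IBUSERR; }
    if (cfsr & CFSR_UNDEFINSTR) { return FAULT_UNDEFINSTR; }
    return FAULT_OTHER;
}
```

All three of these tend to show up when the core tries to fetch from an address that does not hold valid code, so which one is raised mostly depends on where the bad address happens to land.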

  • Hi Vidar,

    Thanks for the article link, I'd actually already run across it (it's super useful for figuring out the debug). 

    We are not directly using either the Cortex MPU or the Nordic MWU.  Are either of those used by any Nordic libraries or drivers?

    Best,

      Alison

  • Hi Vidar,

    I've pulled in the Nordic hardfault library, which doesn't seem to point to a stack overrun directly - if I understand correctly, the call to HardFault_process() should pass NULL if it's a stack overrun, and that doesn't seem to be happening.  I get a valid pointer to a debug stack.

    Some quick calculations: this project has a stack size of 8192 bytes (8k), which given RAM starts at 0x20000000, should give a stack placement of 0x2001E000 to 0x20020000 (128kB RAM).  At the point of the hardfault, according to Ozone, the SP (R13) is pointing to 0x2001FF28, which should still be a legal bit of stack.  It's getting close to the edge, mind, which is concerning, but not actually over.

    The pointer the Nordic hardfault library gives me is 0x2001FF80, so a little higher than where Ozone says the SP is, which seems reasonable - I guess it's pointing a little higher up the call chain.

    Unfortunately, we're really tight for RAM in this project, hence the 8kB stack.  This means I can't easily bump up the stack to see if the problem goes away.  Does this look like stack overrun issues, based on the above, or should we hunt elsewhere?
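
The arithmetic above can be written out as a small host-side sketch.  It assumes the stack sits at the very top of RAM as calculated in the post, and that it grows downward from the top address (Cortex-M uses a full-descending stack).  The function names are invented for the example.

```c
#include <stdint.h>

#define RAM_BASE_ADDR   0x20000000u
#define RAM_SIZE_BYTES  (128u * 1024u)  /* nRF52833: 128 kB RAM */
#define STACK_SIZE      8192u           /* project setting quoted in the post */

/* Highest stack address + 1; the initial SP value under the assumption above. */
static uint32_t stack_top(void)
{
    return RAM_BASE_ADDR + RAM_SIZE_BYTES;   /* 0x20020000 */
}

/* Lowest legal stack address. */
static uint32_t stack_limit(void)
{
    return stack_top() - STACK_SIZE;         /* 0x2001E000 */
}

/* Bytes currently in use for a given SP (the stack grows downward). */
static uint32_t stack_used(uint32_t sp)
{
    return stack_top() - sp;
}

/* Bytes still free below the given SP. */
static uint32_t stack_free(uint32_t sp)
{
    return sp - stack_limit();
}
```

On that reading, an SP of 0x2001FF28 means roughly 216 bytes are in use and about 7976 bytes remain, so at the moment of the fault the SP is near the start of the stack rather than near its limit.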

  • Hi,

    Your understanding is correct, the pointer will be set to NULL if there is a stack overflow, but only if there is a stack overflow while the hardfault exception is raised. And a stack overrun doesn't necessarily trigger a fault immediately (if at all). To catch a stack overflow early you can use the Stack guard library or try to set a data breakpoint at the bottom of the stack.

    Setting data breakpoint in Ozone

    Did the hardfault handler print the debug messages? It would be interesting to see the CPU register values that were pushed on stack before the hardfault exception.

  • Hi Vidar,

    Unfortunately, having any sort of debugger attached while running prevents the hardfault showing up at all, which would tend to suggest some sort of timing issue.  Using a UART has the same effect.  I can attach to the running chip after the hardfault has happened, to see the state of things, but I can't watch the fault happen (which also means not getting the debug output).

    The stack frame passed to my HardFault_process function looks like this:

    And the CPU registers (according to Ozone) are:

    As the stack guard library uses the MPU, wouldn't it basically show the same symptom, i.e. an instruction access violation, if the stack was triggered? 

    I've had a go at the old-fashioned "write DEADBEEF to the stack at start-up" method, and that suggests that this code isn't actually over-running the stack.  At the point a hardfault has occurred, connecting after the fact via Ozone shows that 0x2001E000 is all DEADBEEF still.  I've also confirmed again via MPU_CTRL that the MPU definitely doesn't think it's enabled.

    So, all in all, it doesn't look like stack overflow is the issue here.  Which brings us back to a weird hardfault that doesn't seem to have any obvious cause, but occurs repeatably.
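
For anyone wanting to repeat the DEADBEEF check, here is a minimal sketch of the idea, run against an ordinary array instead of the real stack region.  The function names are hypothetical; on target, `limit` would be the lowest stack address (0x2001E000 in this project) and the painting would have to happen very early in start-up, below the current SP.

```c
#include <stdint.h>
#include <stddef.h>

#define STACK_PAINT 0xDEADBEEFu

/* Fill `words` 32-bit words starting at the stack limit with the paint value. */
static void stack_paint(uint32_t *limit, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        limit[i] = STACK_PAINT;
    }
}

/* Count how many words, from the limit upward, still hold the paint value.
 * Zero means the stack has reached (or passed) its lowest legal address. */
static size_t stack_untouched_words(const uint32_t *limit, size_t words)
{
    size_t n = 0;
    while (n < words && limit[n] == STACK_PAINT) {
        n++;
    }
    return n;
}
```

Scanning from the limit upward gives the high-water mark directly: the number of untouched words times four is the headroom the stack has never used.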

  • Hi,

    I agree, it doesn't sound like you have had stack overflow, but the PC value (0xFB04F85C) on stack is clearly invalid, so that must be the reason for the hardfault.

    It's interesting that you're not able to replicate this in debug mode. Have you tried commenting out the sleep/idle function in your application to see if that has the same effect as being in debug mode?
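
A quick way to sanity-check a stacked PC like the one above is to test it against the device's memory map.  The sketch below assumes the nRF52833 layout of 512 kB flash at 0x00000000 and 128 kB RAM at 0x20000000, and only looks at the first eight words of the exception frame (R0-R3, R12, LR, PC, xPSR, in that order).  The type and function names are made up for the example.

```c
#include <stdint.h>
#include <stdbool.h>

/* First eight words pushed by hardware on exception entry. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} exc_frame_t;

#define FLASH_BASE_ADDR   0x00000000u
#define FLASH_SIZE_BYTES  (512u * 1024u)
#define RAM_BASE_ADDR     0x20000000u
#define RAM_SIZE_BYTES    (128u * 1024u)

/* True if addr lies inside [base, base + size). */
static bool addr_in(uint32_t addr, uint32_t base, uint32_t size)
{
    return addr >= base && (addr - base) < size;
}

/* True if the stacked PC points into flash or RAM at all. */
static bool pc_is_plausible(const exc_frame_t *frame)
{
    return addr_in(frame->pc, FLASH_BASE_ADDR, FLASH_SIZE_BYTES) ||
           addr_in(frame->pc, RAM_BASE_ADDR, RAM_SIZE_BYTES);
}
```

A PC of 0xFB04F85C fails both range checks, which is why it cannot be a real return address.  A PC inside RAM passes this test but would still be suspicious unless the application deliberately runs code from RAM.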

  • Ok, the delay_ms call was a red herring, and obvious in retrospect - that's being called from the hardfault handler (to flash an LED), so it's not where the problem is occurring.  Doh.

    Commenting it out made it obvious, as then the PC ends up pointing to the nrf_gpio_pin_* functions, which are setting the LEDs.  This also caused the resultant hardfault to change to an IBUSERR instead - the call stack included a call to __WFI(), which is slightly different.

    We do use __WFI() a fair bit to cause the CPU to sleep (and consume less power) when waiting for things to happen.  Most timing is provided via an RTC, plus a few external interrupts.  As above, we don't use a softdevice, so it shouldn't be a problem to call __WFI() directly, but a bit of digging around suggested that the following is better (we do use the FPU):

    static inline void sys_sleep (void)
    {
    #if (__FPU_USED == 1)
      __set_FPSCR(__get_FPSCR() & ~(0x0000009F)); 
      (void) __get_FPSCR();
      NVIC_ClearPendingIRQ(FPU_IRQn);
    #endif
      __WFI();
    }

    Replacing the WFI calls with that changed the hardfault to an UNDEFINSTR error instead.  Looking at the top of the call stack, I now seem to be somewhere in the heap, I think?

    So yes, that's probably not a valid instruction.  But how did we end up there in the first place?  That particular address happens to be the m_mem_pool long, from the Nordic mem_manager library (which we use to provide pseudo-dynamic memory allocation).  Looking through mem_manager.c, I don't see anything obvious that could turn into a "JMP &m_mem_pool" or similar - it's a status flags bitmap, so the usage is all setting and checking bits.

    The Hardfault library stack frame snapshot now looks like this:

    with the LR pointing to the WFI instruction in the above sys_sleep code.  So it seems to be something to do with the WFI call.

    Does anything in the above jump out at you as obviously suspicious or worth pulling at some more?

  • Assuming you haven't tried the workarounds for the LR corruption bug, try this instead of just WFI, or maybe insert it immediately before the WFI:

    #if defined(SOFTDEVICE_PRESENT) &&  (NRF_SD_BLE_API_VERSION>=7)
        // Now using sd_app_evt_wait() which has the same workaround; it is softdevice dependent when BLE is used
        sd_app_evt_wait();
    #else
        // Errata 220: CPU: RAM is not ready when written - Disable IRQ while using WFE
        // Symptoms - Memory is not written in the first cycle after wake-up
        // Consequences - The address of the next instruction is not written to the stack. In stack frame, the link register is corrupted
        // Workaround
        // ==========
        // Enable SEVONPEND to disable interrupts so the internal events that generate the interrupt cause wakeup in __WFE context and not in interrupt context
        // Before: ENABLE_WAKEUP_SOURCE -> __WFE -> WAKEUP_SOURCE_ISR -> CONTINUE_FROM_ISR  next line of __WFE
        // After:  ENABLE_WAKEUP_SOURCE -> SEVONPEND -> DISABLE_INTERRUPTS -> __WFE -> WAKEUP inside __WFE -> ENABLE_interrupts -> WAKEUP_SOURCE_ISR
        // Applications must not modify the SEVONPEND flag in the SCR register when running in priority levels higher than 6 (priority level numerical
        // values lower than 6) as this can lead to undefined behavior with SoftDevice enabled
        //
        // Errata 75: MWU: Increased current consumption
        // This has to be handled by turning off MWU but it is used in SoftDevice
        // see https://infocenter.nordicsemi.com/index.jsp?topic=%2Ferrata_nRF52832_EngB%2FERR%2FnRF52832%2FEngineeringB%2Flatest%2Fanomaly_832_75.html
        //
        // Errata 220: Enable SEVONPEND
        SCB->SCR |= SCB_SCR_SEVONPEND_Msk;
        __disable_irq();
        // Errata 75: MWU Disable
        uint32_t MWU_AccessWatchMask = NRF_MWU->REGIONEN & MWU_ACCESS_WATCH_MASK;
        // Handle MWU if any areas are enabled
        if (MWU_AccessWatchMask)
        {
            NRF_MWU->REGIONENCLR = MWU_AccessWatchMask; // Disable write access watch in region[0] and PREGION[0]
            __WFE();
            __WFE();
            __NOP(); __NOP(); __NOP(); __NOP();
            // Errata 75: MWU Enable
            NRF_MWU->REGIONENSET = MWU_AccessWatchMask;
        }
        else
        {
            __WFE();
            __WFE();
            __NOP(); __NOP(); __NOP(); __NOP();
        }
        __enable_irq();
    #endif
    

  • Hi,

    I was a bit confused too, I thought Ozone was displaying the stack trace in a reversed order, because it didn't match your observation otherwise. But I should have realized it was wrong after having seen the stack frame addresses. Anyway. It's good to have cleared that up.

    alison.lloyd said:
    Commenting it out made it obvious, as then the PC ends up pointing to the nrf_gpio_pin_* functions, which are setting the LEDs.  This also caused the resultant hardfault to change to an IBUSERR instead - the call stack included a call to __WFI(), which is slightly different.

    I think the address/fault change is a result of the project being re-built with slightly modified code. I don't see any other reason why adding clearing of the FPU pending bit should make a difference. Edit: maybe you see the same effect if you change the compiler optimization settings?

    I see you have tagged the post with 'nRF52833', can you confirm that this is the chip you are using? I'm asking because if you are using 52832, you may be impacted by errata 220 as  suggested.

    Edit:

    The Hardfault library stack frame snapshot now looks like this:

    The last byte of the PSR register (Interrupt Program Status Register field) indicates the ISR number if the program was running in an interrupt context. Since it reads zero, the program does not appear to have been in an interrupt handler before the fault occurred.
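
As a rough illustration of that check: on ARMv7-M the exception number (IPSR) occupies bits [8:0] of the stacked xPSR, and bit 24 is the Thumb state bit, which should always be set on a Cortex-M.  The helper names below are invented for the example.

```c
#include <stdint.h>
#include <stdbool.h>

/* Exception number from a stacked xPSR value (IPSR field, bits 8:0).
 * 0 means Thread mode; 3 is HardFault; 16 and up are external interrupts. */
static uint32_t xpsr_exception_number(uint32_t xpsr)
{
    return xpsr & 0x1FFu;
}

/* True if the stacked context was ordinary (non-interrupt) code. */
static bool xpsr_in_thread_mode(uint32_t xpsr)
{
    return xpsr_exception_number(xpsr) == 0u;
}

/* Thumb state bit (bit 24); a clear bit here suggests a corrupted frame. */
static bool xpsr_thumb_bit_set(uint32_t xpsr)
{
    return ((xpsr >> 24) & 1u) != 0u;
}
```

If the Thumb bit in a stacked xPSR reads zero, the frame itself is probably junk, so it is worth checking before trusting the other fields.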

  • Hi Vidar and hmolesworth,

    We're definitely using the '833 - NRF52833-CJAA-A0 if it makes any difference.  https://infocenter.nordicsemi.com/topic/errata_nRF52833_Rev1/ERR/nRF52833/Rev1/latest/err_833_new.html suggests that errata 220 doesn't apply, but I'll give the code you've suggested a go when I'm next in front of the board. 

    Errata 87 does apply, however, hence the code snippet above.

    The debug seems to be pointing to the __WFI() call causing the issues here, although that may be masking something else.  That would be consistent with the issue not showing up during debug, as WFI/WFE behaviour is presumably slightly different when a debugger is attached, to not cause dropouts.

    On a related note, my understanding of how the ARM core in the nRF52 is used is that the WFI and WFE instructions are almost the same, with the difference that one normally uses WFE with SEV to ensure the core actually goes to sleep:

    __SEV();
    __WFE(); // clear SEV bit
    __WFE(); // actually go to sleep

    Further reading here: https://developer.arm.com/documentation/dui0553/a/ .

    Is this correct?  Is there a good reason to use WFE instead of WFI in the Nordic chips?

  • Ok, I've reduced hmolesworth's suggested code to:

      SCB->SCR |= SCB_SCR_SEVONPEND_Msk;
      __disable_irq();
      __WFE();
      __WFE();
      __NOP(); __NOP(); __NOP(); __NOP();
      __enable_irq();

    so we're just looking at the errata 220 stuff.  This doesn't produce a noticeable difference in behaviour - we're back to an IACCVIOL hardfault, but the behaviour is otherwise the same.  The LR and top of stack return point to probably-junk places outside normal values.

  • Hi,

    Thanks for confirming. So the 52833 is not impacted by errata 220, though I expect the workaround for it could change the behaviour slightly, like the workaround for errata 87 did, if the problem is indeed timing related.

    Could you please try to put the chip in constant latency mode on startup and see if it makes any difference? Constant latency mode (see Sub-power modes) should have a similar effect on interrupt latency as putting the chip in debug mode has (both modes keep the HF clock running in sleep).

    e.g.

    void main(void)
    {
        NRF_POWER->TASKS_CONSTLAT = 1;

    With regards to __WFI vs __WFE, I think Anders gives a good summary of it in this post: https://devzone.nordicsemi.com/f/nordic-q-a/490/how-do-you-put-the-nrf51822-chip-to-sleep/2571#2571. The main difference is basically that __WFE enters sleep conditionally based on the internal event register, while __WFI does not. The reason for using one over the other is whether the implementation will be subject to race conditions or not.
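
To make the event-register behaviour concrete, here is a toy host-side model (not real hardware access).  It assumes only the documented semantics: SEV sets the event register, WFE returns immediately and clears the register if it is set, and otherwise sleeps, while WFI ignores the register.  All names are hypothetical.

```c
#include <stdbool.h>

/* Toy model of the core's one-bit event register plus a sleep counter. */
typedef struct {
    bool     event_reg;  /* set by SEV or by a wake-up event */
    unsigned sleeps;     /* how many times the model actually went to sleep */
} core_model_t;

/* SEV: set the event register. */
static void model_sev(core_model_t *core)
{
    core->event_reg = true;
}

/* WFE: consume a pending event without sleeping, otherwise sleep. */
static void model_wfe(core_model_t *core)
{
    if (core->event_reg) {
        core->event_reg = false;
        return;
    }
    core->sleeps++;  /* real hardware would wait here for an event */
}

/* WFI: always sleeps; the event register is not consulted. */
static void model_wfi(core_model_t *core)
{
    core->sleeps++;  /* real hardware would wait here for an interrupt */
}
```

In this model a lone WFE falls straight through whenever a stale event is pending, whereas the SEV, WFE, WFE sequence sleeps exactly once whatever state the register started in.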

    Edit: Would you be able to share the full source code here or in a private support ticket so I can try to debug it here, or does it require your custom HW to run it?

    Edit 2: Have you tested this on a Nordic DK? I'm starting to think that this may be a HW issue. I'd suggest going over the decoupling caps used on your board to see if they match the values we recommend in our reference design here:  Reference circuitry
