Wierd hardfault

RE: Wierd hardfault

Vidar Berg — Mon, 21 Jun 2021 18:26:07 GMT

Hi Alison,

Thanks for the update! I'm glad to hear that it fixed the problem. I will definitely keep this in mind next time I encounter a hardfault like this. And I agree, an overly long charging time for the cap seems to be the most likely explanation.

RE: Wierd hardfault

alison.lloyd — Mon, 21 Jun 2021 14:50:38 GMT

Hi Vidar,

I've done a number of different tests across several boards, and the DEC1 decoupling cap looks like our answer. Dropping it to the 100nF it was supposed to be seems to have made all the hardfaults go away completely.

I guess that the issue was more obvious at lower powers, possibly due to the cap not charging / discharging fast enough, and causing clock / rail skew, to the point where the core got really unhappy. Hence always happening after a WFI, and being so sensitive to which peripherals were enabled or not.

Either way, excellent spot, and thank you for your help! Beers on us next time you're in London...

RE: Wierd hardfault

Vidar Berg — Tue, 15 Jun 2021 14:24:52 GMT

Hi Alison,

I have actually seen reports of similar behaviour before for boards that have been fitted with too large a capacitor on DEC1, so I think we may finally have found the root cause. I hope this is it.

[quote userid="52927" url="~/f/nordic-q-a/75722/wierd-hardfault/315444#315444"]Putting the chip into constant latency mode makes the hardfault problem go away (or, at least, I haven't been able to reproduce it now).[/quote]

I think a possible explanation for this is that constant latency prevents the system from entering its lowest power state (min. idle current is ~500 uA with constant latency enabled).

RE: Wierd hardfault

alison.lloyd — Tue, 15 Jun 2021 14:15:17 GMT

Hi Vidar,

Although our original design was incorrect with regard to the decoupling layout, this board has been updated to match configuration 6 WLCSP from the datasheet (and your link). However, on careful examination, it looks like the cap fitted to DEC1 (C4 in the reference) is considerably too large - 2.2uF instead of 100nF. I'll get our electronics chap to break out the fine soldering iron and change it, and let you know if that makes any difference.

Also of note, we're running without any crystals - as we're not doing any BLE etc, the timing accuracy isn't critical, and fewer external components is hugely helpful in our application. We're using the internal RCs for both HF and LF clocks.

Putting the chip into constant latency mode makes the hardfault problem go away (or, at least, I haven't been able to reproduce it now).

I can provide the code and schematic privately, although most of the code won't run without the custom hardware. I will also need to excise some third-party code, as we're not authorised to share that, which will of course change the behaviour and timing somewhat. Running on a DK has similar issues, although we can at least bolt on a lot of the necessary bits externally. I'll have a look at porting the codebase over if changing the DEC1 cap doesn't improve things.

RE: Wierd hardfault

Vidar Berg — Tue, 15 Jun 2021 11:07:28 GMT

Hi,

Thanks for confirming. So the 52833 is not impacted by errata 220. Though I expect the workaround for it could change the behaviour slightly, like the workaround for errata 87 did, if the problem is indeed timing related.

Could you please try to put the chip in constant latency mode on startup and see if it makes any difference? Constant latency mode (see Sub-power modes) should have a similar effect on interrupt latency as putting the chip in debug mode has (both modes keep the HF clock running in sleep).

e.g.

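int main(void)
{
    // Request constant latency mode first thing after reset.
    NRF_POWER->TASKS_CONSTLAT = 1;
    // ... rest of the application start-up ...
}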

With regard to __WFI vs __WFE, I think Anders gives a good summary of it in this post: https://devzone.nordicsemi.com/f/nordic-q-a/490/how-do-you-put-the-nrf51822-chip-to-sleep/2571#2571. The main difference is basically that __WFE enters sleep conditionally based on the internal event register, while __WFI does not. The reason for using one over the other is whether the implementation will be subject to race conditions or not.
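To make the event-register difference concrete, here is a small host-side model in plain C. It is only a sketch of the behaviour described above, not target code: `core_model_t`, `model_sev`, `model_wfe` and `model_wfi` are hypothetical names, and the model deliberately ignores pending interrupts and wake-up sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the core state that matters for WFE/WFI (hypothetical type). */
typedef struct {
    bool     event_register; /* one-bit latch set by SEV or by an event */
    unsigned sleeps;         /* how many times the model actually slept */
} core_model_t;

/* SEV: set the event register. */
void model_sev(core_model_t *core)
{
    assert(core != NULL);
    core->event_register = true;
}

/* WFE: if the event register is set, clear it and return without sleeping;
 * otherwise sleep. Returns true if the model went to sleep. */
bool model_wfe(core_model_t *core)
{
    assert(core != NULL);
    if (core->event_register) {
        core->event_register = false;
        return false;
    }
    core->sleeps++;
    return true;
}

/* WFI: sleeps regardless of the event register (simplified: no pending IRQ). */
bool model_wfi(core_model_t *core)
{
    assert(core != NULL);
    core->sleeps++;
    return true;
}
```

In this model, a SEV followed by two WFE calls sleeps only on the second call, because the first one merely consumes the latched event. That is the reason for the SEV / WFE / WFE pattern quoted elsewhere in the thread.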

Edit: Would you be able to share the full source code here or in a private support ticket so I can try to debug it here, or does it require your custom HW to run it?

Edit 2: Have you tested this on a Nordic DK? I'm starting to think that this may be a HW issue. I'd suggest going over the decoupling caps used on your board to see if they match the values we recommend in our reference design here: Reference circuitry

RE: Wierd hardfault

alison.lloyd — Tue, 15 Jun 2021 10:57:10 GMT

Ok, I've reduced hmolesworth's suggested code to:

  SCB->SCR |= SCB_SCR_SEVONPEND_Msk;
  __disable_irq();
  __WFE();
  __WFE();
  __NOP(); __NOP(); __NOP(); __NOP();
  __enable_irq();

so we're just looking at the errata 220 stuff. This doesn't produce a noticeable difference in behaviour - we're back to an IACCVIOL hardfault, but the behaviour is otherwise the same. The LR and top of stack return point to probably-junk places outside normal values.

RE: Wierd hardfault

alison.lloyd — Mon, 14 Jun 2021 15:17:48 GMT

Hi Vidar and hmolesworth,

We're definitely using the '833 - NRF52833-CJAA-A0 if it makes any difference. https://infocenter.nordicsemi.com/topic/errata_nRF52833_Rev1/ERR/nRF52833/Rev1/latest/err_833_new.html suggests that errata 220 doesn't apply, but I'll give the code you've suggested a go when I'm next in front of the board.

Errata 87 does apply, however, hence the code snippet above.

The debug seems to be pointing to the __WFI() call causing the issues here, although that may be masking something else. That would be consistent with the issue not showing up during debug, as WFI/WFE behaviour is presumably slightly different when a debugger is attached, to not cause dropouts.

On a related note, my understanding of how the ARM core in the nRF52 is used is that the WFI and WFE instructions are almost the same, with the difference that one normally uses WFE with SEV to ensure the core actually goes to sleep:

__SEV();
__WFE(); // clear SEV bit
__WFE(); // actually go to sleep

Further reading here: https://developer.arm.com/documentation/dui0553/a/ .

Is this correct? Is there a good reason to use WFE instead of WFI in the Nordic chips?

RE: Wierd hardfault

Vidar Berg — Mon, 14 Jun 2021 11:09:02 GMT

Hi,

I was a bit confused too, I thought Ozone was displaying the stack trace in a reversed order, because it didn't match your observation otherwise. But I should have realized it was wrong after having seen the stack frame addresses. Anyway. It's good to have cleared that up.

[quote user="alison.lloyd"]Commenting it out made it obvious, as then the PC ends up pointing to the nrf_gpio_pin_* functions, which are setting the LEDs. This also caused the resultant hardfault to change to an IBUSERR instead - the call stack included a call to __WFI(), which is slightly different.[/quote]

I think the address/fault change is a result of the project being re-built with slightly modified code. I don't see any other reason why adding clearing of the FPU pending bit should make a difference. Edit: maybe you see the same effect if you change the compiler optimization settings?

I see you have tagged the post with 'nRF52833', can you confirm that this is the chip you are using? I'm asking because if you are using 52832, you may be impacted by errata 220 as hmolesworth suggested.

Edit:

[quote userid="52927" url="~/f/nordic-q-a/75722/wierd-hardfault/314981#314981"]The Hardfault library stack frame snapshot now looks like this:[/quote]

The lowest byte of the PSR register (the Interrupt Program Status Register field) indicates the ISR number if the program was running in an interrupt context. Since it reads zero, it does not appear to have been in an interrupt handler before the fault occurred.
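As an illustration of that check, the sketch below extracts the exception number from a stacked xPSR value on the host. It assumes the ARMv7-M layout, where the IPSR field is bits [8:0] and external interrupts start at exception number 16; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XPSR_IPSR_MASK   0x1FFu /* IPSR field: bits [8:0] of xPSR */
#define FIRST_EXTERNAL   16u    /* exception numbers 16 and up are IRQ0, IRQ1, ... */

/* Exception number that was active when the frame was stacked (0 = thread mode). */
uint32_t xpsr_exception_number(uint32_t xpsr)
{
    return xpsr & XPSR_IPSR_MASK;
}

/* True if the faulting code was ordinary thread-mode code, not a handler. */
bool xpsr_in_thread_mode(uint32_t xpsr)
{
    return xpsr_exception_number(xpsr) == 0u;
}

/* Device IRQ number for an external interrupt; only valid for numbers >= 16. */
uint32_t xpsr_irq_number(uint32_t xpsr)
{
    uint32_t exception = xpsr_exception_number(xpsr);
    assert(exception >= FIRST_EXTERNAL);
    return exception - FIRST_EXTERNAL;
}
```

A stacked xPSR whose low nine bits are all zero therefore means the fault was taken from thread mode, which matches the reading above.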

RE: Wierd hardfault

hmolesworth — Fri, 11 Jun 2021 19:47:13 GMT

Assuming you haven't tried the workarounds for the LR corruption bug, try this instead of just WFI, or maybe insert it immediately before the WFI:

#if defined(SOFTDEVICE_PRESENT) &&  (NRF_SD_BLE_API_VERSION>=7)
    // Now using sd_app_evt_wait() which has the same workaround; it is softdevice dependent when BLE is used
    sd_app_evt_wait();
#else
    // Errata 220: CPU: RAM is not ready when written - Disable IRQ while using WFE
    // Symptoms - Memory is not written in the first cycle after wake-up
    // Consequences - The address of the next instruction is not written to the stack. In stack frame, the link register is corrupted
    // Workaround
    // ==========
    // Enable SEVONPEND and disable interrupts so that the internal events that generate the interrupt cause wakeup in __WFE context and not in interrupt context
    // Before: ENABLE_WAKEUP_SOURCE -> __WFE -> WAKEUP_SOURCE_ISR -> CONTINUE_FROM_ISR  next line of __WFE
    // After:  ENABLE_WAKEUP_SOURCE -> SEVONPEND -> DISABLE_INTERRUPTS -> __WFE -> WAKEUP inside __WFE -> ENABLE_interrupts -> WAKEUP_SOURCE_ISR
    // Applications must not modify the SEVONPEND flag in the SCR register when running in priority levels higher than 6 (priority level numerical
    // values lower than 6) as this can lead to undefined behavior with SoftDevice enabled
    //
    // Errata 75: MWU: Increased current consumption
    // This has to be handled by turning off MWU but it is used in SoftDevice
    // see https://infocenter.nordicsemi.com/index.jsp?topic=%2Ferrata_nRF52832_EngB%2FERR%2FnRF52832%2FEngineeringB%2Flatest%2Fanomaly_832_75.html
    //
    // Errata 220: Enable SEVONPEND
    SCB->SCR |= SCB_SCR_SEVONPEND_Msk;
    __disable_irq();
    // Errata 75: MWU Disable
    uint32_t MWU_AccessWatchMask = NRF_MWU->REGIONEN & MWU_ACCESS_WATCH_MASK;
    // Handle MWU if any areas are enabled
    if (MWU_AccessWatchMask)
    {
        NRF_MWU->REGIONENCLR = MWU_AccessWatchMask; // Disable write access watch in region[0] and PREGION[0]
        __WFE();
        __WFE();
        __NOP(); __NOP(); __NOP(); __NOP();
        // Errata 75: MWU Enable
        NRF_MWU->REGIONENSET = MWU_AccessWatchMask;
    }
    else
    {
        __WFE();
        __WFE();
        __NOP(); __NOP(); __NOP(); __NOP();
    }
    __enable_irq();
#endif

RE: Wierd hardfault

alison.lloyd — Fri, 11 Jun 2021 18:00:41 GMT

Ok, the delay_ms call was a red herring, and obvious in retrospect - that's being called from the hardfault handler (to flash an LED), so it's not where the problem is occurring. Doh.

Commenting it out made it obvious, as then the PC ends up pointing to the nrf_gpio_pin_* functions, which are setting the LEDs. This also caused the resultant hardfault to change to an IBUSERR instead - the call stack included a call to __WFI(), which is slightly different.

We do use __WFI() a fair bit to cause the CPU to sleep (and consume less power) when waiting for things to happen. Most timing is provided via an RTC, plus a few external interrupts. As above, we don't use a softdevice, so it shouldn't be a problem to call __WFI() directly, but a bit of digging around suggested that the following is better (we do use the FPU):

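static inline void sys_sleep (void)
{
#if (__FPU_USED == 1)
  // Clear the FPU exception flags in FPSCR (IOC, DZC, OFC, UFC, IXC, IDC = 0x9F).
  __set_FPSCR(__get_FPSCR() & ~(0x0000009F));
  // Read back so the write has taken effect before the pending IRQ is cleared.
  (void) __get_FPSCR();
  NVIC_ClearPendingIRQ(FPU_IRQn);
#endif
  __WFI();
}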

Replacing the WFI calls with that changed the hardfault to an UNDEFINSTR error instead. Looking at the top of the call stack, I now seem to be somewhere in the heap, I think?

So yes, that's probably not a valid instruction. But how did we end up there in the first place? That particular address happens to be the m_mem_pool long, from the Nordic mem_manager library (which we use to provide pseudo-dynamic memory allocation). Looking through mem_manager.c, I don't see anything obvious that could turn into a "JMP &m_mem_pool" or similar - it's a status flags bitmap, so the usage is all setting and checking bits.

The Hardfault library stack frame snapshot now looks like this:

with the LR pointing to the WFI instruction in the above sys_sleep code. So it seems to be something to do with the WFI call.

Does anything in the above jump out at you as obviously suspicious or worth pulling at some more?

RE: Wierd hardfault

Vidar Berg — Fri, 11 Jun 2021 11:12:22 GMT

Hi,

I agree, it doesn't sound like you have had stack overflow, but the PC value (0xFB04F85C) on stack is clearly invalid, so that must be the reason for the hardfault.
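One way to see why that PC value cannot be right is to look it up in the ARMv7-M default memory map, where each 512 MB slice of the address space has a fixed type and some slices are execute-never (XN). The sketch below is host-side C under that assumption, with hypothetical names (`map_region_t`, `default_map_lookup`); it does not model an enabled MPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    const char *name;
    bool        execute_never; /* instruction fetch from here faults */
} map_region_t;

/* ARMv7-M default memory map, one entry per 512 MB slice. */
static const map_region_t default_map[] = {
    { "Code",            false }, /* 0x00000000 - 0x1FFFFFFF */
    { "SRAM",            false }, /* 0x20000000 - 0x3FFFFFFF */
    { "Peripheral",      true  }, /* 0x40000000 - 0x5FFFFFFF */
    { "External RAM",    false }, /* 0x60000000 - 0x7FFFFFFF */
    { "External RAM",    false }, /* 0x80000000 - 0x9FFFFFFF */
    { "External device", true  }, /* 0xA0000000 - 0xBFFFFFFF */
    { "External device", true  }, /* 0xC0000000 - 0xDFFFFFFF */
    { "System",          true  }, /* 0xE0000000 - 0xFFFFFFFF */
};
static_assert(sizeof default_map / sizeof default_map[0] == 8,
              "one entry per 512 MB slice");

/* The top three address bits select the slice. */
const map_region_t *default_map_lookup(uint32_t addr)
{
    return &default_map[addr >> 29];
}
```

Looking up 0xFB04F85C lands in the System slice, which is execute-never in the default map, so an instruction fetch from there would be expected to fault even with the MPU disabled.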

It's interesting that you're not able to replicate this in debug mode. Have you tried to comment out the sleep/idle function in your application to see if that may have the same effect as being in debug mode?

RE: Wierd hardfault

alison.lloyd — Thu, 10 Jun 2021 10:19:09 GMT

Hi Vidar,

Unfortunately, having any sort of debugger attached while running prevents the hardfault showing up at all, which would tend to suggest some sort of timing issue. Using a UART has the same effect. I can attach to the running chip after the hardfault has happened, to see the state of things, but I can't watch the fault happen (which also means not getting the debug output).

The stack frame passed to my HardFault_process function looks like this:

And the CPU registers (according to Ozone) are:

As the stack guard library uses the MPU, wouldn't it basically show the same symptom, i.e. an instruction access violation, if the stack guard was triggered?

I've had a go at the old-fashioned "write DEADBEEF to the stack at start-up" method, and that suggests that this code isn't actually over-running the stack. At the point a hardfault has occurred, connecting after the fact via Ozone shows that 0x2001E000 is all DEADBEEF still. I've also confirmed again via MPU_CTRL that the MPU definitely doesn't think it's enabled.

So, all in all, it doesn't look like stack overflow is the issue here. Which brings us back to a weird hardfault that doesn't seem to have any obvious cause, but occurs repeatably.
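For readers unfamiliar with the technique, the stack-painting check can be sketched as below. This is a host-side illustration on an ordinary array rather than the actual start-up code used here; the function names are hypothetical, and on a real target the fill has to run before the stack is in use (typically in the start-up code).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_PAINT 0xDEADBEEFu

/* Fill the whole stack area with the paint pattern. */
void stack_paint(uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_PAINT;
    }
}

/* Count still-painted words from the lowest address upwards. With a
 * descending stack the low end is overwritten last, so this is the
 * headroom that has never been touched; 0 suggests an overrun. */
size_t stack_untouched_words(const uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    size_t untouched = 0;
    while (untouched < words && stack[untouched] == STACK_PAINT) {
        untouched++;
    }
    return untouched;
}
```

If the words at the bottom of the stack region still hold the pattern after the fault, as reported above for 0x2001E000, the stack has not grown that far down.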

RE: Wierd hardfault

Vidar Berg — Thu, 10 Jun 2021 09:21:16 GMT

Hi,

Your understanding is correct, the pointer will be set to NULL if there is a stack overflow, but only if there is a stack overflow while the hardfault exception is raised. And a stack overrun doesn't necessarily trigger a fault immediately (if at all). To catch a stack overflow early you can use the Stack guard library or try to set a data breakpoint at the bottom of the stack.

Setting data breakpoint in Ozone

Did the hardfault handler print the debug messages? It would be interesting to see the CPU register values that were pushed on stack before the hardfault exception.

RE: Wierd hardfault

alison.lloyd — Wed, 09 Jun 2021 16:41:45 GMT

Hi Vidar,

I've pulled in the Nordic hardfault library, which doesn't seem to point to a stack overrun directly - if I understand correctly, the call to HardFault_process() should pass NULL if it's a stack overrun, and that doesn't seem to be happening. I get a valid pointer to a debug stack.

Some quick calculations: this project has a stack size of 8192 bytes (8k), which given RAM starts at 0x20000000, should give a stack placement of 0x2001E000 to 0x20020000 (128kB RAM). At the point of the hardfault, according to Ozone, the SP (R13) is pointing to 0x2001FF28, which should still be a legal bit of stack. It's getting close to the edge, mind, which is concerning, but not actually over.
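The arithmetic above can be written out as a small host-side sketch. It assumes the stack sits at the very top of RAM and is full-descending (the usual Cortex-M arrangement, but an assumption about this project's linker script); `stack_region_t` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t bottom; /* lowest valid stack address */
    uint32_t top;    /* initial SP, one past the highest stack byte */
} stack_region_t;

/* Stack occupies the last stack_size bytes of RAM (assumed placement). */
stack_region_t stack_region(uint32_t ram_base, uint32_t ram_size,
                            uint32_t stack_size)
{
    assert(stack_size <= ram_size);
    stack_region_t region = {
        ram_base + ram_size - stack_size,
        ram_base + ram_size
    };
    return region;
}

/* Descending stack: everything between SP and the top is in use. */
uint32_t stack_bytes_used(stack_region_t region, uint32_t sp)
{
    assert(sp >= region.bottom && sp <= region.top);
    return region.top - sp;
}

/* Remaining room between the bottom of the region and SP. */
uint32_t stack_bytes_free(stack_region_t region, uint32_t sp)
{
    assert(sp >= region.bottom && sp <= region.top);
    return sp - region.bottom;
}
```

With the numbers from this post (RAM base 0x20000000, 128 kB of RAM, 8192-byte stack) the region comes out as 0x2001E000 to 0x20020000. On the descending-stack assumption, an SP of 0x2001FF28 corresponds to 216 bytes in use and 7976 bytes free.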

The pointer the Nordic hardfault library gives me is 0x2001FF80, so a little higher than where Ozone says the SP is, which seems reasonable - I guess it's pointing a little higher up the call chain.

Unfortunately, we're really tight for RAM in this project, hence the 8kB stack. This means I can't easily bump up the stack to see if the problem goes away. Does this look like stack overrun issues, based on the above, or should we hunt elsewhere?

RE: Wierd hardfault

Vidar Berg — Mon, 07 Jun 2021 09:59:42 GMT

Hi Alison,

I don't think there is anything that points directly to the FPU at this point, but it may be related (e.g. contribute to a stack overflow). Could you maybe try with our HardFault handling library and see if it too points to the same address (i.e. PC value on stack)? The ARM documentation indicates that this fault doesn't necessarily require the MPU to be enabled.

RE: Wierd hardfault

alison.lloyd — Fri, 04 Jun 2021 12:51:29 GMT

Hi Vidar,

I checked this today, and the MPU is disabled (checking the MPU_CTRL register via Ozone). We're also not using a softdevice in this project at all, so not that. I'm not directly using the MPU or stack guard libraries.

Could this be related to FPU operation somehow?

Best,

Alison

RE: Wierd hardfault

Vidar Berg — Tue, 01 Jun 2021 09:28:42 GMT

Hi Alison,

We are only using the MPU in our MPU driver and Stack guard library as far as I'm aware, but I would recommend you check the MPU enable bit in the MPU->MPU_CTRL register (@ address 0xe000ed94) to be absolutely sure it's disabled in your project.

nrfjprog --memrd 0xe000ed94 // Should return 0x0 if MPU is disabled. Should probably be read after the fault exception has occurred too.
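If the value read back is not simply 0x0, the individual bits can be picked apart as in the host-side sketch below. The bit positions are those of the ARMv7-M MPU_CTRL register; the struct and function names are hypothetical and only there for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MPU_CTRL_ENABLE     (1u << 0) /* MPU enabled */
#define MPU_CTRL_HFNMIENA   (1u << 1) /* MPU stays active in HardFault/NMI handlers */
#define MPU_CTRL_PRIVDEFENA (1u << 2) /* default map as background for privileged code */

typedef struct {
    bool enable;
    bool hfnmiena;
    bool privdefena;
} mpu_ctrl_bits_t;

/* Split a raw MPU_CTRL value (e.g. as read with nrfjprog --memrd) into flags. */
void mpu_ctrl_decode(uint32_t raw, mpu_ctrl_bits_t *out)
{
    assert(out != NULL);
    out->enable     = (raw & MPU_CTRL_ENABLE) != 0u;
    out->hfnmiena   = (raw & MPU_CTRL_HFNMIENA) != 0u;
    out->privdefena = (raw & MPU_CTRL_PRIVDEFENA) != 0u;
}
```

A raw value of 0x0 decodes to all three flags clear, which is the "MPU disabled" case described above.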

The MWU should not cause the hardfault handler to get invoked. The Softdevice reserves this peripheral for its Memory isolation and runtime protection mechanism, which will invoke the SDK's fault handler callback whenever an access violation is detected.

Best regards,

Vidar

RE: Wierd hardfault

alison.lloyd — Mon, 31 May 2021 18:13:02 GMT

Hi Vidar,

Thanks for the article link, I'd actually already run across it (it's super useful for figuring out the debug).

We are not directly using either the Cortex MPU or the Nordic MWU. Are either of those used by any Nordic libraries or drivers?

Best,

Alison

RE: Wierd hardfault

Vidar Berg — Fri, 28 May 2021 16:05:58 GMT

Hello,

Not sure if I have encountered this type of fault before, but I found a blog post that suggests that this error is raised when the CPU attempts to execute code from a memory protected region.

IACCVIOL - Indicates that an attempt to execute an instruction triggered an MPU or Execute Never (XN) fault. We’ll explore an example below.

https://interrupt.memfault.com/blog/cortex-m-fault-debug
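As a reference point for the fault names that come up in this thread, here is a minimal host-side sketch (plain C, not Nordic SDK code) of where the three instruction-side flags sit in the Cortex-M Configurable Fault Status Register (CFSR, at 0xE000ED28). The bit positions follow the ARMv7-M register layout; the `fault_kind_t` enum and the `cfsr_classify` / `fault_kind_name` helpers are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* CFSR = MMFSR (bits 0-7) | BFSR (bits 8-15) | UFSR (bits 16-31). */
#define CFSR_IACCVIOL   (1u << 0)  /* MemManage: instruction access violation */
#define CFSR_IBUSERR    (1u << 8)  /* BusFault: instruction bus error */
#define CFSR_UNDEFINSTR (1u << 16) /* UsageFault: undefined instruction */

typedef enum {
    FAULT_NONE,
    FAULT_IACCVIOL,
    FAULT_IBUSERR,
    FAULT_UNDEFINSTR,
    FAULT_OTHER,
    FAULT_KIND_COUNT
} fault_kind_t;

/* Reports the first of the three flags that is set, checked in the order above. */
fault_kind_t cfsr_classify(uint32_t cfsr)
{
    if (cfsr & CFSR_IACCVIOL)   return FAULT_IACCVIOL;
    if (cfsr & CFSR_IBUSERR)    return FAULT_IBUSERR;
    if (cfsr & CFSR_UNDEFINSTR) return FAULT_UNDEFINSTR;
    return (cfsr != 0u) ? FAULT_OTHER : FAULT_NONE;
}

/* Human-readable name for logging. */
const char *fault_kind_name(fault_kind_t kind)
{
    static const char *const names[FAULT_KIND_COUNT] = {
        "none", "IACCVIOL", "IBUSERR", "UNDEFINSTR", "other"
    };
    assert((unsigned)kind < (unsigned)FAULT_KIND_COUNT);
    return names[kind];
}
```

All three flags describe a failed instruction fetch or decode, so a bad PC can plausibly surface as any of them depending on where it happens to point.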

Could this be it, or are you not using the MPU in your project?

Best regards,

Vidar