Weird hardfault

I have a bit of an odd one and would appreciate any input. We have a fairly complex project that does various things, but some of the time I get a hardfault. The issue appears to be an instruction access violation (MMFSR->IACCVIOL = 1). Working backwards, the code that caused this seems to be the nrfx_coredep_delay_us function (in nrfx_coredep.h).
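For anyone following along, the MemManage status bits can be decoded from a raw register value roughly like this. It is a minimal sketch that assumes the 32-bit CFSR value (address 0xE000ED28 on a Cortex-M4) has already been read, for example through a debugger. The macro and function names are hypothetical and not part of the SDK; the bit positions follow the ARMv7-M architecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* MMFSR is the low byte of CFSR (0xE000ED28). Bit positions per ARMv7-M. */
#define MMFSR_IACCVIOL  (1u << 0)  /* instruction access violation */
#define MMFSR_DACCVIOL  (1u << 1)  /* data access violation */
#define MMFSR_MUNSTKERR (1u << 3)  /* fault while unstacking on exception return */
#define MMFSR_MSTKERR   (1u << 4)  /* fault while stacking on exception entry */
#define MMFSR_MMARVALID (1u << 7)  /* MMFAR holds a valid fault address */

/* Hypothetical helper: extract the MemManage status byte from a CFSR value. */
uint8_t mmfsr_from_cfsr(uint32_t cfsr)
{
    return (uint8_t)(cfsr & 0xFFu);
}

/* Hypothetical helper: true when the IACCVIOL bit is set. */
bool is_instruction_access_violation(uint32_t cfsr)
{
    return (mmfsr_from_cfsr(cfsr) & MMFSR_IACCVIOL) != 0;
}
```

For IACCVIOL the processor does not write MMFAR, so the address of interest is the PC value in the stacked exception frame.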

We're using SDK 17.0.2. The code in question is below, and the PC points to delay_machine_code:

    __ALIGN(16)
    static const uint16_t delay_machine_code[] = {
        0x3800 + NRFX_COREDEP_DELAY_US_LOOP_CYCLES, // SUBS r0, #loop_cycles
        0xd8fd, // BHI .-2
        0x4770  // BX LR
    };

    typedef void (* delay_func_t)(uint32_t);
    const delay_func_t delay_cycles =
        // Set LSB to 1 to execute the code in the Thumb mode.
        (delay_func_t)((((uint32_t)delay_machine_code) | 1));
    uint32_t cycles = time_us * NRFX_DELAY_CPU_FREQ_MHZ;
    delay_cycles(cycles);
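To make the three halfwords above less opaque, here is a small host-side sketch of how they decode. It assumes the standard 16-bit Thumb encodings for SUBS (immediate) and conditional branch. The function names are hypothetical, and the loop-cycle value is left as a parameter rather than taken from the SDK.

```c
#include <stdint.h>

/* SUBS Rdn, #imm8 in 16-bit Thumb: 0b00111 | Rdn (3 bits) | imm8 (8 bits). */
uint16_t thumb_subs_imm8(unsigned rdn, unsigned imm8)
{
    return (uint16_t)(0x3800u | ((rdn & 0x7u) << 8) | (imm8 & 0xFFu));
}

/* Byte offset of a 16-bit conditional branch target, relative to the
 * address of the branch instruction itself: 4 + 2 * sign_extend(imm8). */
int32_t thumb_bcond_offset(uint16_t insn)
{
    int32_t imm8 = (int32_t)(insn & 0xFFu);
    if (imm8 & 0x80) {
        imm8 -= 0x100;  /* sign-extend the 8-bit field */
    }
    return 4 + 2 * imm8;
}

/* A function pointer to Thumb code carries bit 0 set; the core fetches
 * from the address with bit 0 cleared. */
uint32_t thumb_entry(uint32_t code_addr)
{
    return code_addr | 1u;
}
```

With these helpers, 0xd8fd decodes to an offset of -2 bytes, meaning the BHI jumps back to the SUBS just before it, which matches the ".-2" comment in the listing.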

I wasn't entirely sure about the way the code is hand-loaded, so I tried switching to the DWT instead (since the nRF52833 supports that). This produced the same hardfault behaviour, only now it's on

while ((DWT->CYCCNT - cyccnt_initial) < time_cycles)
    {}
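The comparison in that loop relies on unsigned arithmetic wrapping modulo 2^32, so it stays correct when CYCCNT rolls over once during the delay. A minimal host-side sketch of the same arithmetic follows, with hypothetical function names and the counter readings passed in as plain values instead of being read from DWT->CYCCNT:

```c
#include <stdbool.h>
#include <stdint.h>

/* Cycles elapsed between two counter readings. Unsigned subtraction wraps
 * modulo 2^32, so the result is correct across a single counter rollover. */
uint32_t cycles_elapsed(uint32_t start, uint32_t now)
{
    return (uint32_t)(now - start);
}

/* True once at least delay_cycles have passed since start. */
bool delay_expired(uint32_t start, uint32_t now, uint32_t delay_cycles)
{
    return cycles_elapsed(start, now) >= delay_cycles;
}
```

So the loop logic itself looks sound, which fits the suspicion that the fault comes from somewhere else.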

I suspect there's something else going on here. Our code does use nrf_delay_ms for certain things that need busy waits, which calls through to the above code, but it only hardfaults some of the time. As above, we're using SDK 17.0.2, nrf52833 on a custom board.

I'm at a bit of a loss on this one, suggestions welcome?

0 Vidar Berg over 4 years ago

Hello,

I'm not sure if I have encountered this type of fault before, but I found a blog post that suggests this error is raised when the CPU attempts to execute code from a memory-protected region.

IACCVIOL - Indicates that an attempt to execute an instruction triggered an MPU or Execute Never (XN) fault. We’ll explore an example below.

https://interrupt.memfault.com/blog/cortex-m-fault-debug

Could this be it, or are you not using the MPU in your project?

Best regards,

Vidar

0 alison.lloyd over 4 years ago in reply to Vidar Berg

Hi Vidar,

Thanks for the article link, I'd actually already run across it (it's super useful for figuring out the debug).

We are not directly using either the Cortex MPU or the Nordic MWU. Are either of those used by any Nordic libraries or drivers?

Best,

Alison

0 Vidar Berg over 4 years ago in reply to alison.lloyd

Hi Alison,

We are only using the MPU in our MPU driver and Stack guard library as far as I'm aware, but I would recommend you check the MPU enable bit in the MPU->MPU_CTRL register (@ address 0xe000ed94) to be absolutely sure it's disabled in your project.

nrfjprog --memrd 0xe000ed94 // Should return 0x0 if MPU is disabled. Should probably be read after the fault exception has occurred too.
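The value read back can be interpreted with something like the sketch below. It assumes the ARMv7-M MPU_CTRL bit layout (ENABLE in bit 0, HFNMIENA in bit 1, PRIVDEFENA in bit 2). The macro and function names are hypothetical and kept separate from the CMSIS definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* MPU_CTRL (0xE000ED94) bit layout per ARMv7-M. Names are illustrative. */
#define MPU_CTRL_BIT_ENABLE     (1u << 0)  /* MPU enabled */
#define MPU_CTRL_BIT_HFNMIENA   (1u << 1)  /* MPU active in HardFault/NMI handlers */
#define MPU_CTRL_BIT_PRIVDEFENA (1u << 2)  /* default map as privileged background */

/* Hypothetical helper: true when the ENABLE bit is set in a raw MPU_CTRL value. */
bool mpu_is_enabled(uint32_t mpu_ctrl)
{
    return (mpu_ctrl & MPU_CTRL_BIT_ENABLE) != 0;
}
```

A readback of 0x0 therefore means the MPU is off. Any value with bit 0 set means it is on, whatever the other two bits say.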

The MWU should not cause the hardfault handler to be invoked. The SoftDevice reserves this peripheral for its memory isolation and runtime protection mechanism, which invokes the SDK's fault handler callback whenever an access violation is detected.

Best regards,

Vidar

0 alison.lloyd over 4 years ago in reply to Vidar Berg

Hi Vidar,

I checked this today, and the MPU is disabled (checking the MPU_CTRL register via Ozone). We're also not using a softdevice in this project at all, so not that. I'm not directly using the MPU or stack guard libraries.

Could this be related to FPU operation somehow?

Best,

Alison

0 Vidar Berg over 4 years ago in reply to alison.lloyd

Hi Alison,

I don't think there is anything that points directly to the FPU at this point, but it may be related (e.g. contribute to a stack overflow). Could you maybe try with our HardFault handling library and see if it too points to the same address (i.e. PC value on stack)? The ARM documentation indicates that this fault doesn't necessarily require the MPU to be enabled.

0 alison.lloyd over 4 years ago in reply to Vidar Berg

Hi Vidar,

I've pulled in the Nordic hardfault library, which doesn't seem to point to a stack overrun directly. If I understand correctly, the call to HardFault_process() should pass NULL if it's a stack overrun, and that doesn't seem to be happening. I get a valid pointer to a debug stack.

Some quick calculations: this project has a stack size of 8192 bytes (8k), which given RAM starts at 0x20000000, should give a stack placement of 0x2001E000 to 0x20020000 (128kB RAM). At the point of the hardfault, according to Ozone, the SP (R13) is pointing to 0x2001FF28, which should still be a legal bit of stack. It's getting close to the edge, mind, which is concerning, but not actually over.
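The arithmetic above can be checked with a short sketch. It assumes the layout described in this post (128 kB of RAM at 0x20000000, an 8192-byte stack at the very top of RAM) and a full-descending stack, as on Cortex-M, where the SP starts at the top and moves towards lower addresses. The macro and function names are hypothetical.

```c
#include <stdint.h>

#define RAM_BASE   0x20000000u      /* start of RAM, from the post */
#define RAM_SIZE   (128u * 1024u)   /* 128 kB, from the post */
#define STACK_SIZE 8192u            /* 8 kB stack, from the post */

/* Initial SP: one past the highest stack address (assumed top of RAM). */
uint32_t stack_top(void)   { return RAM_BASE + RAM_SIZE; }

/* Lowest legal stack address. */
uint32_t stack_limit(void) { return stack_top() - STACK_SIZE; }

/* Bytes currently in use for a given SP (the stack grows downwards). */
uint32_t stack_used(uint32_t sp)      { return stack_top() - sp; }

/* Bytes still free below the given SP. */
uint32_t stack_remaining(uint32_t sp) { return sp - stack_limit(); }
```

Under those assumptions, an SP of 0x2001FF28 is 216 bytes below the top of the stack with 7976 bytes still free, so at the moment of the fault the SP is near the top rather than near the overflow limit at 0x2001E000. This says nothing about how deep the stack went earlier in the run.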

The pointer the Nordic hardfault library gives me is 0x2001FF80, so a little higher than where Ozone says the SP is, which seems reasonable - I guess it's pointing a little higher up the call chain.

Unfortunately, we're really tight for RAM in this project, hence the 8kB stack. This means I can't easily bump up the stack to see if the problem goes away. Does this look like stack overrun issues, based on the above, or should we hunt elsewhere?

0 Vidar Berg over 4 years ago in reply to alison.lloyd

Hi,

Your understanding is correct, the pointer will be set to NULL if there is a stack overflow, but only if there is a stack overflow while the hardfault exception is raised. And a stack overrun doesn't necessarily trigger a fault immediately (if at all). To catch a stack overflow early you can use the Stack guard library or try to set a data breakpoint at the bottom of the stack.

Setting data breakpoint in Ozone

Did the hardfault handler print the debug messages? It would be interesting to see the CPU register values that were pushed on stack before the hardfault exception.
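For reference, the registers pushed on exception entry can be picked out of the frame pointer along these lines. This is a minimal sketch assuming the basic ARMv7-M exception frame (eight words, without the extended FPU context). The enum and function names are hypothetical and not the SDK's own types.

```c
#include <stdint.h>

/* Word offsets within the basic exception frame, lowest address first. */
enum {
    FRAME_R0, FRAME_R1, FRAME_R2, FRAME_R3,
    FRAME_R12, FRAME_LR, FRAME_PC, FRAME_XPSR,
    FRAME_WORDS
};

/* Address of the instruction that was executing (or faulting) on entry. */
uint32_t stacked_pc(const uint32_t *frame)
{
    return frame[FRAME_PC];
}

/* Return address of the function that was executing on entry. */
uint32_t stacked_lr(const uint32_t *frame)
{
    return frame[FRAME_LR];
}
```

If the stacked PC is garbage, the stacked LR is often the more useful value, since it may still point back at the caller of whatever went wrong.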

0 alison.lloyd over 4 years ago in reply to Vidar Berg

Hi Vidar,

Unfortunately, having any sort of debugger attached while running prevents the hardfault showing up at all, which would tend to suggest some sort of timing issue. Using a UART has the same effect. I can attach to the running chip after the hardfault has happened, to see the state of things, but I can't watch the fault happen (which also means not getting the debug output).

The stack frame passed to my HardFault_process function looks like this:

And the CPU registers (according to Ozone) are:

As the stack guard library uses the MPU, wouldn't it basically show the same symptom, i.e. an instruction access violation, if the stack guard was triggered?

I've had a go at the old-fashioned "write DEADBEEF to the stack at start-up" method, and that suggests that this code isn't actually over-running the stack. After a hardfault has occurred, connecting via Ozone shows that 0x2001E000 is still all DEADBEEF. I've also confirmed again via MPU_CTRL that the MPU definitely doesn't think it's enabled.
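For completeness, the painting approach boils down to something like the sketch below. It is written against a plain buffer so it can run on a host; the function names are hypothetical. On a real target the fill would have to run at start-up and skip the part of the stack already in use.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu

/* Fill the stack region with a known pattern, starting at its lowest address. */
void stack_paint(uint32_t *lowest, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        lowest[i] = STACK_FILL;
    }
}

/* Count words, from the lowest address upwards, that still hold the pattern.
 * With a descending stack these are the words that were never reached. */
size_t stack_untouched_words(const uint32_t *lowest, size_t words)
{
    size_t n = 0;
    while (n < words && lowest[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

A result of zero would mean the lowest word was overwritten, which is the overflow case. Anything above zero is the remaining headroom in words.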

So, all in all, it doesn't look like stack overflow is the issue here. Which brings us back to a weird hardfault that doesn't seem to have any obvious cause, but occurs repeatably.

0 Vidar Berg over 4 years ago in reply to alison.lloyd

Hi,

I agree, it doesn't sound like you have had stack overflow, but the PC value (0xFB04F85C) on stack is clearly invalid, so that must be the reason for the hardfault.
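As a rough illustration of why that PC value would fault even with the MPU disabled: in the ARMv7-M default memory map, several address ranges are Execute Never regardless of MPU state, and an instruction fetch from one of them raises IACCVIOL. The sketch below encodes that default map as I understand it from the ARM documentation. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the address lies in a region that the ARMv7-M default memory map
 * marks Execute Never (XN), independent of any MPU configuration. */
bool is_default_xn(uint32_t addr)
{
    if (addr >= 0x40000000u && addr <= 0x5FFFFFFFu) {
        return true;   /* Peripheral region */
    }
    if (addr >= 0xA0000000u && addr <= 0xDFFFFFFFu) {
        return true;   /* External device region */
    }
    if (addr >= 0xE0000000u) {
        return true;   /* Private peripheral bus and system region */
    }
    return false;      /* Code, SRAM and external RAM are executable */
}
```

0xFB04F85C falls in the system region, so a branch to it would produce exactly this fault. That leaves open where the bad value came from, for example a corrupted return address or function pointer.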

It's interesting that you're not able to replicate this in debug mode. Have you tried commenting out the sleep/idle function in your application to see if that has the same effect as being in debug mode?