Trying to figure out a bus fault error (Zephyr + NRF52840)

Hey all,

I'm looking for a bit of guidance here to nail a bus fault that happens about 30 minutes into the program. 

The message is the following:

[00:32:57.663,177] <err> os: ***** BUS FAULT *****
[00:32:57.663,177] <err> os:   Precise data bus error
[00:32:57.663,208] <err> os:   BFAR Address: 0x80bf68c8
[00:32:57.663,208] <err> os: r0/a1:  0x80bf68b4  r1/a2:  0x20029668  r2/a3:  0x20029640
[00:32:57.663,238] <err> os: r3/a4:  0x00000040 r12/ip:  0x2002158b r14/lr:  0x0004307d
[00:32:57.663,238] <err> os:  xpsr:  0x91000000
[00:32:57.663,238] <err> os: Faulting instruction address (r15/pc): 0x00043d9e
[00:32:57.663,299] <err> os: >>> ZEPHYR FATAL ERROR 25: Unknown error on CPU 0
[00:32:57.663,330] <err> os: Current thread: 0x20002828 (unknown)
[00:32:57.724,395] <err> fatal_error: Resetting system

Reading through previous threads I found some information:

(base) PS \build\zephyr> arm-none-eabi-addr2line.exe -e .\zephyr.elf 0x4307d
C:/ncs/v2.5.0/zephyr/kernel/sched.c:1959
(base) PS \build\zephyr> arm-none-eabi-addr2line.exe -e .\zephyr.elf 0x43d9e
C:/ncs/v2.5.0/zephyr/kernel/pipes.c:217

That tells me that the error seems to be in one of the pipes. Additionally I found the following:

0x20029668:
0x20029640:
 .noinit."../src/tasks/controller.c".0
                0x0000000020028f60      0x800 app/libapp.a(controller.c.obj)
                0x0000000020028f60                _k_thread_stack_controller_rx_task_tid

It seems the receiver task (controller_rx_task_tid) is causing the issue, which makes sense because that thread is related to writing to some pipes. So far I think I'm able to narrow it to where the issue is happening. 

What I'm confused about and could not find any information about was:

[00:32:57.663,208] <err> os:   BFAR Address: 0x80bf68c8

Is there a peripheral with that address on the NRF52840 series? I could not find that on the product specification. There are two addresses in that range that I'm not sure where they are:

[00:32:57.663,208] <err> os:   BFAR Address: 0x80bf68c8
[00:32:57.663,208] <err> os: r0/a1:  0x80bf68b4

Any suggestions or additional information you can give me that could help me search for other things is appreciated.

Cheers,

Alberto

  • You are looking at a stack overflow that has overwritten important thread information that the scheduler uses when switching between threads.

    Can happen when you increase (or tun on) logging levels. Try a bigger stack for that thread.

  • Hello Alberto,

    It looks like the bus fault is caused by 'thread' pointer passed as the first argument to pipe_walk_op() being corrupted. '0x80bf68b4' and '0x80bf68c8' does not map to any valid memory on the device, and the fault is triggered as soon as the function tries to assign the 'pipe descriptor' from the corrupt thread struct pointer.

    t

    I suspect the thread struct is being overwritten elsewhere in your appliction, maybe by a buffer overrun. A stack overflow should normally have been caught by the stack guard. Though, increasing the stack sizes are worth a try.

    Are you able to check the 'struct k_thread' variables in your application after the crash to see if they have been overwritten? 

    Best regards,

    Vidar 

  • Hey Vidar, Turbo J,

    I will try by increasing the stack size, that is my first suspicion, everything else seems to work well.

     I'm not sure how to check the 'struct k_thread' variables after the crash... I don't have the debugger connected or running. The firmware is compiled in release mode, I have access to J-Link, but it would take me some time to figure out how to check the variables using Ozone. And of course, after the crash, the unit resets... but I'll look into it.

    Any other way I can check the k_thread variables after the crash?

    Cheers,

    Alberto

  • You would have to catch the crash with a connected debugger - a  hard fault reset will likely just overwrite that structure in RAM.

    Won't help you much in buffer overflow cases - the cow has already turned into hamburger at this point in time.

    Its usually simpler to check the map file what sturctures and varables are "close to" the k_thread structure thats getting trashed and check code that touches these for BO.

  • Thank you  .

    Alberto, to avoid the hardfault reset during, you may temporalily disable the CONFIG_RESET_ON_FATAL_ERROR setting in your release build. This will cause the program enter an infinte loop in the fault handler. You can then use Segger Ozone or nrfjprog to inspect the memory. Remember to select "attach to running target" if using Ozone so the chip does not reset.

    Its usually simpler to check the map file what sturctures and varables are "close to" the k_thread structure thats getting trashed and check code that touches these for BO.

    This is a good suggestion. You can check the name of the 'struct k_thread' variable(s) you have declared in your application and then look up their addresses in the build/zephyr/zephyr.map file.

    Reading memory with nrfjprog

    $ nrfjprog --memrd <insert RAM address here>

Related