nRF54L15 crashes on debug when using TFM due to GDB triggering the MPC while reading call stacks

I'm building the nrf/samples/openthread/cli sample with the nrf54l15dk/nrf54l15/cpuapp/ns board. The only change I'm making is enabling CONFIG_DEBUG_THREAD_INFO.
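For reference, a minimal prj.conf fragment for this reproduction might look like the one below. Only CONFIG_DEBUG_THREAD_INFO is needed to hit the problem; CONFIG_INIT_STACKS is an optional addition, included here on the assumption that you want the bogus return address to be a deterministic value.

```
# The only change made to the stock sample
CONFIG_DEBUG_THREAD_INFO=y

# Optional: fill thread stacks with a known pattern at creation,
# so the dummy return address is predictable
CONFIG_INIT_STACKS=y
```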

The app runs fine until you pause execution in any way with the debugger. When GDB pauses execution, it reads the call stacks which eventually lead to a dummy return address since GDB doesn't know to stop at z_thread_entry. GDB tries to continue unwinding the call stack and ends up reading the memory at the dummy return address. That read makes the MPC raise a memory access error. For instance, if CONFIG_INIT_STACKS is enabled, the error is reported at 0xAAAAAAAA, since the stack was initialized with that fill value.

I've verified that this is the cause by purposefully placing a custom canary value in the initial return address. When I do this, I can see that the memory access error always lines up with what I put there. I end up in a TF-M halt.

Additionally, although this is potentially hacky, you can do a similar thing and replace the dummy return address with the address of main.

This takes advantage of GDB's built-in behaviour of stopping the unwind once it sees the main symbol, so GDB never reads the bogus address and the memory access error does not occur. This "fix" could have other side effects, though, that make it impractical.

The changes described here are in zephyr/kernel/thread.c within the CONFIG_INIT_STACKS ifdef.
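The effect described above can be illustrated with a small host-side model. This is a sketch under stated assumptions, not the actual change to zephyr/kernel/thread.c: the function names are hypothetical, the frame layout is the basic Cortex-M exception frame, and the model assumes that stack setup writes only the entry pc and leaves the saved lr slot holding the fill pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Word offsets in a basic Cortex-M exception frame:
 * r0, r1, r2, r3, r12, lr, pc, xPSR. */
enum {
    FRAME_R0, FRAME_R1, FRAME_R2, FRAME_R3,
    FRAME_R12, FRAME_LR, FRAME_PC, FRAME_XPSR,
    FRAME_WORDS
};

/* Hypothetical model of thread stack setup with stack filling enabled.
 * Every byte is set to 0xAA, then an initial frame is carved from the
 * top of the stack. Only pc and xPSR are written (an assumption of this
 * model), so the lr slot keeps the fill value 0xAAAAAAAA. */
uint32_t *model_init_stack(uint32_t *stack, size_t words, uint32_t entry_pc)
{
    assert(words >= FRAME_WORDS);
    memset(stack, 0xAA, words * sizeof(uint32_t));

    uint32_t *frame = stack + (words - FRAME_WORDS);
    frame[FRAME_PC] = entry_pc;
    frame[FRAME_XPSR] = 0x01000000u; /* Thumb state bit */
    return frame;
}

/* Hypothetical model of the experiment: overwrite the dummy return
 * address with a chosen value (a canary, or the address of main). */
void model_patch_return_address(uint32_t *frame, uint32_t value)
{
    frame[FRAME_LR] = value;
}
```

In this model, frame[FRAME_LR] is the value a debugger would treat as the caller of the thread entry function. Patching it with a canary changes the address the debugger goes on to read, which is the correlation the experiment relies on.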

  • Hi,

     

    Thank you for sharing.

    I took an arbitrary example, and used "west attach" to connect to an already running debug session, with this backtrace:

    (gdb) bt
    #0  __enable_irq () at /opt/ncs/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:951
    #1  arch_cpu_idle () at /opt/ncs/zephyr/arch/arm/core/cortex_m/cpu_idle.c:104
    #2  0x0000613c in k_cpu_idle () at /opt/ncs/zephyr/include/zephyr/kernel.h:6323
    #3  idle (unused1=<optimized out>, unused2=<optimized out>, unused3=<optimized out>) at /opt/ncs/zephyr/kernel/idle.c:75
    #4  0x00000d02 in z_thread_entry (entry=0x611d <idle>, p1=0x20000d18 <_kernel>, p2=0x0 <cbvprintf_package>, 
        p3=0x0 <cbvprintf_package>) at /opt/ncs/zephyr/lib/os/thread_entry.c:48
    #5  0x00000000 in ?? ()

     

    Next, I printed MPC.EVENTS_MEMACCERR and the MEMACCERR.* registers, using the addresses from the datasheet (https://docs.nordicsemi.com/bundle/ps_nrf54L15/page/mpc.html#ariaid-title4):

    (gdb) print *0x50041100
    $1 = 0x0
    (gdb) print *0x50041400
    $2 = 0x0
    (gdb) print *0x50041404
    $3 = 0x0

     

    So far, so good.

    Now, to trigger the MPC:

    (gdb) print *0xaaaaaaaa
    Cannot access memory at address 0xaaaaaaaa

     

    And print the MPC registers mentioned above:

    (gdb) print *0x50041404
    $4 = 0x19041
    (gdb) print *0x50041400
    $5 = 0xaaaaaaaa
    (gdb) print *0x50041100
    $6 = 0x1

    This behaviour correlates with your observation:

    The app runs fine until you pause execution in any way with the debugger. When GDB pauses execution, it reads the call stacks which eventually lead to a dummy return address since GDB doesn't know to stop at z_thread_entry

    With your configuration, i.e. with the secure/non-secure split:

    nrf54l15dk/nrf54l15/cpuapp/ns board

    the secure side, i.e. TF-M, owns the MPC peripheral.

     

    I took the hello_world sample, built it for the /ns board, and tried the same procedure. The MPC registers are set as before after a print *0xAAAAAAAA, and I can confirm that TF-M triggers a fault on this:

    arch_cpu_idle () at /opt/ncs/zephyr/arch/arm/core/cortex_m/cpu_idle.c:104
    104             __enable_irq();
    (gdb) print *0xaaaaaaaa
    Cannot access memory at address 0xaaaaaaaa
    (gdb) c
    Continuing.
    ^C
    Program received signal SIGTRAP, Trace/breakpoint trap.
    tfm_hal_system_halt () at /opt/ncs/nrf/modules/trusted-firmware-m/tfm_boards/common/tfm_hal_reset_halt.c:30
    30              while (1) {
    (gdb) bt
    #0  tfm_hal_system_halt () at /opt/ncs/nrf/modules/trusted-firmware-m/tfm_boards/common/tfm_hal_reset_halt.c:30
    #1  0x0000744c in tfm_core_panic () at /opt/ncs/modules/tee/tf-m/trusted-firmware-m/secure_fw/spm/core/utilities.c:26
    #2  0x0000640a in MPC_Handler ()
        at /opt/ncs/modules/tee/tf-m/trusted-firmware-m/platform/ext/target/nordic_nrf/common/core/faults.c:102
    Backtrace stopped: previous fra
     

     

    Kind regards,

    Håkon

  • I tested the hello_world sample to simplify this as well. To make the crash more repeatable, it helps to set CONFIG_INIT_STACKS=y, which forces that return address to a bogus value that is unmapped in memory. If the address is 0x0, the MPC won't trigger. Any thoughts on whether this is worth raising in Zephyr's GitHub issues?
