nRF54L15 crashes on debug when using TFM due to GDB triggering the MPC while reading call stacks

I'm building the nrf/samples/openthread/cli sample with the nrf54l15dk/nrf54l15/cpuapp/ns board. The only change I'm making is enabling CONFIG_DEBUG_THREAD_INFO.

The app runs fine until you pause execution in any way with the debugger. When GDB pauses execution, it reads the call stacks which eventually lead to a dummy return address since GDB doesn't know to stop at z_thread_entry. GDB tries to continue unwinding the call stack and ends up reading the memory at the dummy return address. This triggers the MPC to cause a memory access error. For instance, if CONFIG_INIT_STACKS is enabled, this results in a memory access error at 0xAAAAAAAA since the stack was initialized to that value.
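To illustrate why the faulting address comes out as 0xAAAAAAAA, here is a minimal host-side sketch. It is not Zephyr code: `fill_stack` and `read_saved_return_address` are hypothetical helpers, and the 0xAA fill byte is an assumption taken from the address reported above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 0xAA is the fill byte used when CONFIG_INIT_STACKS is enabled,
 * matching the 0xAAAAAAAA address reported in this post. */
#define STACK_FILL_BYTE 0xAAu

/* Hypothetical helper: pre-fill a stack buffer with the fill byte. */
static void fill_stack(uint8_t *stack, size_t size)
{
    assert(stack != NULL);
    memset(stack, STACK_FILL_BYTE, size);
}

/* Hypothetical helper: read the 32-bit word that an unwinder would treat
 * as a saved return address at the given byte offset. */
static uint32_t read_saved_return_address(const uint8_t *stack, size_t size,
                                          size_t offset)
{
    uint32_t word;

    assert(offset + sizeof(word) <= size);
    memcpy(&word, stack + offset, sizeof(word));
    return word;
}
```

Any slot that was never overwritten after the fill reads back as 0xAAAAAAAA, so an unwinder that treats it as a return address will try to read memory there.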

I've verified that this is the cause by purposefully placing a custom canary value in the initial return address. When I do this, I can see that the memory access error always lines up with what I put there. I end up in a TF-M halt.

Additionally, although it is potentially hacky, you can do a similar thing and replace the dummy return address with the return address for main.

This takes advantage of GDB's built-in behavior of stopping the stack unwind once it sees the main symbol. This "fix" could have other side effects that make it impractical, though.

This causes GDB to stop unwinding and prevents the memory access error.

The changes shown here are in zephyr/kernel/thread.c within the CONFIG_INIT_STACKS ifdef.
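Separately from those changes, the stop-at-main idea can be shown with a toy host-side model. This is not how GDB is implemented. `TOY_MAIN_START`, `TOY_MAIN_END` and both functions are hypothetical names, and the address range is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address range standing in for the main symbol. */
#define TOY_MAIN_START 0x00001000u
#define TOY_MAIN_END   0x00001100u

static bool toy_in_main(uint32_t pc)
{
    return pc >= TOY_MAIN_START && pc < TOY_MAIN_END;
}

/* Walk saved return addresses, innermost frame first. The walk stops as
 * soon as a return address lands inside toy main. Returns true if the walk
 * ran through the whole chain without meeting main, which models the case
 * where a debugger keeps unwinding and reads whatever the last slot holds. */
static bool toy_unwind_runs_past_entry(const uint32_t *saved_pcs, size_t count)
{
    assert(saved_pcs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (toy_in_main(saved_pcs[i])) {
            return false;
        }
    }
    return true;
}
```

With a chain ending in 0xAAAAAAAA the walk runs past the thread entry. With the last slot pointing into the toy main range it stops there.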

Håkon Alseth 6 months ago

Hi,

Thank you for sharing.

I took an arbitrary example, and used "west attach" to connect to an already running debug session, with this backtrace:

(gdb) bt
#0  __enable_irq () at /opt/ncs/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:951
#1  arch_cpu_idle () at /opt/ncs/zephyr/arch/arm/core/cortex_m/cpu_idle.c:104
#2  0x0000613c in k_cpu_idle () at /opt/ncs/zephyr/include/zephyr/kernel.h:6323
#3  idle (unused1=<optimized out>, unused2=<optimized out>, unused3=<optimized out>) at /opt/ncs/zephyr/kernel/idle.c:75
#4  0x00000d02 in z_thread_entry (entry=0x611d <idle>, p1=0x20000d18 <_kernel>, p2=0x0 <cbvprintf_package>, 
    p3=0x0 <cbvprintf_package>) at /opt/ncs/zephyr/lib/os/thread_entry.c:48
#5  0x00000000 in ?? ()

Next, I printed MPC.EVENTS_MEMACCERR and the MEMACCERR.* registers, using the addresses from the datasheet (https://docs.nordicsemi.com/bundle/ps_nrf54L15/page/mpc.html#ariaid-title4):

(gdb) print *0x50041100
$1 = 0x0
(gdb) print *0x50041400
$2 = 0x0
(gdb) print *0x50041404
$3 = 0x0

So far, so good.

Now, to trigger the MPC:

(gdb) print *0xaaaaaaaa
Cannot access memory at address 0xaaaaaaaa

And print the MPC registers mentioned above:

(gdb) print *0x50041404
$4 = 0x19041
(gdb) print *0x50041400
$5 = 0xaaaaaaaa
(gdb) print *0x50041100
$6 = 0x1
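For reference, the three reads can be expressed as a small C sketch. The addresses are the ones used in the GDB session. The register names in the macros are assumptions inferred from the values read back (event flag, faulting address, info word), and `mpc_capture`, `mpc_error_at` and `replay_after_probe` are hypothetical helpers. The reader is passed in as a function pointer so the sketch runs on a host. On target it would be a volatile load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Addresses taken from the GDB session above. The names are assumptions
 * inferred from the values read back, not confirmed here. */
#define MPC_EVENTS_MEMACCERR  0x50041100u
#define MPC_MEMACCERR_ADDRESS 0x50041400u
#define MPC_MEMACCERR_INFO    0x50041404u

struct mpc_snapshot {
    uint32_t event;   /* non-zero once an access error has been latched */
    uint32_t address; /* address of the offending access */
    uint32_t info;    /* raw info word, left undecoded in this sketch */
};

/* Hypothetical reader type: abstracts the 32-bit register read. */
typedef uint32_t (*read32_fn)(uint32_t addr);

static struct mpc_snapshot mpc_capture(read32_fn read32)
{
    struct mpc_snapshot s;

    assert(read32 != NULL);
    s.event = read32(MPC_EVENTS_MEMACCERR);
    s.address = read32(MPC_MEMACCERR_ADDRESS);
    s.info = read32(MPC_MEMACCERR_INFO);
    return s;
}

/* True if an error is latched and it was caused by an access to probe. */
static bool mpc_error_at(const struct mpc_snapshot *s, uint32_t probe)
{
    return s->event != 0u && s->address == probe;
}

/* Replays the values $4..$6 printed in the session above. */
static uint32_t replay_after_probe(uint32_t addr)
{
    switch (addr) {
    case MPC_EVENTS_MEMACCERR:  return 0x1u;
    case MPC_MEMACCERR_ADDRESS: return 0xAAAAAAAAu;
    case MPC_MEMACCERR_INFO:    return 0x19041u;
    default:                    return 0u;
    }
}
```

Feeding `replay_after_probe` into `mpc_capture` gives a snapshot for which `mpc_error_at(&s, 0xAAAAAAAA)` holds, which is the same conclusion as reading the three prints by eye.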

This behaviour correlates with your observation:

The app runs fine until you pause execution in any way with the debugger. When GDB pauses execution, it reads the call stacks which eventually lead to a dummy return address since GDB doesn't know to stop at z_thread_entry

With your configuration, i.e. with the secure/non-secure split:

nrf54l15dk/nrf54l15/cpuapp/ns board

the secure side, i.e. TF-M, will own the MPC peripheral.

I took the hello_world sample, configured it with /ns board, and tried the same procedure. I can see that the MPC registers are set as before when doing a print *0xAAAAAAAA, and can confirm that TF-M will trigger a fault on this:

arch_cpu_idle () at /opt/ncs/zephyr/arch/arm/core/cortex_m/cpu_idle.c:104
104             __enable_irq();
(gdb) print *0xaaaaaaaa
Cannot access memory at address 0xaaaaaaaa
(gdb) c
Continuing.
^C
Program received signal SIGTRAP, Trace/breakpoint trap.
tfm_hal_system_halt () at /opt/ncs/nrf/modules/trusted-firmware-m/tfm_boards/common/tfm_hal_reset_halt.c:30
30              while (1) {
(gdb) bt
#0  tfm_hal_system_halt () at /opt/ncs/nrf/modules/trusted-firmware-m/tfm_boards/common/tfm_hal_reset_halt.c:30
#1  0x0000744c in tfm_core_panic () at /opt/ncs/modules/tee/tf-m/trusted-firmware-m/secure_fw/spm/core/utilities.c:26
#2  0x0000640a in MPC_Handler ()
    at /opt/ncs/modules/tee/tf-m/trusted-firmware-m/platform/ext/target/nordic_nrf/common/core/faults.c:102
Backtrace stopped: previous fra

Kind regards,

Håkon

michael.feist.etc 6 months ago in reply to Håkon Alseth

I tested the hello_world sample to simplify this as well. To make the crash more repeatable, it helps to set CONFIG_INIT_STACKS=y, which forces that return address to a bogus address that is unmapped in memory. If the address is 0x0, the MPC won't trigger. Any thoughts on whether this is worth bringing up in Zephyr's GitHub issues?

Håkon Alseth 6 months ago in reply to michael.feist.etc

Hi,

The MPC is a generic Cortex-M33 feature/peripheral, whereas the memory map of a Cortex-M device is implementation-specific (i.e. vendor- and/or device-specific), so changing the Zephyr configuration might not be the best option.

The MPC peripheral does what it is supposed to do here: it detects an out-of-bounds read and triggers the MPC event, and since TF-M has MPC_IRQn enabled, the fault is handled on the secure side.

I suspect the proper way around this would be to add a debug check in TF-M, similar to this one, with a guard of CONFIG_INIT_STACKS:

https://github.com/nrfconnect/sdk-trusted-firmware-m/blob/ncs-v3.2.0-rc1/platform/ext/target/nordic_nrf/common/core/faults.c#L111-L126

I will report this to the TF-M team internally. Note, however, that CONFIG_INIT_STACKS hardcodes the value per architecture; for Cortex-M it is set here: https://github.com/nrfconnect/sdk-zephyr/blob/main/arch/arm/core/cortex_m/reset.S#L204

This means that the value to check for is technically subject to change, and unless Zephyr is changed, the magic value would also have to be duplicated in TF-M.
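A rough sketch of what such a guard could look like, written as a host-testable decision function. This is not the TF-M code linked above and has not been checked against it. `classify_mpc_fault`, the enum and `ZEPHYR_STACK_INIT_PATTERN` are hypothetical names, and the 0xAAAAAAAA pattern is the duplicated magic value discussed above. The one architectural fact used is that C_DEBUGEN is bit 0 of the DHCSR register (0xE000EDF0), which is set when halting debug is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: pattern duplicated from Zephyr's stack initialization. It
 * would have to be kept in sync by hand if Zephyr ever changed it. */
#define ZEPHYR_STACK_INIT_PATTERN 0xAAAAAAAAu

/* C_DEBUGEN is bit 0 of the Arm DHCSR register at 0xE000EDF0. */
#define DHCSR_C_DEBUGEN_MASK 0x1u

enum mpc_fault_action {
    MPC_FAULT_PANIC,  /* treat as a real access violation */
    MPC_FAULT_IGNORE, /* likely a debugger probing an uninitialized slot */
};

/* Hypothetical policy: only ignore the fault when halting debug is enabled
 * AND the faulting address equals the stack fill pattern. The DHCSR value
 * is passed in so the logic can be exercised without hardware. */
static enum mpc_fault_action classify_mpc_fault(uint32_t fault_address,
                                                uint32_t dhcsr)
{
    bool debug_enabled = (dhcsr & DHCSR_C_DEBUGEN_MASK) != 0u;

    if (debug_enabled && fault_address == ZEPHYR_STACK_INIT_PATTERN) {
        return MPC_FAULT_IGNORE;
    }
    return MPC_FAULT_PANIC;
}

/* Usage example: the cases discussed in this thread. */
static void classify_mpc_fault_example(void)
{
    assert(classify_mpc_fault(0xAAAAAAAAu, 0x1u) == MPC_FAULT_IGNORE);
    assert(classify_mpc_fault(0xAAAAAAAAu, 0x0u) == MPC_FAULT_PANIC);
    assert(classify_mpc_fault(0x12345678u, 0x1u) == MPC_FAULT_PANIC);
}
```

Requiring both conditions keeps the panic path intact for production builds and for any other bad address, at the cost of carrying the magic value in two places.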

Kind regards,

Håkon