Pinpointing hard faults caused during SoftDevice execution more effectively

Question

Hi, I'm working on integrating the reference secure DFU bootloader from the nRF5 SDK (nRF5_SDK_17.1.0_ddde560/examples/dfu/secure_bootloader/pca10100_s140_ble_debug/armgcc) to my application, which is running on a custom nRF52833 board with SoftDevice s140 7.3.0. Integration is going well, other than a few issues that took a while to debug because the error would immediately cause the board to reboot.

Particularly, there was one hard fault issue that took almost a full day to trace down because the root cause of the hard fault was very difficult to trace down with the backtrace alone.

Here's the log from the bootloader leading up to that issue, with hard fault logging enabled:

<info> app: Inside main
<debug> app: In nrf_bootloader_init
<debug> nrf_dfu_settings: Calling nrf_dfu_settings_init()...
<debug> nrf_dfu_flash: Initializing nrf_fstorage_nvmc backend.
<debug> nrf_dfu_settings: Using settings page.
<debug> nrf_dfu_settings: Copying forbidden parts from backup page.
<debug> nrf_dfu_settings: Destination settings are identical to source, write not needed. Skipping.
<info> nrf_dfu_settings: Backing up settings page to address 0x7E000.
<debug> nrf_dfu_settings: Destination settings are identical to source, write not needed. Skipping.
<debug> app: Enter nrf_bootloader_fw_activate
<info> app: No firmware to activate.
<debug> app: App is valid
<debug> app: DFU mode requested via GPREGRET.
<info> nrf_bootloader_wdt: WDT is not enabled
<debug> app: in weak nrf_dfu_init_user
<debug> app: timer_stop (0x200068BC)
<debug> app: timer_activate (0x200068BC)
<info> app: Entering DFU mode.
<debug> app: Initializing transports (found: 1)
<debug> nrf_dfu_ble: Initializing BLE DFU transport
<debug> nrf_dfu_ble: Enabling SoftDevice.
<debug> nrf_dfu_ble: Configuring BLE stack.
<debug> nrf_dfu_ble: Enabling the BLE stack.
<debug> nrf_dfu_ble: No advertising name found
<debug> nrf_dfu_ble: Using default advertising name
<debug> nrf_dfu_ble: Advertising...
<debug> nrf_dfu_ble: BLE DFU transport initialized.
<debug> nrf_dfu_flash: Initializing nrf_fstorage_sd backend.
<debug> app: Enter main loop
<debug> nrf_dfu_ble: Connected
<debug> nrf_dfu_ble: Received BLE_GAP_EVT_CONN_PARAM_UPDATE
<debug> nrf_dfu_ble: max_conn_interval: 12
<debug> nrf_dfu_ble: min_conn_interval: 12
<debug> nrf_dfu_ble: slave_latency: 0
<debug> nrf_dfu_ble: conn_sup_timeout: 600
<debug> nrf_dfu_ble: Finished handling conn sec update
<debug> nrf_dfu_ble: Set receipt notif
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_RECEIPT_NOTIF_SET
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_ble: Received BLE_GATTS_EVT_EXCHANGE_MTU_REQUEST (request: 247, reply: 247).
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_SELECT (command)
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> app: Shutting down transports (found: 1)
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_CREATE (command)
<debug> app: timer_stop (0x200068BC)
<debug> app: timer_activate (0x200068BC)
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_ble: Buffer 0x20007EEC acquired, len 145 (244)
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_WRITE (command)
<debug> nrf_dfu_ble: Freeing buffer 0x20007EEC
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_CRC_GET (command)
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_EXECUTE (command)
<debug> nrf_dfu_validation: PB: Init packet data len: 68
<info> nrf_dfu_validation: Signature required. Checking signature.
<info> nrf_dfu_validation: Calculating hash (len: 68)
<info> nrf_dfu_validation: Verify signature
<info> nrf_dfu_validation: Image verified
<debug> app: Enter nrf_dfu_cache_prepare()
<debug> app: required_size: 0x395C4.
<debug> app: single_bank: true.
<debug> app: keep_app: false.
<debug> app: keep_softdevice: true.
<debug> app: SD_PRESENT: true.
<debug> app: Bank contents:
<debug> app: Bank 0 code: 0x01: Size: 0x395C4
<debug> app: Bank 1 code: 0x00: Size: 0x0
<debug> app: pass: 0.
<debug> app: cache_address: 0x61000.
<debug> app: cache_too_small: true.
<debug> app: keep_firmware: false.
<debug> app: delete_more: true.
<debug> app: pass: 1.
<debug> app: cache_address: 0x27000.
<debug> app: cache_too_small: false.
<debug> app: keep_firmware: true.
<debug> app: delete_more: true.
<debug> app: Invalidating app.
<debug> nrf_dfu_validation: Write address set to 0x00027000
<debug> nrf_dfu_settings: Writing settings...
<debug> nrf_dfu_settings: Erasing old settings at: 0x0007F000
<debug> nrf_dfu_flash: nrf_fstorage_erase(addr=0x0x0007F000, len=1 pages), queue usage: 0
<debug> nrf_dfu_flash: nrf_fstorage_write(addr=0x0007F000, src=0x200076AC, len=896 bytes), queue usage: 1
<info> nrf_dfu_settings: Backing up settings page to address 0x7E000.
<debug> nrf_dfu_settings: Writing settings...
<debug> nrf_dfu_settings: Erasing old settings at: 0x0007E000
<debug> nrf_dfu_flash: nrf_fstorage_erase(addr=0x0x0007E000, len=1 pages), queue usage: 2
<debug> nrf_dfu_flash: nrf_fstorage_write(addr=0x0007E000, src=0x20007328, len=896 bytes), queue usage: 3
<debug> nrf_dfu_req_handler: Writing valid init command to flash.
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_SELECT (data)
<debug> nrf_dfu_req_handler: crc = 0x0, offset = 0x0, max_size = 0x1000
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_SELECT (data)
<debug> nrf_dfu_req_handler: crc = 0x0, offset = 0x0, max_size = 0x1000
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_flash: Flash erase success: addr=0x0007F000, pending 3
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_CREATE (data)
<debug> nrf_dfu_flash: nrf_fstorage_erase(addr=0x0x00027000, len=1 pages), queue usage: 3
<debug> nrf_dfu_req_handler: Creating object with size: 4096. Offset: 0x00000000, CRC: 0x00000000
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_flash: Flash write success: addr=0x0007F000, pending 3
<debug> nrf_dfu_ble: Buffer 0x20007EEC acquired, len 244 (244)
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_WRITE (data)
<debug> nrf_dfu_flash: nrf_fstorage_write(addr=0x00027000, src=0x20007EEC, len=244 bytes), queue usage: 3
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_ble: Buffer 0x20007FE0 acquired, len 244 (244)
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_WRITE (data)
<debug> nrf_dfu_flash: nrf_fstorage_write(addr=0x000270F4, src=0x20007FE0, len=244 bytes), queue usage: 4
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<error> hardfault: HARD FAULT at 0x000258F2
<error> hardfault:   R0:  0x00000001  R1:  0x4001E400  R2:  0x4001E504  R3:  0x4001E508
<error> hardfault:   R12: 0x20006198  LR:  0x000245D3  PSR: 0x010F0018
<error> hardfault: Cause: Data bus error (return address in the stack frame is not related to the instruction that caused the error).

Breaking on HardFault_process, here's the backtrace from GDB:

Breakpoint 2, HardFault_process (p_stack=0x2001ff78) at ../../../../../components/libraries/hardfault/hardfault_implementation.c:56
56      __WEAK void HardFault_process(HardFault_stack_t * p_stack)
(gdb) bt
#0  HardFault_process (p_stack=0x2001ff78) at ../../../../../components/libraries/hardfault/hardfault_implementation.c:56
#1  HardFault_c_handler (p_stack_address=0x2001ff78) at ../../../../../components/libraries/hardfault/hardfault_implementation.c:154
#2  <signal handler called>
#3  0x000258f2 in ?? ()
#4  0x000245d2 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

As you can see, execution is stuck inside the SoftDevice, and it's hard to figure out what is going on. The backtrace is all within the black-box SoftDevice space.

The root cause turned out to be that ACL was still set to write-protect Bank 0 from the initial app load from bootloader: right before the bootloader executes the app, it write-protects Bank 0 by default. When the app transitions back to the bootloader, it needs to perform a full reset (NVIC_SystemReset) to clear the protection, rather than a jump to the bootloader address. This way, Bank 0 is available for a single-bank update, since my firmware is too big for a dual-bank update.

Retrospectively, before I had added the debug hard fault handler, I had missed one clue from a register dump when the hard fault is hit:

(gdb) c
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
NMI_Handler () at ../../../../../modules/nrfx/mdk/gcc_startup_nrf52833.S:302
302         b       .
(gdb) bt
#0  NMI_Handler () at ../../../../../modules/nrfx/mdk/gcc_startup_nrf52833.S:302
#1  <signal handler called>
#2  0x000258f2 in ?? ()
#3  0x000245d2 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) i r
r0             0x68                104
r1             0x258f0             153840
r2             0x27000             159744
r3             0x722db             467675
r4             0x200001b0          536871344
r5             0x1                 1
r6             0xc5c10             810000
r7             0x1                 1
r8             0x1                 1
r9             0x0                 0
r10            0x73fa5             475045
r11            0x7d2b8             512696
r12            0x200060b3          536895667
sp             0x2001ff88          0x2001ff88
lr             0xfffffff1          4294967281
pc             0x722da             0x722da <HardFault_Handler>
xpsr           0x210f0003          554631171
msp            0x2001ff88          537001864
psp            0x20010830          536938544
primask        0x0                 0
basepri        0x0                 0
faultmask      0x0                 0
control        0x0                 0
fpscr          0x0                 0

r3 is set to 0x27000, which lines up with the start of Bank 0 in flash. The crash was delayed because fstorage writes are queued, and flash operations are handled (asynchronously?) by the SoftDevice. However, it seems like the SoftDevice doesn't have a way of signaling clearly, "Hey, this hard fault happened during X operation." Or, when flash erase/writes are happening, the SoftDevice should handle hard faults momentarily to catch the error so it can be sent back as an event. Otherwise, the context of the error is lost.

Is there an easier way to see what a hard fault was caused by when execution is occurring in the SoftDevice? Is there a documented way to determine what type of operation the SoftDevice was executing?

Thanks in advance.