Pinpointing hard faults caused during SoftDevice execution more effectively

Hi, I'm working on integrating the reference secure DFU bootloader from the nRF5 SDK (nRF5_SDK_17.1.0_ddde560/examples/dfu/secure_bootloader/pca10100_s140_ble_debug/armgcc) to my application, which is running on a custom nRF52833 board with SoftDevice s140 7.3.0. Integration is going well, other than a few issues that took a while to debug because the error would immediately cause the board to reboot.

Particularly, there was one hard fault issue that took almost a full day to trace down because the root cause of the hard fault was very difficult to trace down with the backtrace alone.

Here's the log from the bootloader leading up to that issue, with hard fault logging enabled:

<info> app: Inside main
<debug> app: In nrf_bootloader_init
<debug> nrf_dfu_settings: Calling nrf_dfu_settings_init()...
<debug> nrf_dfu_flash: Initializing nrf_fstorage_nvmc backend.
<debug> nrf_dfu_settings: Using settings page.
<debug> nrf_dfu_settings: Copying forbidden parts from backup page.
<debug> nrf_dfu_settings: Destination settings are identical to source, write not needed. Skipping.
<info> nrf_dfu_settings: Backing up settings page to address 0x7E000.
<debug> nrf_dfu_settings: Destination settings are identical to source, write not needed. Skipping.
<debug> app: Enter nrf_bootloader_fw_activate
<info> app: No firmware to activate.
<debug> app: App is valid
<debug> app: DFU mode requested via GPREGRET.
<info> nrf_bootloader_wdt: WDT is not enabled
<debug> app: in weak nrf_dfu_init_user
<debug> app: timer_stop (0x200068BC)
<debug> app: timer_activate (0x200068BC)
<info> app: Entering DFU mode.
<debug> app: Initializing transports (found: 1)
<debug> nrf_dfu_ble: Initializing BLE DFU transport
<debug> nrf_dfu_ble: Enabling SoftDevice.
<debug> nrf_dfu_ble: Configuring BLE stack.
<debug> nrf_dfu_ble: Enabling the BLE stack.
<debug> nrf_dfu_ble: No advertising name found
<debug> nrf_dfu_ble: Using default advertising name
<debug> nrf_dfu_ble: Advertising...
<debug> nrf_dfu_ble: BLE DFU transport initialized.
<debug> nrf_dfu_flash: Initializing nrf_fstorage_sd backend.
<debug> app: Enter main loop
<debug> nrf_dfu_ble: Connected
<debug> nrf_dfu_ble: Received BLE_GAP_EVT_CONN_PARAM_UPDATE
<debug> nrf_dfu_ble: max_conn_interval: 12
<debug> nrf_dfu_ble: min_conn_interval: 12
<debug> nrf_dfu_ble: slave_latency: 0
<debug> nrf_dfu_ble: conn_sup_timeout: 600
<debug> nrf_dfu_ble: Finished handling conn sec update
<debug> nrf_dfu_ble: Set receipt notif
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_RECEIPT_NOTIF_SET
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_ble: Received BLE_GATTS_EVT_EXCHANGE_MTU_REQUEST (request: 247, reply: 247).
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_SELECT (command)
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> app: Shutting down transports (found: 1)
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_CREATE (command)
<debug> app: timer_stop (0x200068BC)
<debug> app: timer_activate (0x200068BC)
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_ble: Buffer 0x20007EEC acquired, len 145 (244)
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_WRITE (command)
<debug> nrf_dfu_ble: Freeing buffer 0x20007EEC
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_CRC_GET (command)
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_EXECUTE (command)
<debug> nrf_dfu_validation: PB: Init packet data len: 68
<info> nrf_dfu_validation: Signature required. Checking signature.
<info> nrf_dfu_validation: Calculating hash (len: 68)
<info> nrf_dfu_validation: Verify signature
<info> nrf_dfu_validation: Image verified
<debug> app: Enter nrf_dfu_cache_prepare()
<debug> app: required_size: 0x395C4.
<debug> app: single_bank: true.
<debug> app: keep_app: false.
<debug> app: keep_softdevice: true.
<debug> app: SD_PRESENT: true.
<debug> app: Bank contents:
<debug> app: Bank 0 code: 0x01: Size: 0x395C4
<debug> app: Bank 1 code: 0x00: Size: 0x0
<debug> app: pass: 0.
<debug> app: cache_address: 0x61000.
<debug> app: cache_too_small: true.
<debug> app: keep_firmware: false.
<debug> app: delete_more: true.
<debug> app: pass: 1.
<debug> app: cache_address: 0x27000.
<debug> app: cache_too_small: false.
<debug> app: keep_firmware: true.
<debug> app: delete_more: true.
<debug> app: Invalidating app.
<debug> nrf_dfu_validation: Write address set to 0x00027000
<debug> nrf_dfu_settings: Writing settings...
<debug> nrf_dfu_settings: Erasing old settings at: 0x0007F000
<debug> nrf_dfu_flash: nrf_fstorage_erase(addr=0x0x0007F000, len=1 pages), queue usage: 0
<debug> nrf_dfu_flash: nrf_fstorage_write(addr=0x0007F000, src=0x200076AC, len=896 bytes), queue usage: 1
<info> nrf_dfu_settings: Backing up settings page to address 0x7E000.
<debug> nrf_dfu_settings: Writing settings...
<debug> nrf_dfu_settings: Erasing old settings at: 0x0007E000
<debug> nrf_dfu_flash: nrf_fstorage_erase(addr=0x0x0007E000, len=1 pages), queue usage: 2
<debug> nrf_dfu_flash: nrf_fstorage_write(addr=0x0007E000, src=0x20007328, len=896 bytes), queue usage: 3
<debug> nrf_dfu_req_handler: Writing valid init command to flash.
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_SELECT (data)
<debug> nrf_dfu_req_handler: crc = 0x0, offset = 0x0, max_size = 0x1000
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_SELECT (data)
<debug> nrf_dfu_req_handler: crc = 0x0, offset = 0x0, max_size = 0x1000
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_flash: Flash erase success: addr=0x0007F000, pending 3
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_CREATE (data)
<debug> nrf_dfu_flash: nrf_fstorage_erase(addr=0x0x00027000, len=1 pages), queue usage: 3
<debug> nrf_dfu_req_handler: Creating object with size: 4096. Offset: 0x00000000, CRC: 0x00000000
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_flash: Flash write success: addr=0x0007F000, pending 3
<debug> nrf_dfu_ble: Buffer 0x20007EEC acquired, len 244 (244)
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_WRITE (data)
<debug> nrf_dfu_flash: nrf_fstorage_write(addr=0x00027000, src=0x20007EEC, len=244 bytes), queue usage: 3
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<debug> nrf_dfu_ble: Buffer 0x20007FE0 acquired, len 244 (244)
<debug> nrf_dfu_req_handler: Handle NRF_DFU_OP_OBJECT_WRITE (data)
<debug> nrf_dfu_flash: nrf_fstorage_write(addr=0x000270F4, src=0x20007FE0, len=244 bytes), queue usage: 4
<debug> nrf_dfu_req_handler: Request handling complete. Result: 0x1
<error> hardfault: HARD FAULT at 0x000258F2
<error> hardfault:   R0:  0x00000001  R1:  0x4001E400  R2:  0x4001E504  R3:  0x4001E508
<error> hardfault:   R12: 0x20006198  LR:  0x000245D3  PSR: 0x010F0018
<error> hardfault: Cause: Data bus error (return address in the stack frame is not related to the instruction that caused the error).

Breaking on HardFault_process, here's the backtrace from GDB:

Breakpoint 2, HardFault_process (p_stack=0x2001ff78) at ../../../../../components/libraries/hardfault/hardfault_implementation.c:56
56      __WEAK void HardFault_process(HardFault_stack_t * p_stack)
(gdb) bt
#0  HardFault_process (p_stack=0x2001ff78) at ../../../../../components/libraries/hardfault/hardfault_implementation.c:56
#1  HardFault_c_handler (p_stack_address=0x2001ff78) at ../../../../../components/libraries/hardfault/hardfault_implementation.c:154
#2  <signal handler called>
#3  0x000258f2 in ?? ()
#4  0x000245d2 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

As you can see, execution is stuck inside the SoftDevice, and it's hard to figure out what is going on. The backtrace is all within the black-box SoftDevice space.

The root cause turned out to be that ACL was still set to write-protect Bank 0 from the initial app load from bootloader: right before the bootloader executes the app, it write-protects Bank 0 by default. When the app transitions back to the bootloader, it needs to perform a full reset (NVIC_SystemReset) to clear the protection, rather than a jump to the bootloader address. This way, Bank 0 is available for a single-bank update, since my firmware is too big for a dual-bank update.

Retrospectively, before I had added the debug hard fault handler, I had missed one clue from a register dump when the hard fault is hit:

(gdb) c
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
NMI_Handler () at ../../../../../modules/nrfx/mdk/gcc_startup_nrf52833.S:302
302         b       .
(gdb) bt
#0  NMI_Handler () at ../../../../../modules/nrfx/mdk/gcc_startup_nrf52833.S:302
#1  <signal handler called>
#2  0x000258f2 in ?? ()
#3  0x000245d2 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) i r
r0             0x68                104
r1             0x258f0             153840
r2             0x27000             159744
r3             0x722db             467675
r4             0x200001b0          536871344
r5             0x1                 1
r6             0xc5c10             810000
r7             0x1                 1
r8             0x1                 1
r9             0x0                 0
r10            0x73fa5             475045
r11            0x7d2b8             512696
r12            0x200060b3          536895667
sp             0x2001ff88          0x2001ff88
lr             0xfffffff1          4294967281
pc             0x722da             0x722da <HardFault_Handler>
xpsr           0x210f0003          554631171
msp            0x2001ff88          537001864
psp            0x20010830          536938544
primask        0x0                 0
basepri        0x0                 0
faultmask      0x0                 0
control        0x0                 0
fpscr          0x0                 0

r3 is set to 0x27000, which lines up with the start of Bank 0 in flash. The crash was delayed because fstorage writes are queued, and flash operations are handled (asynchronously?) by the SoftDevice. However, it seems like the SoftDevice doesn't have a way of signaling clearly, "Hey, this hard fault happened during X operation." Or, when flash erase/writes are happening, the SoftDevice should handle hard faults momentarily to catch the error so it can be sent back as an event. Otherwise, the context of the error is lost.

Is there an easier way to see what a hard fault was caused by when execution is occurring in the SoftDevice? Is there a documented way to determine what type of operation the SoftDevice was executing?

Thanks in advance.

Parents
  • However, it seems like the SoftDevice doesn't have a way of signaling clearly, "Hey, this hard fault happened during X operation." Or, when flash erase/writes are happening, the SoftDevice should handle hard faults momentarily to catch the error so it can be sent back as an event. Otherwise, the context of the error is lost.

    Softdevice do have a way to signal faults through the app_error_fault_handler if the softdevice has asserted or there is a hardfault in the softdevice. But I think in this case the whole debug method is based on the fact that you think the hardfault info that is pulled from the stack is uncorrupted. 

    Both the logs and you GDB tells you that the stack frame does not look sane

    <error> hardfault: Cause: Data bus error (return address in the stack frame is not related to the instruction that caused the error).
    
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)

    These normally means that the stack frame does not look sane and most likely corrupted. In such cases hardfault debugging becomes even more manual and there is given way to debug. The stack frame could be corrupted by many reasons and it is not necessarily caused by softdevice. 

Reply
  • However, it seems like the SoftDevice doesn't have a way of signaling clearly, "Hey, this hard fault happened during X operation." Or, when flash erase/writes are happening, the SoftDevice should handle hard faults momentarily to catch the error so it can be sent back as an event. Otherwise, the context of the error is lost.

    Softdevice do have a way to signal faults through the app_error_fault_handler if the softdevice has asserted or there is a hardfault in the softdevice. But I think in this case the whole debug method is based on the fact that you think the hardfault info that is pulled from the stack is uncorrupted. 

    Both the logs and you GDB tells you that the stack frame does not look sane

    <error> hardfault: Cause: Data bus error (return address in the stack frame is not related to the instruction that caused the error).
    
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)

    These normally means that the stack frame does not look sane and most likely corrupted. In such cases hardfault debugging becomes even more manual and there is given way to debug. The stack frame could be corrupted by many reasons and it is not necessarily caused by softdevice. 

Children
  • Hi Susheel,

    Thanks for your answer. This reference bootloader already contains an app_error_fault_handler method (main.c:86). It was not triggered.

    When the error message says "return address in the stack frame is not related to the instruction that caused the error," I believe it is referring to the fact that this data bus error is an asynchronous bus fault (IMPRECISERR) and therefore the instruction that the processor stopped at may not have been the instruction that caused the error.

    "Backtrace stopped: previous frame identical to this frame (corrupt stack?)" is a generic note from GDB, and it doesn't mean that the stack was corrupted. The stack can change due to context switches, and I usually ignore this message unless the backtrace looks really messed up with garbage addresses. As the SoftDevice is a hard real time system that raises and handles interrupts, it makes sense to me that the stack only traces back to the SoftDevice.

    Maybe in the future, I should try unwinding the stack pointed to by the PSP and seeing what's there, since that's application code and it'll give me the general area of where the fault occurred on the application side.

    Apparently, the switch between PSP and MSP does trip people up and cause them to believe that the stack might have been messed up somehow.

  • oldmud0,

    A great number of times I have had been dealing with stack corrupted hardfaults and hence was thinking in that direction aswell. I haven't used the GDB that much myself but I believe that you have seen this message about "backtrace stopped" enough number of times with valid stack frames. So I trust you and believe that the stack frame is still sane.

    oldmud0 said:
    Maybe in the future, I should try unwinding the stack pointed to by the PSP and seeing what's there, since that's application code and it'll give me the general area of where the fault occurred on the application side.

    OK, you seem to know your stuff. That might be helpful actually if you put a breakpoint in components\libraries\hardfault\nrf52\handler\hardfault_handler_gcc.c and get the application context from there instead of the softdevice context that caused this hardfault.

Related