Device crashes when doing a software reset

I'm developing software with the nRF Connect SDK v2.5.1. It runs Matter on the nRF5340.

I need to reset the MCU from my app. To do so, I call sys_reboot(SYS_REBOOT_COLD), but this makes the OS crash with the following trace:

uart:~$ E: IPC endpoint bind timed out
ASSERTION FAIL @ WEST_TOPDIR/zephyr/drivers/ieee802154/ieee802154_nrf5.c:1153
E: r0/a1:  0x00000004  r1/a2:  0x00000481  r2/a3:  0x2000ca30
E: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00023e0f
E:  xpsr:  0x69100000
E: s[ 0]:  0xffffffff  s[ 1]:  0x000134c5  s[ 2]:  0x00000000  s[ 3]:  0x00038607
E: s[ 4]:  0x00008000  s[ 5]:  0x000aeab4  s[ 6]:  0x000a6548  s[ 7]:  0x000aebf0
E: s[ 8]:  0x00000000  s[ 9]:  0x00072377  s[10]:  0x00008000  s[11]:  0x20027404
E: s[12]:  0x20002f40  s[13]:  0x00023e05  s[14]:  0x000ac36c  s[15]:  0x000aeab4
E: fpscr:  0x00000481
E: Faulting instruction address (r15/pc): 0x00072362
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x2000ca30 (main)
E: Halting system

Is there a way to fix this?

Parents
  • Based on the limited info provided, it seems like the main thread might have a stack overflow.

    Try increasing CONFIG_MAIN_STACK_SIZE in your prj.conf and see if the assert goes away.

    You need to understand all the contexts (RTOS threads and interrupts) on your system and get an overview of their memory usage. While prototyping, it is a good idea to enable CONFIG_THREAD_ANALYZER.
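
    As a starting point, a prj.conf fragment along these lines should cover both suggestions. This is only a sketch: the stack size and the reporting interval are illustrative values I picked, not recommendations for your application.

```
# Illustrative value only; size it from the thread analyzer output.
CONFIG_MAIN_STACK_SIZE=8192

# Periodically print per-thread stack usage.
CONFIG_THREAD_ANALYZER=y
CONFIG_THREAD_ANALYZER_AUTO=y
# Interval in seconds (assumed value for this example).
CONFIG_THREAD_ANALYZER_AUTO_INTERVAL=5
# Show thread names instead of raw addresses where available.
CONFIG_THREAD_NAME=y
```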

  • I already tried to increase the main stack size and the work queue size, but it doesn't solve the issue.

    If it helps, the issue happens only when the device is commissioned and bonded in a Matter over Thread fabric. If the device is "offline", the reset procedure works fine.

  • Have you enabled the Thread analyzer? Are any other threads close to their stack limits? Can you post your Thread analyzer output from just before this hardfault happened? If the stacks look good, we can rule out a stack overflow and look at what else caused this hardfault.

  • I enabled it now with CONFIG_THREAD_ANALYZER and CONFIG_THREAD_ANALYZER_AUTO.

    Here is the log before it dies:

     spinel_packet_send_thread: STACK: unused 840 usage 184 / 1024 (17 %); CPU: 0 %
          : Total CPU cycles used: 0
    
     rx_q[0]             : STACK: unused 1288 usage 248 / 1536 (16 %); CPU: 0 %
          : Total CPU cycles used: 4
    
     openthread          : STACK: unused 2712 usage 1320 / 4032 (32 %); CPU: 2 %
          : Total CPU cycles used: 987
    
     ot_radio_workq      : STACK: unused 208 usage 816 / 1024 (79 %); CPU: 0 %
          : Total CPU cycles used: 122
    
     nrf5_rx             : STACK: unused 272 usage 688 / 960 (71 %); CPU: 0 %
          : Total CPU cycles used: 27
    
     0x20006a88          : STACK: unused 468 usage 492 / 960 (51 %); CPU: 1 %
          : Total CPU cycles used: 413
    
     sysworkq            : STACK: unused 1856 usage 192 / 2048 (9 %); CPU: 0 %
          : Total CPU cycles used: 0
    
     shell_uart          : STACK: unused 1776 usage 272 / 2048 (13 %); CPU: 0 %
          : Total CPU cycles used: 3
    
     idle                : STACK: unused 952 usage 72 / 1024 (7 %); CPU: 0 %
          : Total CPU cycles used: 313
    
     main                : STACK: unused 1248 usage 2784 / 4032 (69 %); CPU: 41 %
          : Total CPU cycles used: 15587
    
     ISR0                : STACK: unused 856 usage 1192 / 2048 (58 %)
     
    I: Received command over UART
    W: Device will restart !
    
    uart:~$ E: IPC endpoint bind timed out
    ASSERTION FAIL @ WEST_TOPDIR/zephyr/drivers/ieee802154/ieee802154_nrf5.c:1153
    E: r0/a1:  0x00000004  r1/a2:  0x00000481  r2/a3:  0x2000cc18
    E: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00023fa3
    E:  xpsr:  0x69100000
    E: s[ 0]:  0xffffffff  s[ 1]:  0x000134e1  s[ 2]:  0x00000000  s[ 3]:  0x000388c3
    E: s[ 4]:  0x00008000  s[ 5]:  0x000af25c  s[ 6]:  0x000a6b84  s[ 7]:  0x000af398
    E: s[ 8]:  0x00000000  s[ 9]:  0x000728a3  s[10]:  0x00008000  s[11]:  0x20027a84
    E: s[12]:  0x20002f40  s[13]:  0x00023f99  s[14]:  0x000aca20  s[15]:  0x000af25c
    E: fpscr:  0x00000481
    E: Faulting instruction address (r15/pc): 0x0007288e
    E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
    E: Current thread: 0x2000cc18 (main)
    E: Halting system

  • The nrf5_rx stack size (CONFIG_IEEE802154_NRF5_RX_STACK_SIZE) and the ot_radio_workq stack size (CONFIG_OPENTHREAD_RADIO_WORKQUEUE_STACK_SIZE) look a bit suspicious as well. Can you increase those too and see if the behavior is the same? If it is, can you give steps to reproduce?
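
    To make "suspicious" concrete, here is a small host-side sketch of the check I am doing by eye on the analyzer output. The struct, the function names, and the 70 % threshold are my own assumptions for illustration; the numbers in the table are copied from the log above. Keep in mind that the analyzer reports the high-water mark seen so far, not a guaranteed worst case.

```c
#include <stdbool.h>
#include <stddef.h>

/* One row of thread analyzer output: "usage <used> / <size>". */
struct stack_report {
    const char *name;
    unsigned int used; /* bytes used at the high-water mark */
    unsigned int size; /* total stack size in bytes */
};

/* Hypothetical threshold: flag stacks at or above this usage. */
#define STACK_WARN_PCT 70u

/* Usage in percent, rounded down like the analyzer output. */
unsigned int stack_usage_pct(const struct stack_report *r)
{
    return r->size ? (r->used * 100u) / r->size : 0u;
}

bool stack_needs_headroom(const struct stack_report *r)
{
    return stack_usage_pct(r) >= STACK_WARN_PCT;
}

/* Count how many entries in a table are at or above the threshold. */
size_t count_tight_stacks(const struct stack_report *reports, size_t n)
{
    size_t tight = 0;

    for (size_t i = 0; i < n; i++) {
        if (stack_needs_headroom(&reports[i])) {
            tight++;
        }
    }
    return tight;
}

/* Values taken from the log posted before the crash. */
static const struct stack_report first_log[] = {
    { "ot_radio_workq", 816, 1024 },
    { "nrf5_rx", 688, 960 },
    { "main", 2784, 4032 },
    { "openthread", 1320, 4032 },
};
```

    With that threshold, only ot_radio_workq and nrf5_rx are flagged, which is why I am asking about those two first.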

  • The changes don't improve the result. Here is the log, with the thread analysis from a few seconds before the crash.

    uart:~$ Thread analyze:
     CHIP                : STACK: unused 4360 usage 1720 / 6080 (28 %); CPU: 1 %
          : Total CPU cycles used: 16408
     BT RX               : STACK: unused 1008 usage 192 / 1200 (16 %); CPU: 0 %
          : Total CPU cycles used: 0
     BT TX               : STACK: unused 696 usage 328 / 1024 (32 %); CPU: 0 %
          : Total CPU cycles used: 54
     thread_command      : STACK: unused 816 usage 208 / 1024 (20 %); CPU: 18 %
          : Total CPU cycles used: 256407
     thread_analyzer     : STACK: unused 544 usage 480 / 1024 (46 %); CPU: 0 %
          : Total CPU cycles used: 6329
     spinel_packet_send_thread: STACK: unused 840 usage 184 / 1024 (17 %); CPU: 0 %
          : Total CPU cycles used: 0
     rx_q[0]             : STACK: unused 1288 usage 248 / 1536 (16 %); CPU: 0 %
          : Total CPU cycles used: 9
     openthread          : STACK: unused 2032 usage 2000 / 4032 (49 %); CPU: 0 %
          : Total CPU cycles used: 4657
     ot_radio_workq      : STACK: unused 1168 usage 816 / 1984 (41 %); CPU: 0 %
          : Total CPU cycles used: 1553
     nrf5_rx             : STACK: unused 1296 usage 688 / 1984 (34 %); CPU: 0 %
          : Total CPU cycles used: 39
     0x20006e88          : STACK: unused 336 usage 624 / 960 (65 %); CPU: 0 %
          : Total CPU cycles used: 2303
     sysworkq            : STACK: unused 1856 usage 192 / 2048 (9 %); CPU: 0 %
          : Total CPU cycles used: 1
     shell_uart          : STACK: unused 1712 usage 336 / 2048 (16 %); CPU: 0 %
          : Total CPU cycles used: 43
     idle                : STACK: unused 184 usage 72 / 256 (28 %); CPU: 59 %
          : Total CPU cycles used: 838850
     main                : STACK: unused 1248 usage 2784 / 4032 (69 %); CPU: 19 %
          : Total CPU cycles used: 272747
     ISR0                : STACK: unused 856 usage 1192 / 2048 (58 %)
    I: Received command over UART
    W: Device will restart !
    
    uart:~$ E: IPC endpoint bind timed out
    ASSERTION FAIL @ WEST_TOPDIR/zephyr/drivers/ieee802154/ieee802154_nrf5.c:1153
    E: r0/a1:  0x00000004  r1/a2:  0x00000481  r2/a3:  0x2000d018
    E: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00023fa3
    E:  xpsr:  0x69100000
    E: s[ 0]:  0xffffffff  s[ 1]:  0x000134e1  s[ 2]:  0x00000000  s[ 3]:  0x000388c3
    E: s[ 4]:  0x00008000  s[ 5]:  0x000af25c  s[ 6]:  0x000a6b84  s[ 7]:  0x000af398
    E: s[ 8]:  0x00000000  s[ 9]:  0x000728a3  s[10]:  0x00008000  s[11]:  0x20026344
    E: s[12]:  0x20002f40  s[13]:  0x00023f99  s[14]:  0x000aca20  s[15]:  0x000af25c
    E: fpscr:  0x00000481
    E: Faulting instruction address (r15/pc): 0x0007288e
    E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
    E: Current thread: 0x2000d018 (main)
    E: Halting system
    

    To reproduce, I think you can use the Matter light bulb sample on an nRF5340 DK, commission it into a Matter over Thread fabric, then trigger the reset from a shell command or similar.
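
    A minimal sketch of such a trigger is below. app_restart() is a hypothetical helper name; the only call taken from my application is sys_reboot(SYS_REBOOT_COLD), which needs CONFIG_REBOOT=y. The #else branch holds host stand-ins that I added so the snippet also compiles off-target; they are not part of Zephyr.

```c
#include <stdio.h>

#ifdef __ZEPHYR__
#include <zephyr/kernel.h>
#include <zephyr/sys/reboot.h>
#else
/* Host stand-ins (assumptions) so the sketch builds outside Zephyr. */
#define SYS_REBOOT_COLD 1
int host_reboot_type = -1; /* records the requested reboot type */

static void sys_reboot(int type)
{
    host_reboot_type = type;
}
#endif

/* Hypothetical helper: announce the restart, then cold-reboot the MCU. */
void app_restart(void)
{
    printf("Device will restart !\n");
    sys_reboot(SYS_REBOOT_COLD);
}
```

    Calling app_restart() from any command handler while the device is commissioned should be enough to hit the assert.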
