Device crashes when performing a software reset

I'm developing an application with the nRF Connect SDK v2.5.1. It runs Matter on the nRF5340.

I need to reset the MCU from my application. To do so, I call sys_reboot(SYS_REBOOT_COLD), but this makes the OS crash with the following trace:

uart:~$ E: IPC endpoint bind timed out
ASSERTION FAIL @ WEST_TOPDIR/zephyr/drivers/ieee802154/ieee802154_nrf5.c:1153
E: r0/a1:  0x00000004  r1/a2:  0x00000481  r2/a3:  0x2000ca30
E: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00023e0f
E:  xpsr:  0x69100000
E: s[ 0]:  0xffffffff  s[ 1]:  0x000134c5  s[ 2]:  0x00000000  s[ 3]:  0x00038607
E: s[ 4]:  0x00008000  s[ 5]:  0x000aeab4  s[ 6]:  0x000a6548  s[ 7]:  0x000aebf0
E: s[ 8]:  0x00000000  s[ 9]:  0x00072377  s[10]:  0x00008000  s[11]:  0x20027404
E: s[12]:  0x20002f40  s[13]:  0x00023e05  s[14]:  0x000ac36c  s[15]:  0x000aeab4
E: fpscr:  0x00000481
E: Faulting instruction address (r15/pc): 0x00072362
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x2000ca30 (main)
E: Halting system

Is there a way to fix this?

  • Based on the limited info provided, it seems like the main thread might have a stack overflow.

    Try increasing CONFIG_MAIN_STACK_SIZE in your prj.conf and see if the assert goes away.

    You need to understand all the contexts (RTOS threads and interrupts) in your system and get an overview of their memory usage. While prototyping, it is a good idea to enable the Thread Analyzer (CONFIG_THREAD_ANALYZER).
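
    As a rough sketch, the prj.conf additions for this could look like the fragment below. The stack size and the reporting interval are illustrative values only, not recommendations, and CONFIG_THREAD_NAME, CONFIG_THREAD_ANALYZER_AUTO and CONFIG_THREAD_ANALYZER_AUTO_INTERVAL are additions assumed here to get a readable periodic report.

    ```
    # prj.conf (sketch; the numeric values are assumptions, tune them for your application)

    # Larger stack for the main thread
    CONFIG_MAIN_STACK_SIZE=8192

    # Thread Analyzer: reports stack usage per thread
    CONFIG_THREAD_ANALYZER=y
    # Show thread names instead of raw addresses in the report
    CONFIG_THREAD_NAME=y
    # Print the report periodically from a dedicated thread (interval in seconds)
    CONFIG_THREAD_ANALYZER_AUTO=y
    CONFIG_THREAD_ANALYZER_AUTO_INTERVAL=5
    ```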

  • I already tried to increase the main stack size and the work queue size, but it doesn't solve the issue.

    In case it helps: the issue happens only when the device is commissioned and bonded in a Matter over Thread fabric. If the device is "offline", the reset procedure works fine.

  • Have you enabled the Thread Analyzer? Are any other threads close to their stack limit? Can you post the Thread Analyzer output from just before this hard fault happened? If the stacks look fine, we can rule out a stack overflow and look at what else caused the hard fault.

  • I have now enabled it with CONFIG_THREAD_ANALYZER and CONFIG_THREAD_ANALYZER_AUTO.

    Here is the log before it dies:

     spinel_packet_send_thread: STACK: unused 840 usage 184 / 1024 (17 %); CPU: 0 %
          : Total CPU cycles used: 0
    
     rx_q[0]             : STACK: unused 1288 usage 248 / 1536 (16 %); CPU: 0 %
          : Total CPU cycles used: 4
    
     openthread          : STACK: unused 2712 usage 1320 / 4032 (32 %); CPU: 2 %
          : Total CPU cycles used: 987
    
     ot_radio_workq      : STACK: unused 208 usage 816 / 1024 (79 %); CPU: 0 %
          : Total CPU cycles used: 122
    
     nrf5_rx             : STACK: unused 272 usage 688 / 960 (71 %); CPU: 0 %
          : Total CPU cycles used: 27
    
     0x20006a88          : STACK: unused 468 usage 492 / 960 (51 %); CPU: 1 %
          : Total CPU cycles used: 413
    
     sysworkq            : STACK: unused 1856 usage 192 / 2048 (9 %); CPU: 0 %
          : Total CPU cycles used: 0
    
     shell_uart          : STACK: unused 1776 usage 272 / 2048 (13 %); CPU: 0 %
          : Total CPU cycles used: 3
    
     idle                : STACK: unused 952 usage 72 / 1024 (7 %); CPU: 0 %
          : Total CPU cycles used: 313
    
     main                : STACK: unused 1248 usage 2784 / 4032 (69 %); CPU: 41 %
          : Total CPU cycles used: 15587
    
     ISR0                : STACK: unused 856 usage 1192 / 2048 (58 %)
     
    I: Received command over UART
    W: Device will restart !
    
    uart:~$ E: IPC endpoint bind timed out
    ASSERTION FAIL @ WEST_TOPDIR/zephyr/drivers/ieee802154/ieee802154_nrf5.c:1153
    E: r0/a1:  0x00000004  r1/a2:  0x00000481  r2/a3:  0x2000cc18
    E: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00023fa3
    E:  xpsr:  0x69100000
    E: s[ 0]:  0xffffffff  s[ 1]:  0x000134e1  s[ 2]:  0x00000000  s[ 3]:  0x000388c3
    E: s[ 4]:  0x00008000  s[ 5]:  0x000af25c  s[ 6]:  0x000a6b84  s[ 7]:  0x000af398
    E: s[ 8]:  0x00000000  s[ 9]:  0x000728a3  s[10]:  0x00008000  s[11]:  0x20027a84
    E: s[12]:  0x20002f40  s[13]:  0x00023f99  s[14]:  0x000aca20  s[15]:  0x000af25c
    E: fpscr:  0x00000481
    E: Faulting instruction address (r15/pc): 0x0007288e
    E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
    E: Current thread: 0x2000cc18 (main)
    E: Halting system

  • The nrf5_rx stack size (CONFIG_IEEE802154_NRF5_RX_STACK_SIZE) and the ot_radio_workq stack size (CONFIG_OPENTHREAD_RADIO_WORKQUEUE_STACK_SIZE) look a bit suspicious as well. Can you increase those too and see whether the behavior stays the same? If it does, can you share steps to reproduce?
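
    For reference, raising both sizes in prj.conf might look like the fragment below. The values are arbitrary examples, chosen only to give clear headroom over the 71 % and 79 % usage shown in the log above; adjust them against the Thread Analyzer output.

    ```
    # prj.conf (sketch; sizes are illustrative assumptions)

    # Stack of the nrf5_rx thread in the IEEE 802.15.4 radio driver
    CONFIG_IEEE802154_NRF5_RX_STACK_SIZE=2048
    # Stack of the ot_radio_workq work queue
    CONFIG_OPENTHREAD_RADIO_WORKQUEUE_STACK_SIZE=2048
    ```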
