Device crash when doing a software reset

I'm developing an application with the nRF Connect SDK v2.5.1. It runs Matter on the nRF5340.

I need to reset the MCU from my app. To do so, I call sys_reboot(SYS_REBOOT_COLD), but this makes the OS crash with the following trace:

uart:~$ E: IPC endpoint bind timed out
ASSERTION FAIL @ WEST_TOPDIR/zephyr/drivers/ieee802154/ieee802154_nrf5.c:1153
E: r0/a1: 0x00000004 r1/a2: 0x00000481 r2/a3: 0x2000ca30
E: r3/a4: 0x00000000 r12/ip: 0x00000000 r14/lr: 0x00023e0f
E: xpsr: 0x69100000
E: s[ 0]: 0xffffffff s[ 1]: 0x000134c5 s[ 2]: 0x00000000 s[ 3]: 0x00038607
E: s[ 4]: 0x00008000 s[ 5]: 0x000aeab4 s[ 6]: 0x000a6548 s[ 7]: 0x000aebf0
E: s[ 8]: 0x00000000 s[ 9]: 0x00072377 s[10]: 0x00008000 s[11]: 0x20027404
E: s[12]: 0x20002f40 s[13]: 0x00023e05 s[14]: 0x000ac36c s[15]: 0x000aeab4
E: fpscr: 0x00000481
E: Faulting instruction address (r15/pc): 0x00072362
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x2000ca30 (main)
E: Halting system
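For reference, the reset is triggered with nothing more than the Zephyr reboot API; a minimal sketch (the function name here is illustrative, not from my actual code):

```c
#include <zephyr/sys/reboot.h>

/* Request a cold reboot of the application core. */
static void app_request_reset(void)
{
    sys_reboot(SYS_REBOOT_COLD); /* does not return */
}
```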

Is there a way to fix this?

Parents
  • Based on the limited info provided, it seems like the main thread might have a stack overflow.

    Try increasing CONFIG_MAIN_STACK_SIZE in your prj.conf and see if the assert goes away.

    You need to understand all the contexts (RTOS threads and interrupts) in your system and get an overview of their memory usage. While prototyping, it is a good idea to enable CONFIG_THREAD_ANALYZER.
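    The suggestions above map to a few Kconfig options in prj.conf; a sketch with example values (the 4096 value is an arbitrary starting point, not a recommendation):

    ```conf
    # Larger main thread stack (example value)
    CONFIG_MAIN_STACK_SIZE=4096

    # Periodically print per-thread stack usage
    CONFIG_THREAD_ANALYZER=y
    CONFIG_THREAD_ANALYZER_AUTO=y
    CONFIG_THREAD_ANALYZER_AUTO_INTERVAL=30
    # Thread names make the analyzer output readable
    CONFIG_THREAD_NAME=y
    ```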

  • I already tried increasing the main stack size and the work queue size, but that doesn't solve the issue.

    If it helps: the issue happens only when the device is commissioned and bonded into a Matter over Thread fabric. If the device is "offline", the reset procedure completes fine.

  • Have you enabled the Thread Analyzer? Have you checked whether any other threads are running close to their stack limits? Can you post your Thread Analyzer output from just before this hard fault happened? If the stacks look good, then we can look past a stack overflow and see what else caused this hard fault.

Children
  • I enabled it now with CONFIG_THREAD_ANALYZER and CONFIG_THREAD_ANALYZER_AUTO.

    Here is the log before it dies:

    spinel_packet_send_thread: STACK: unused 840 usage 184 / 1024 (17 %); CPU: 0 %
    : Total CPU cycles used: 0
    rx_q[0] : STACK: unused 1288 usage 248 / 1536 (16 %); CPU: 0 %
    : Total CPU cycles used: 4
    openthread : STACK: unused 2712 usage 1320 / 4032 (32 %); CPU: 2 %
    : Total CPU cycles used: 987
    ot_radio_workq : STACK: unused 208 usage 816 / 1024 (79 %); CPU: 0 %
    : Total CPU cycles used: 122
    nrf5_rx : STACK: unused 272 usage 688 / 960 (71 %); CPU: 0 %
    : Total CPU cycles used: 27
    0x20006a88 : STACK: unused 468 usage 492 / 960 (51 %); CPU: 1 %
    : Total CPU cycles used: 413
    sysworkq : STACK: unused 1856 usage 192 / 2048 (9 %); CPU: 0 %
    : Total CPU cycles used: 0

  • The nrf5_rx stack size (CONFIG_IEEE802154_NRF5_RX_STACK_SIZE) and the ot_radio_workq stack size (CONFIG_OPENTHREAD_RADIO_WORKQUEUE_STACK_SIZE) seem a bit suspicious as well. Can you increase those too and see if the behavior is the same? If yes, can you give steps to reproduce?
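    For reference, the two options above can be raised in prj.conf; a sketch with example values (roughly double the sizes seen in the earlier log, chosen only for the experiment):

    ```conf
    # IEEE 802.15.4 nRF5 driver RX thread stack
    CONFIG_IEEE802154_NRF5_RX_STACK_SIZE=2048
    # OpenThread radio work queue stack
    CONFIG_OPENTHREAD_RADIO_WORKQUEUE_STACK_SIZE=2048
    ```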

  • Those changes don't improve the result. Here is the log with thread analysis a few seconds before the crash:

    uart:~$ Thread analyze:
    CHIP : STACK: unused 4360 usage 1720 / 6080 (28 %); CPU: 1 %
    : Total CPU cycles used: 16408
    BT RX : STACK: unused 1008 usage 192 / 1200 (16 %); CPU: 0 %
    : Total CPU cycles used: 0
    BT TX : STACK: unused 696 usage 328 / 1024 (32 %); CPU: 0 %
    : Total CPU cycles used: 54
    thread_command : STACK: unused 816 usage 208 / 1024 (20 %); CPU: 18 %
    : Total CPU cycles used: 256407
    thread_analyzer : STACK: unused 544 usage 480 / 1024 (46 %); CPU: 0 %
    : Total CPU cycles used: 6329
    spinel_packet_send_thread: STACK: unused 840 usage 184 / 1024 (17 %); CPU: 0 %
    : Total CPU cycles used: 0
    rx_q[0] : STACK: unused 1288 usage 248 / 1536 (16 %); CPU: 0 %
    : Total CPU cycles used: 9
    openthread : STACK: unused 2032 usage 2000 / 4032 (49 %); CPU: 0 %
    : Total CPU cycles used: 4657
    ot_radio_workq : STACK: unused 1168 usage 816 / 1984 (41 %); CPU: 0 %
    : Total CPU cycles used: 1553
    nrf5_rx : STACK: unused 1296 usage 688 / 1984 (34 %); CPU: 0 %
    : Total CPU cycles used: 39

    I think that to reproduce it, you can flash the Matter light bulb sample on an nRF5340 DK, commission it into a Matter over Thread fabric, then trigger the reset from a command or similar.

  • This could be related to this and this errata. It is possible that the app core is resetting but the net core is not, and the serial communication between the app core and the net core is falling apart. Try implementing the workarounds for those errata in the init code and see if that helps.

  • I can try this next week. If you can try to reproduce the issue on your side, that would be great as well.