Logging thread stack overflow

Hello,

I am having an issue with Zephyr, it constantly reboots the system because of stack overflow.

00> [00:05:36.447,265] <err> os: r0/a1:  0x00000002  r1/a2:  0x20004d60  r2/a3:  0xf0f0f0f0
00> [00:05:36.447,296] <err> os: r3/a4:  0x2000e700 r12/ip:  0x2000a5e4 r14/lr:  0x00058045[0m
00> [00:05:36.447,326] <err> os:  xpsr:  0x41070000
00> [00:05:36.447,326] <err> os: r4/v1:  0x00000000  r5/v2:  0xffffffff  r6/v3:  0xffffffff
00> [00:05:36.447,357] <err> os: r7/v4:  0x00000000  r8/v5:  0x00000001  r9/v6:  0x00000000
00> [00:05:36.447,357] <err> os: r10/v7: 0x00000001  r11/v8: 0x20001a00    psp:  0x20026d28
00> [00:05:36.447,357] <err> os: EXC_RETURN: 0x0
00> [00:05:36.447,387] <err> os: Faulting instruction address (r15/pc): 0x00046704
00> [00:05:36.447,418] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
00> [00:05:36.447,418] <err> os: Current thread: 0x20004d60 (logging)
00> [00:05:36.780,670] <err> fatal_error: Resetting system

....

[03:04:16.903,503] <err> os: r0/a1:  0x00000002  r1/a2:  0xe000ed04  r2/a3:  0xf0f0f0f0
00> [03:04:16.903,533] <err> os: r3/a4:  0x2000e700 r12/ip:  0x00000000 r14/lr:  0x0003235d
00> [03:04:16.903,533] <err> os:  xpsr:  0x410f0021
00> [03:04:16.903,564] <err> os: r4/v1:  0x20003890  r5/v2:  0x0005a3a3  r6/v3:  0x00000007
00> [03:04:16.903,564] <err> os: r7/v4:  0x000000c6  r8/v5:  0x20003888  r9/v6:  0x00030f79
00> [03:04:16.903,594] <err> os: r10/v7: 0x00058b54  r11/v8: 0x00000000    psp:  0x20026c10
00> [03:04:16.903,594] <err> os: EXC_RETURN: 0x0
00> [03:04:16.903,594] <err> os: Faulting instruction address (r15/pc): 0x00046704
00> [03:04:16.903,625] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
00> [03:04:16.903,656] <err> os: Fault during interrupt handling
00> 
00> [03:04:16.903,656] <err> os: Current thread: 0x20004d60 (logging)
00> [03:04:17.257,568] <err> fatal_error: Resetting system[0m

Sometimes in these faults, it says "Fault during interrupt handling".

The code line for the faulting address is:

arm-none-eabi-addr2line.exe -e build\zephyr\zephyr.elf 0x00046704
ncs/v2.1.0/zephyr/kernel/thread.c:380

It is actually in reference to the SENTINEL mechanism:

I am using the following CONFIG options for logging, and even with absurd values of stack it still fails:

# -- Lora
# -- BLE
# -- MCUBoot
...

# -- Logs

CONFIG_UART_CONSOLE=n
CONFIG_USE_SEGGER_RTT=y
CONFIG_RTT_CONSOLE=y
CONFIG_LOG_BACKEND_RTT=y

CONFIG_LOG=y
CONFIG_LOG_PROCESS_THREAD=y
CONFIG_LOG_DEFAULT_LEVEL=4
CONFIG_LOG_BUFFER_SIZE=8192

CONFIG_PRINTK=n
CONFIG_LOG_PRINTK=n
CONFIG_NEWLIB_LIBC=y
CONFIG_CBPRINTF_FULL_INTEGRAL=y
CONFIG_CBPRINTF_FP_SUPPORT=y
CONFIG_NEWLIB_LIBC_FLOAT_SCANF=y
CONFIG_NEWLIB_LIBC_FLOAT_PRINTF=y
CONFIG_NEWLIB_LIBC_NANO=n

# -- Optimizations
CONFIG_NO_OPTIMIZATIONS=n

# -- Sentinel and stacks
CONFIG_STACK_SENTINEL=y

CONFIG_THREAD_STACK_INFO=y
CONFIG_THREAD_MONITOR=y
CONFIG_INIT_STACKS=y
CONFIG_THREAD_NAME=y
CONFIG_EXTRA_EXCEPTION_INFO=y
CONFIG_GPIO_LOG_LEVEL_DBG=y
CONFIG_LORA_LOG_LEVEL_DBG=y

CONFIG_BT_RX_STACK_SIZE=4096
CONFIG_IDLE_STACK_SIZE=4096
CONFIG_PRIVILEGED_STACK_SIZE=4096
CONFIG_BT_CTLR_RX_PRIO_STACK_SIZE=1024
CONFIG_BT_HCI_TX_STACK_SIZE=4096
CONFIG_LOG_PROCESS_THREAD_STACK_SIZE=100000
CONFIG_LOG_BUFFER_SIZE=8192
CONFIG_MAIN_STACK_SIZE=4096
CONFIG_ISR_STACK_SIZE=32768
CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=4096

# -- Heap
CONFIG_HEAP_MEM_POOL_SIZE=15240

Is the problem really the logging thread? Any suggestions?

Parents
  • I can see that the "logging" threads stack size is set by CONFIG_LOG_PROCESS_THREAD_STACK_SIZE

    https://github.com/nrfconnect/sdk-zephyr/blob/v3.1.99-ncs1/subsys/logging/log_core.c#L688

    But you've already set that to a really high value, so I'm not sure what the issue might be

    • Could you check the file <sample>/build/zephyr/.config and search for CONFIG_LOG_PROCESS_THREAD_STACK_SIZE and see if it's actually set
    • Do you have any child images in your application? For example mcuboot. In that case, are you sure that the log is not coming from the child image? Then you need to set CONFIG_LOG_PROCESS_THREAD_STACK_SIZE in the mcuboot prj.conf file, and confirm the value using the file <sample>/build/mcuboot/zephyr/.config
    • Would you be able to share a minimal reproducible sample, so I can test it on my end?

    Best regards,

    Simon

  • CONFIG_LOG_PROCESS_THREAD_STACK_SIZE = 65536 (I've changed it, and checked it in the build/zephyr/.config file, it keeps the value that I set).

    * Child images - Yes, I have MCUBoot enabled in my Application prj.conf file. I can see in the build/mcuboot/zephyr/.config file that CONFIG_MCUBOOT_LOG_THREAD_STACK_SIZE=768. Though there is no CONFIG_LOG_PROCESS_THREAD_STACK_SIZE. I haven't seen any log in the RTT viewer related to the MCUBoot, so I am not sure if MCUBoot is really logging anything during application runtime. Anyway, how could I change CONFIG_MCUBOOT_LOG_THREAD_STACK_SIZE?

    * I can see that MCUBoot has CONFIG_PRINTK enabled, might that be an issue? I've seen some issues in Zephyr's Github repository that it's not good to use both printk and LOG. But since I am new to Zephyr, I can't really confirm this.

    * Minimal reproducible sample - I'll try to prepare this. In the meantime, any other suggestions / tryouts?

  • Hello and sorry for the late reply,

    I'm not sure what the issue might be then.

    I've asked internally

    Best regards,

    Simon

  • I got some useful input internally

    When the sentinel check fails, it means that the sentinel (magic) value is overwritten, it is not the sentinel function itself that is the issue. It does not seem like it is the logging thread, so maybe some other thread is writing out of bounds. Try placing a data write breakpoint on the sentinel location (stack start) and check what code overwrites it.

    You can also use THREAD_ANALYZER and other related Kconfig to auto print the stack usage every few seconds, and if any thread is missing from its list, the thread has likely overflowed by a huge amount and corrupting many thread stacks.

    Best regards,

    Simon

Reply
  • I got some useful input internally

    When the sentinel check fails, it means that the sentinel (magic) value is overwritten, it is not the sentinel function itself that is the issue. It does not seem like it is the logging thread, so maybe some other thread is writing out of bounds. Try placing a data write breakpoint on the sentinel location (stack start) and check what code overwrites it.

    You can also use THREAD_ANALYZER and other related Kconfig to auto print the stack usage every few seconds, and if any thread is missing from its list, the thread has likely overflowed by a huge amount and corrupting many thread stacks.

    Best regards,

    Simon

Children
  • Hello @Simon, actually I got to solve that problem by trying and changing some stack sizes within tasks that are intrinsic to the system, and also my tasks' stack sizes.

    Apparently, it is really necessary during the investigation to compile the code with CONFIG_NO_OPTIMIZATIONS=y so that the system can tell you correctly which thread and which line were executed right before going into fault.

    I have miniaturised my application by removing some tasks, then, testing and enabling each task at a time, and retesting, and so on.

    Also, apparently, you cannot enable STACK_SENTINEL in the same time as you have MPU_STACK_GUARD enabled as well (I have seen this kind of argument on github, though I cannot recall where I have seen it on github). Since the default config for nRF52840 is that it uses hardware stack guard (MPU_STACK_GUARD), I have not used STACK_SENTINEL, neither STACK_CANARIES.

    I also "disabled" the LOGGING Thread by setting CONFIG_LOG_MODE_IMMEDIATE=y. I find this a better use of logging. This forces the system to do the logging immediately, inside the own task which called the logging function, though, as a side effect causing the need of more stack size for each task which uses logging.

    The main configuration changes regarding FAULTS, HEAP size, STACK sizes, Thread Analyzer I have used for solving my problem is below:

    # ----------------------------------------------------
    # HEAP Mem, Stack, Falhas, Thread Analyzer
    # ----------------------------------------------------
    CONFIG_HEAP_MEM_POOL_SIZE=32768
    
    CONFIG_REBOOT=y
    # CONFIG_RESET_ON_FATAL_ERROR=y
    CONFIG_FATAL_ERROR_LOG_LEVEL_DBG=y
    CONFIG_BT_HCI_VS_FATAL_ERROR=y
    
    
    CONFIG_EXTRA_EXCEPTION_INFO=y
    
    CONFIG_THREAD_NAME=y
    CONFIG_THREAD_ANALYZER=y
    CONFIG_THREAD_STACK_INFO=y
    CONFIG_THREAD_MONITOR=y
    CONFIG_THREAD_ANALYZER_ISR_STACK_USAGE=y
    CONFIG_THREAD_ANALYZER_AUTO_INTERVAL=5
    # CONFIG_THREAD_ANALYZER_USE_PRINTK=n
    CONFIG_THREAD_ANALYZER_AUTO=y
    CONFIG_THREAD_ANALYZER_USE_LOG=y
    CONFIG_THREAD_ANALYZER_AUTO_STACK_SIZE=65536
    
    CONFIG_INIT_STACKS=y
    CONFIG_STACK_USAGE=y
    # Aparentemente stack sentinel e mpu stack guard nao podem ser usados simultaneamente (fonte: github, investigar mais a fundo)
    #CONFIG_MPU_STACK_GUARD=y 
    #CONFIG_STACK_SENTINEL=y
    
    
    # --- Tamanhos Stacks ---
    CONFIG_IDLE_STACK_SIZE=4096
    CONFIG_MAIN_STACK_SIZE=8196
    CONFIG_ISR_STACK_SIZE=8196
    CONFIG_MPSL_WORK_STACK_SIZE=8196
    CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=8196
    
    
    CONFIG_BT_RX_STACK_SIZE=4096
    CONFIG_BT_HCI_TX_STACK_SIZE=4096
    
    CONFIG_PRIVILEGED_STACK_SIZE=8196
    

Related