USAGE FAULT Stack Overflow when using NCS 3.1.1

I'm currently trying to integrate Memfault into my project, which uses the nRF9151 DK. Memfault was working fine until I turned on these configs:

# Collect LTE Metrics
CONFIG_MEMFAULT_NCS_LTE_METRICS=y
CONFIG_LTE_LC_EDRX_MODULE=y
CONFIG_LTE_LC_PSM_MODULE=y
CONFIG_MODEM_INFO=y

Since then I keep getting this Usage Fault error. I'm not sure how to proceed with debugging. I've increased the stack size of the system work queue, but that doesn't seem to be the cause.

CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=4096

[00:01:50.408,813] <err> os: ***** USAGE FAULT *****
[00:01:50.414,489] <err> os:   Stack overflow (context area not valid)
[00:01:50.421,752] <err> os: r0/a1:  0x2000ca74  r1/a2:  0x20013fa8  r2/a3:  0x0003a024
[00:01:50.430,450] <err> os: r3/a4:  0x20014048 r12/ip:  0x000000df r14/lr:  0x00029b3b
[00:01:50.439,147] <err> os:  xpsr:  0x61000000
[00:01:50.444,396] <err> os: s[ 0]:  0x00000200  s[ 1]:  0x0002b089  s[ 2]:  0x00000000  s[ 3]:  0xe000ed00
[00:01:50.454,864] <err> os: s[ 4]:  0x2000e738  s[ 5]:  0x00000000  s[ 6]:  0x000000de  s[ 7]:  0x00027d31
[00:01:50.465,332] <err> os: s[ 8]:  0x000279d8  s[ 9]:  0x61000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:01:50.475,799] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:01:50.486,267] <err> os: fpscr:  0x00000000
[00:01:50.491,485] <err> os: r4/v1:  0x20013fa8  r5/v2:  0x0003a024  r6/v3:  0x20014048
[00:01:50.500,213] <err> os: r7/v4:  0x20012828  r8/v5:  0x0003a024  r9/v6:  0xffffffff
[00:01:50.508,941] <err> os: r10/v7: 0x00000000  r11/v8: 0x00000000    psp:  0x20013f20
[00:01:50.517,669] <err> os: EXC_RETURN: 0xffffffac
[00:01:50.523,254] <err> os: Faulting instruction address (r15/pc): 0x0002a3b6
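
As an aside on reading the dump above: the EXC_RETURN value encodes which stack and frame type were in use when the fault was taken. The sketch below decodes it, assuming the Armv8-M bit layout (the nRF9151 application core is a Cortex-M33). The function and field names are illustrative and not part of Zephyr or any Nordic tool.

```python
def decode_exc_return(value):
    """Decode an Armv8-M EXC_RETURN value into its main flag bits.

    Assumed layout: bit 0 = ES, bit 2 = SPSEL, bit 3 = Mode,
    bit 4 = FType, bit 5 = DCRS, bit 6 = S, bits 31:24 = 0xFF prefix.
    """
    if (value >> 24) != 0xFF:
        raise ValueError("not an EXC_RETURN value (prefix is not 0xFF)")
    return {
        "secure_exception": bool(value & (1 << 0)),   # ES
        "uses_psp": bool(value & (1 << 2)),           # SPSEL: 1 = process stack
        "thread_mode": bool(value & (1 << 3)),        # Mode: 1 = Thread mode
        "fp_frame": not (value & (1 << 4)),           # FType: 0 = extended FP frame
        "default_callee_stacking": bool(value & (1 << 5)),  # DCRS
        "secure_stack": bool(value & (1 << 6)),       # S
    }


# Value taken from the crash log above.
info = decode_exc_return(0xFFFFFFAC)
for field, flag in info.items():
    print(f"{field}: {flag}")
```

For 0xffffffac this reports Thread mode on the process stack (PSP) with an extended floating-point frame, which matches the s[0]..s[15] and fpscr lines in the dump and points at a thread stack rather than the interrupt stack.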

I tried the Zephyr Thread Analyzer (CONFIG_THREAD_ANALYZER_AUTO=y), and everything seems fine:

[00:01:40.454,498] <inf> thread_analyzer: Thread analyze:
[00:01:40.454,742] <inf> thread_analyzer:  thread_analyzer     : STACK: unused 1456 usage 592 / 2048 (28 %); CPU: 0 %
[00:01:40.454,772] <inf> thread_analyzer:                      : Total CPU cycles used: 862
[00:01:40.454,925] <inf> thread_analyzer:  date_time_work_q    : STACK: unused 728 usage 552 / 1280 (43 %); CPU: 0 %
[00:01:40.454,956] <inf> thread_analyzer:                      : Total CPU cycles used: 1143
[00:01:40.455,230] <inf> thread_analyzer:  mflt_http           : STACK: unused 1760 usage 288 / 2048 (14 %); CPU: 0 %
[00:01:40.455,261] <inf> thread_analyzer:                      : Total CPU cycles used: 1
[00:01:40.455,413] <inf> thread_analyzer:  work_q              : STACK: unused 736 usage 288 / 1024 (28 %); CPU: 0 %
[00:01:40.455,444] <inf> thread_analyzer:                      : Total CPU cycles used: 2
[00:01:40.455,871] <inf> thread_analyzer:  sysworkq            : STACK: unused 2852 usage 1244 / 4096 (30 %); CPU: 0 %
[00:01:40.455,902] <inf> thread_analyzer:                      : Total CPU cycles used: 532
[00:01:40.456,146] <inf> thread_analyzer:  logging             : STACK: unused 1412 usage 636 / 2048 (31 %); CPU: 2 %
[00:01:40.456,176] <inf> thread_analyzer:                      : Total CPU cycles used: 86038
[00:01:40.456,268] <inf> thread_analyzer:  idle                : STACK: unused 272 usage 48 / 320 (15 %); CPU: 96 %
[00:01:40.456,298] <inf> thread_analyzer:                      : Total CPU cycles used: 3168420
[00:01:40.456,726] <inf> thread_analyzer:  main                : STACK: unused 2920 usage 1176 / 4096 (28 %); CPU: 0 %
[00:01:40.456,756] <inf> thread_analyzer:                      : Total CPU cycles used: 14547
[00:01:40.457,000] <inf> thread_analyzer:  ISR0                : STACK: unused 1656 usage 392 / 2048 (19 %)
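
To compare headroom across threads, the report above can be parsed with a short script. This is a minimal sketch, assuming the log format shown here; the regular expression and the function name parse_stack_report are illustrative only. Keep in mind that the analyzer reports the high-water mark up to the moment it ran, so a code path that has not executed yet is not reflected in these numbers.

```python
import re

# Three lines copied from the report above; a full log could be read from a file.
LOG = """\
[00:01:40.454,925] <inf> thread_analyzer:  date_time_work_q    : STACK: unused 728 usage 552 / 1280 (43 %); CPU: 0 %
[00:01:40.455,413] <inf> thread_analyzer:  work_q              : STACK: unused 736 usage 288 / 1024 (28 %); CPU: 0 %
[00:01:40.455,871] <inf> thread_analyzer:  sysworkq            : STACK: unused 2852 usage 1244 / 4096 (30 %); CPU: 0 %
"""

# Captures: thread name, unused bytes, used bytes, total stack size.
STACK_LINE = re.compile(
    r"thread_analyzer:\s+(\S+)\s*: STACK: unused (\d+) usage (\d+) / (\d+)"
)


def parse_stack_report(text):
    """Return (thread, unused, used, size) tuples, least headroom first."""
    rows = []
    for line in text.splitlines():
        match = STACK_LINE.search(line)
        if match:
            name = match.group(1)
            unused, used, size = (int(match.group(i)) for i in (2, 3, 4))
            rows.append((name, unused, used, size))
    return sorted(rows, key=lambda row: row[1])


report = parse_stack_report(LOG)
for name, unused, used, size in report:
    print(f"{name:20s} unused={unused:5d} used={used:5d} size={size:5d}")
```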

For more context, I'm following the nRF91 simple tracker solution, but I use MQTT instead of CoAP, as covered in Lesson 4 of the same course. The way I switch between LTE and GNSS is the same as in the exercise flowchart (attached below).

I tried using addr2line following the lesson here:

arm-zephyr-eabi-addr2line -e build/nRF9151_lte/zephyr/zephyr.elf 0x0002a3b6

But the output shows:

??:?

I'm not sure where to debug or how to approach debugging this. Any help is appreciated. Thank you.

  • Hello,

    It is possible that the thread analyzer is not showing the thread that had the stack overflow. Are you able to see where the call was made from if you look up the LR address instead (0x00029b3b minus 1 to clear the Thumb bit)?

    $ arm-zephyr-eabi-addr2line -e build/nRF9151_lte/zephyr/zephyr.elf  0x00029b3a 

    I also see that the crash log includes the FPU registers. Did you enable CONFIG_FPU_SHARING in your project?

    Thanks,

    Vidar

  • Thank you for your response.

    Passing 0x00029b3b to addr2line also returns ??:?

    I tried other addresses in the fault message and it either returns ??:? or ??:0.

    I have CONFIG_FPU=y because I need it to handle GNSS latitude and longitude. I checked the .config file in the build folder, and it has both CONFIG_FPU=y and CONFIG_FPU_SHARING=y.

    I decided to enable Memfault shell to manually export the coredump. Here is what I found:

    The stack trace is essentially:

    __ssvfscanf_r
    _vsscanf_r
    nrf_modem_at_scanf
    modem_info_get_rsrp (nrf/lib/modem_info/modem_info.c:882)
    modem_params_get (nrf/modules/memfault-firmware-sdk/memfault_lte_metrics.c:76)
    lte_handler (nrf/modules/memfault-firmware-sdk/memfault_lte_metrics.c:119)
    event_handler_list_dispatch (nrf/lib/lte_link_control/common/event_handler_list.c:113)

    And this is the log before the crash:

    <inf> mflt: Reset Reason, RESETREAS=0x10001
    <inf> mflt: Reset Causes:
    <inf> mflt:  Pin Reset
    <inf> mflt: GNU Build ID: b370b80ce32a5558123521aba7473e7ee940a544
    <inf> mflt: Periodic background upload scheduled - initial delay=63s period=120s
    <inf> nRF9151_LTE: Initializing modem library
    <inf> nRF9151_LTE: Comparing credentials: Match
    <inf> nRF9151_LTE: Connecting to LTE network
    <inf> nRF9151_LTE: LTE cell changed: Cell ID: <id>, Tracking area: <area-id>
    <inf> nRF9151_LTE: RRC mode: Connected
    <inf> nRF9151_LTE: Network registration status: Connected - roaming
    <inf> nRF9151_LTE: Connected to LTE network

    After connecting to LTE, my loop disables it to switch to GNSS, similar to this sample code. Could that be the cause? For example, if Memfault is collecting LTE metrics and modem_info.c:882 issues an AT command, could that cause the modem to crash or hang?

  • I'm surprised that addr2line failed to return a valid file and line number. I'm not sure what the reason for that could be. But since the crash log reports a stack overflow, it would be good to first determine which stack it occurred in. I would also try doubling the main thread stack to see if it makes any difference (CONFIG_MAIN_STACK_SIZE=8192).

    The crash log should end with a line that tells which thread the fault occurred in, for example:

    E: Current thread: 0x200026a8 (unknown)

    If you compile with CONFIG_THREAD_INFO=y, the thread name will also be included. If this line is not printed for some reason, you can look up the PSP address in your zephyr.map file to see which address range the process stack pointer was in when the stack guard got triggered.

  • Thank you so much! I looked into the zephyr.map and zephyr_final.map and the PSP address falls into the range for the work_q_stack.

    .noinit."WEST_TOPDIR/nrf/lib/lte_link_control/common/work_q.c".0
                    0x0000000020013d10      0x400 modules/nrf/lib/lte_link_control/lib..__nrf__lib__lte_link_control.a(work_q.c.obj)
                    0x0000000020013d10                work_q_stack

    I increased the stack size with CONFIG_LTE_LC_WORKQUEUE_STACK_SIZE=2048.

    I'm still not sure why the crash log and addr2line didn't print this info, but everything works now! Thanks again!

  • Excellent! Thank you for reporting back on what the issue was.
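
The map-file lookup that located the overflowing stack can be sketched as a small script. This is a rough illustration, assuming the GNU ld map layout shown in the excerpt above (an input-section line carrying address and size, followed by a symbol line). The regular expressions and the helper name find_symbol_for_address are hypothetical, and a real map file may need more robust parsing.

```python
import re

# Excerpt copied from the zephyr.map lines quoted in the thread.
MAP_EXCERPT = '''\
 .noinit."WEST_TOPDIR/nrf/lib/lte_link_control/common/work_q.c".0
                0x0000000020013d10      0x400 modules/nrf/lib/lte_link_control/lib..__nrf__lib__lte_link_control.a(work_q.c.obj)
                0x0000000020013d10                work_q_stack
'''

# Input-section line: "<address> <size> <object file>"
REGION = re.compile(r"^\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+\S+")
# Symbol line: "<address> <symbol name>"
SYMBOL = re.compile(r"^\s+(0x[0-9a-fA-F]+)\s+([A-Za-z_]\w*)\s*$")


def find_symbol_for_address(map_text, address):
    """Return (symbol, start, end) for the region containing address, or None."""
    size_at = {}  # start address -> size of the input section placed there
    for line in map_text.splitlines():
        region = REGION.match(line)
        if region:
            size_at[int(region.group(1), 16)] = int(region.group(2), 16)
            continue
        symbol = SYMBOL.match(line)
        if symbol:
            start = int(symbol.group(1), 16)
            size = size_at.get(start)
            if size is not None and start <= address < start + size:
                return symbol.group(2), start, start + size
    return None


PSP = 0x20013F20  # "psp" value from the crash log
hit = find_symbol_for_address(MAP_EXCERPT, PSP)
if hit:
    name, start, end = hit
    print(f"0x{PSP:08x} is inside {name} [0x{start:08x}, 0x{end:08x})")
```

With the PSP from the crash log this resolves to work_q_stack, the 0x400-byte stack in lte_link_control/common/work_q.c, which is the one resized by CONFIG_LTE_LC_WORKQUEUE_STACK_SIZE.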
