Stack overflow

Hello,

I had been running this program without any problems. Then I added some code to read and write flash using NVS, and the error shown below appeared.

I am running a single-threaded application (main() only) in Segger Embedded Studio v5.34a.

*** Booting Zephyr OS build v2.4.99-ncs1-rc1 ***
[00:00:13.434,600] <err> os: ***** USAGE FAULT *****
[00:00:13.434,600] <err> os: Stack overflow (context area not valid)
[00:00:13.434,600] <err> os: r0/a1: 0x00000000 r1/a2: 0x0000cbfd r2/a3: 0x00000000
[00:00:13.434,600] <err> os: r3/a4: 0x0000cbfd r12/ip: 0x207bc4ae r14/lr: 0x77fadffe
[00:00:13.434,600] <err> os: xpsr: 0xfdd9ca00
[00:00:13.434,600] <err> os: Faulting instruction address (r15/pc): 0x05221210
[00:00:13.434,631] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:13.434,631] <err> os: Current thread: 0x20000668 (unknown)
[00:00:13.696,472] <err> fatal_error: Resetting system

If the solution is to increase the stack size, how do I do it?

Can someone please help?

Kind regards

Mohamed

Parents
  • If the function that caused the stack overflow is called from main, increasing CONFIG_MAIN_STACK_SIZE would be the solution then.
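
    A minimal sketch of that change, assuming the application's configuration lives in prj.conf (or an overlay file). The value 4096 is only an illustration; the right size depends on what main() actually calls.

    ```
    # prj.conf (illustrative value, not a recommendation)
    CONFIG_MAIN_STACK_SIZE=4096
    ```

    After rebuilding, the effective value can be checked in build/zephyr/.config.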

    Best regards,

    Simon

  • Thank you Simon.

    What is the solution if the function that is causing the stack overflow is not called directly from main(), but a few levels down the function call tree?

    Kind regards

    Mohamed

  • Thank you Simon.

    I think a change to any configuration setting (prj.conf, overlay, Kconfig...) requires re-opening the nRF Connect SDK project via File... I did not think SES had to be restarted.

    Anyway, the problem has disappeared for now. Let's hope it will not rear its ugly head again.

    Is there anything in the map file that could give me pointers when the stack size is not big enough?

    Kind regards

    Mohamed

  • Learner said:
    Is there anything in the map file that could give me pointers when the stack size is not big enough?

    I'm not totally sure about this. You could search for the faulting address in the map file to figure out what is causing the stack overflow, set a breakpoint at that location to see which thread it is running in, and then increase the stack size of that thread.

    The Thread analyzer is probably the best way of analyzing the stack usage of the threads.
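
    As a rough sketch, the Thread analyzer can be enabled with a handful of Kconfig options. The option names below are believed to match Zephyr's thread analyzer module, but the interval value is an arbitrary assumption, so it is worth checking the Kconfig reference for the SDK version in use.

    ```
    # prj.conf: enable the thread analyzer and let it report periodically
    CONFIG_THREAD_ANALYZER=y
    CONFIG_THREAD_ANALYZER_USE_PRINTK=y
    CONFIG_THREAD_ANALYZER_AUTO=y
    # Seconds between reports (assumed value)
    CONFIG_THREAD_ANALYZER_AUTO_INTERVAL=5
    # Show thread names instead of raw addresses
    CONFIG_THREAD_NAME=y
    ```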

    Learner said:
    I think a change to any configuration setting (prj.conf, overlay, Kconfig...) requires re-opening the nRF Connect SDK project via File... I did not think SES had to be restarted.

    Yes, you are correct about this. But reopening SES means that you run File.. again, and that might pick up dts/overlay/Kconfig changes that weren't present before you reopened SES. However, something completely different might have triggered the fault, and let's hope it doesn't show up again.

    Best regards,

    Simon

  • Thank you Simon.

    Yes, the Thread analyzer has proved to be very useful.

    Let's just hope that it will not occur again.

    Kind regards

    Mohamed

  • Hi Simon,

    The stack overflow problem is rearing its ugly head again.

    The fault is occurring in the function k_work_handler_t pid4_tasks( void ), which is set up in main() as follows:

    static struct k_work main_work_pid4;

    void main( void )
    {
           k_work_init( &main_work_pid4, pid4_tasks );
           ...

           while (1)
           {
                ...

                k_work_submit( &main_work_pid4 );

                ...

           }

    }

    The main() stack is configured in the overlay file and so is the THREAD_ANALYZER stack,

    CONFIG_MAIN_STACK_SIZE=4096

    CONFIG_THREAD_ANALYZER_AUTO_STACK_SIZE=2048

    The stack analyzer output is shown below. Although I have configured the main stack to be 4096, the stack analyzer shows stack sizes of 2048, 1024, 320 and 4096. Why is this?

    The stack analyzer is not showing stack usage greater than 4096. In fact, I doubled the size of the main stack but the fault is still occurring. So it could be that the stack overflow is a consequence of this error: "Stacking error (context area might be not valid)".

    I am attaching a picture of the debugger status at the point the fault occurs.

    I did not want to increase 

    Below is a trace capture of the Debug Terminal in SES IDE. 

    Please help.

    Kind regards

    Mohamed

    *** Booting Zephyr OS build v2.4.99-ncs1-rc1 ***
    Thread analyze:
    0x20002780 : STACK: unused 1448 usage 600 / 2048 (29 %); CPU: 0 %
    0x20002fc8 : STACK: unused 1636 usage 412 / 2048 (20 %); CPU: 0 %
    0x20002db8 : STACK: unused 628 usage 396 / 1024 (38 %); CPU: 0 %
    0x20002860 : STACK: unused 348 usage 676 / 1024 (66 %); CPU: 0 %
    0x20002f08 : STACK: unused 288 usage 32 / 320 (10 %); CPU: 0 %
    0x20002e60 : STACK: unused 3508 usage 588 / 4096 (14 %); CPU: 99 %
    Thread analyze:
    0x20002780 : STACK: unused 1208 usage 840 / 2048 (41 %); CPU: 0 %
    0x20002fc8 : STACK: unused 1636 usage 412 / 2048 (20 %); CPU: 0 %
    0x20002db8 : STACK: unused 628 usage 396 / 1024 (38 %); CPU: 0 %
    0x20002860 : STACK: unused 348 usage 676 / 1024 (66 %); CPU: 0 %
    0x20002f08 : STACK: unused 36 usage 284 / 320 (88 %); CPU: 19 %
    0x20002e60 : STACK: unused 3508 usage 588 / 4096 (14 %); CPU: 80 %
    LR1110 Driver Version: v2.0.1
    LR1110 Firmware Version: HW: 22 Type: 01, FW: 03.03
    System Errors = 0x20
    System Errors = 0x0
    Counter = 1
    LR1110 Modem Packet Type = 1
    Counter = 2

    Thread analyze:
    0x20002780 : STACK: unused 1208 usage 840 / 2048 (41 %); CPU: 0 %
    0x20002fc8 : STACK: unused 1636 usage 412 / 2048 (20 %MCU TEMP = -588.24 0.000660 21.95 °C
    ); CPU: 0 %
    --- 8 messages dropped ---
    delta_time_ms = 0
    Thread analyze:
    0x20002780 : STACK: unused 1208 usage 840 / 2048 (41 %); CPU: 0 %
    0x20002fc8 : STACK: unused 1036 usage 1012 / 2048 (49 %); CPU: 42 %
    0x20002db8 : STACK: unused 628 usage 396 / 1024 (38 %); CPU: 0 %
    0x20002860 : STACK: unused 300 usage 724 / 1024 (70 %); CPU: 0 %
    0x20002f08 : STACK: unused 36 usage 284 / 320 (88 %); CPU: 16 %
    0x20002e60 : STACK: unused 2988 usage 1108 / 4096 (27 %); CPU: 41 %
    MCU TEMP = -588.24 0.000660 21.95 °C
    [00:01:03.042,20[00:01:03.042,20delta_time_ms = 0


    [00:01:23.802,246] <err> os: ***** MPU FAULT *****
    [00:01:23.802,276] <err> os: Stacking error (context area might be not valid)
    [00:01:23.802,276] <err> os: Data Access Violation
    [00:01:23.802,307] <err> os: MMFAR Address: 0x200056dc
    [00:01:23.802,307] <err> os: r0/a1: 0x00000000 r1/a2: 0xaaaaaaaa r2/a3: 0xe288c458
    [00:01:23.802,337] <err> os: r3/a4: 0xdafdd530 r12/ip: 0x568e7d0e r14/lr: 0x3c74b9ef
    [00:01:23.802,337] <err> os: xpsr: 0xf48f4a00
    [00:01:23.802,368] <err> os: Faulting instruction address (r15/pc): 0x6052b7d0
    [00:01:23.802,368] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
    [00:01:23.802,398] <err> os: Current thread: 0x20002860 (unknown)
    [00:01:24.103,271] <err> fatal_error: Resetting system


    *** Booting Zephyr OS build v2.4.99-ncs1-rc1 ***
    Thread analyze:
    0x20002780 : STACK: unused 1448 usage 600 / 2048 (29 %); CPU: 0 %
    0x20002fc8 : STACK: unused 1636 usage 412 / 2048 (20 %); CPU: 1 %
    0x20002db8 : STACK: unused 628 usage 396 / 1024 (38 %); CPU: 1 %
    0x20002860 : STACK: unused 348 usage 676 / 1024 (66 %); CPU: 18 %
    0x20002f08 : STACK: unused 288 usage 32 / 320 (10 %); CPU: 0 %
    0x20002e60 : STACK: unused 3508 usage 588 / 4096 (14 %); CPU: 78 %
    Thread analyze:
    0x20002780 : STACK: unused 1208 usage 840 / 2048 (41 %); CPU: 0 %
    0x20002fc8 : STACK: unused 1636 usage 412 / 2048 (20 %); CPU: 0 %
    0x20002db8 : STACK: unused 628 usage 396 / 1024 (38 %); CPU: 0 %
    0x20002860 : STACK: unused 348 usage 676 / 1024 (66 %); CPU: 0 %
    0x20002f08 : STACK: unused 36 usage 284 / 320 (88 %); CPU: 99 %
    0x20002e60 : STACK: unused 3508 usage 588 / 4096 (14 %); CPU: 0 %
    LR1110 Driver Version: v2.0.1
    LR1110 Firmware Version: HW: 22 Type: 01, FW: 03.03
    System Errors = 0x20
    System Errors = 0x0
    Counter = 1


    LR1110 Modem Packet Type = 1
    Counter = 2


    Thread analyze:
    0x20002780 : STACK: unused 1208 usage 840 / 2048 (41 %); CPU: 0 %
    0x20002fc8 : STACK: unused 1636 usage

  • Hi Mohamed

    It may be helpful to set CONFIG_THREAD_NAME=y; then you will see the names of the threads instead of just a hex number.

    It seems that thread 0x20002860 caused the MPU fault ("Current thread: 0x20002860 (unknown)"). I think an MPU fault may be a consequence of a stack overflow, and I can see that the stack size of that thread is 1024 bytes ("0x20002860 : STACK: unused 348 usage 676 / 1024..."). It seems like this is the main thread (based on your findings), so I'm not sure why the size is only 1024 when you set CONFIG_MAIN_STACK_SIZE=4096. You could take a look at the file <sample>/build/zephyr/.config and check what CONFIG_MAIN_STACK_SIZE is set to. Setting CONFIG_THREAD_NAME=y should also help you figure out exactly which thread is causing the issue.
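
    To make reading the analyzer lines less error-prone, here is a minimal host-side sketch in plain C that parses one line of the output quoted above and computes the usage percentage. The struct and function names (stack_report, parse_stack_line, stack_usage_percent) are hypothetical helpers invented for this illustration and are not part of Zephyr. The format string assumes the exact line layout seen in this thread's log.

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* One parsed line of thread analyzer output (hypothetical helper type). */
    struct stack_report {
        unsigned long thread; /* hex value printed at the start of the line */
        unsigned int unused;  /* bytes never touched */
        unsigned int used;    /* high-water mark in bytes */
        unsigned int size;    /* total stack size in bytes */
    };

    /*
     * Parse a line such as
     *   "0x20002860 : STACK: unused 348 usage 676 / 1024 (66 %); CPU: 0 %"
     * Returns 1 if all four fields were read, 0 otherwise.
     */
    int parse_stack_line(const char *line, struct stack_report *out)
    {
        assert(line != NULL && out != NULL);
        return sscanf(line, " %lx : STACK: unused %u usage %u / %u",
                      &out->thread, &out->unused, &out->used, &out->size) == 4;
    }

    /* Whole-number percentage of the stack in use, rounded down. */
    unsigned int stack_usage_percent(const struct stack_report *r)
    {
        return r->size ? (r->used * 100u) / r->size : 0u;
    }
    ```

    Feeding it the line for thread 0x20002860 yields 676 of 1024 bytes, which is 66 %, the same figure the analyzer prints. With CONFIG_THREAD_NAME=y the first field may be a name rather than a hex address, in which case the format string would need adjusting.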

    If this doesn't help, could you upload the sample in zipped format? I'll take a look at it.

    Best regards,

    Simon

Children
  • Hi Simon,

    You could also set CONFIG_THREAD_NAME=y,

    Thank you. I will add the config line above in prj.conf.

    I did a quick search for the string "STACK" in .config and found the following occurrences.

    CONFIG_MAIN_STACK_SIZE=4096
    CONFIG_PRIVILEGED_STACK_SIZE=1024
    CONFIG_BT_HCI_TX_STACK_SIZE=1536
    CONFIG_BT_RX_STACK_SIZE=1024
    CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=2048
    CONFIG_SDC_RX_STACK_SIZE=1024
    CONFIG_MPSL_SIGNAL_STACK_SIZE=1024
    CONFIG_STACK_ALIGN_DOUBLE_WORD=y
    CONFIG_ARM_STACK_PROTECTION=y
    CONFIG_MPU_STACK_GUARD=y
    CONFIG_IDLE_STACK_SIZE=320
    CONFIG_ISR_STACK_SIZE=2048
    CONFIG_TEST_EXTRA_STACKSIZE=0
    CONFIG_HW_STACK_PROTECTION=y
    CONFIG_GEN_PRIV_STACKS=y
    # CONFIG_STACK_GROWS_UP is not set
    CONFIG_ARCH_HAS_STACK_PROTECTION=y
    CONFIG_THREAD_STACK_INFO=y
    # CONFIG_INIT_STACKS is not set
    # CONFIG_STACK_CANARIES is not set
    CONFIG_STACK_POINTER_RANDOM=0
    # CONFIG_BT_HCI_TX_STACK_SIZE_WITH_PROMPT is not set
    CONFIG_BT_HCI_ECC_STACK_SIZE=1100
    # CONFIG_STACK_USAGE is not set
    # CONFIG_STACK_SENTINEL is not set
    CONFIG_LOG_PROCESS_THREAD_STACK_SIZE=768

    As you can see, the main stack is indeed 4096, but there are three other stack sizes of 1024. We can ignore CONFIG_BT_RX_STACK_SIZE=1024 because the BT code is disabled. That leaves only these two. Which threads do they specify the stack size for?

    CONFIG_PRIVILEGED_STACK_SIZE=1024

    CONFIG_MPSL_SIGNAL_STACK_SIZE=1024

    The code that is crashing is not a sample example, it is our proprietary code under development. So, I am not allowed to share it.

    Since the fault is occurring in the function k_work_handler_t pid4_tasks( void ), which is set up in main() as follows,

    static struct k_work main_work_pid4;

    void main( void )
    {
           k_work_init( &main_work_pid4, pid4_tasks );
           ...

           while (1)
           {
                ...

                k_work_submit( &main_work_pid4 );

                ...

           }

           ...

    }

    Could it be that the thread where the fault is occurring is the workqueue thread main_work_pid4?

    Where can I find more details about "MPU FAULT", "Stacking error (context area might be not valid)" and "Data Access Violation"?

    I have now disabled the new code that I added recently and the fault has disappeared. So, next I am going to add the new code back in gradually and see when the fault re-appears.

    Kind regards     

    Mohamed

  • Learner said:
    Could it be that the thread where the fault is occurring is the workqueue thread main_work_pid4?

    Yes, of course. I was way too quick to answer you earlier. My apologies for that. If the fault occurred in the work item handler, it is the system work queue thread that is running, so I think CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE is the option to increase.
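
    A minimal sketch of that change in prj.conf. The .config listing earlier in this thread shows the option at 2048; the doubled figure here is only an assumed starting point, not a measured requirement.

    ```
    # prj.conf: give the system work queue thread more stack (illustrative value)
    CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=4096
    ```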

    If increasing that config doesn't help, I'll get back to you tomorrow about how to debug the MPU fault.

  • Hi Simon,

    I rolled back to an old version of the firmware which ran fine without any stack crash. I then gradually started adding back the new code that I had experienced the stack crash with. I have now finished adding all the code (I think) but there is no sign of the stack crash. I am rather puzzled by this because I wanted the crash to occur again so that I could identify the line(s) of code that caused it in the first place.

    I would still like you to send me more information on "MPU FAULT", "Stacking error (context area might be not valid)" and "Data Access Violation", and on how to debug such problems.

    Thank you.

    Kind regards

    Mohamed

    The MPU fault is due to a data access violation. However, judging by what I found when browsing DevZone/Google/Zephyr-GitHub, a stack overflow is almost always the cause of an MPU fault in Zephyr. This is the case for you too, since "Stack overflow" was logged first, followed by the MPU fault. I think the MPU fault happens because the code tries to access an invalid memory address outside the stack.

    To be honest, I don't have much experience debugging hard faults, but if you want to understand this better I would recommend taking a look at the Arm Cortex-M33 documentation.

    Best regards,

    Simon
