FreeRTOS: Context switch hardfault

Hello, 

While debugging one of my projects, I ran into frequent device resets while a connection was active. A quick look under the debugger showed the cause: a hard fault.

More detailed debugging helped identify the following points:

  1. The error occurs only in the connected state. This is reasonable, since the processes that receive, process and transmit data are started only when connected;
  2. The error does not depend on how full the stacks are. The stacks were filled with a pattern using a macro. Inspecting the stack contents after the error shows that a stack can be either overflowed or still two-thirds free;
  3. The RTOS task stacks are allocated with a generous margin. Reducing or increasing their size does not affect how often the error occurs;
  4. The project uses several nrf_queue objects. When the error occurs, the contents of the control structure field (p_cb) of one or more of these objects are invalid. This made it possible to set an additional data breakpoint (I color-marked the variables whose addresses trigger the stop):
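The pattern check described in point 2 can be sketched on a host machine as follows. This is a minimal illustration, not the macro used in the project: the names `stack_fill`, `stack_unused_bytes` and `STACK_FILL_BYTE` are hypothetical, and the fill value 0xA5 is simply the byte FreeRTOS uses for its own high-water-mark check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fill byte (FreeRTOS uses 0xA5 for its own check). */
#define STACK_FILL_BYTE 0xA5u

/* Fill a stack buffer with the known pattern before the task starts. */
static void stack_fill(uint8_t *stack, size_t size)
{
    memset(stack, STACK_FILL_BYTE, size);
}

/* Count the bytes at the low end that still hold the pattern.
 * Cortex-M stacks grow downwards, so the lowest addresses are used last.
 * A result of 0 suggests the stack was exhausted or overwritten. */
static size_t stack_unused_bytes(const uint8_t *stack, size_t size)
{
    size_t unused = 0;
    while (unused < size && stack[unused] == STACK_FILL_BYTE) {
        unused++;
    }
    return unused;
}
```

A check like this only reports how deep the stack has grown. A stray write from elsewhere can still damage a stack that looks two-thirds free, which would be consistent with the observation in point 2. In a FreeRTOS project, `uxTaskGetStackHighWaterMark()` gives the same kind of figure for a task.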

The data breakpoint showed that the error occurs at the moment of an RTOS context switch, in the interrupt handler xPortPendSVHandler. I used the "Watch" window to monitor the values:

Also, the data values in the buffers are corrupted. At first glance, this data is similar to the contents of the stack, but I'm not so sure.

Without breakpoints, the log output when the hard fault occurs is as follows:

HARD FAULT at 0x00000000 
R0: 0x20000464 R1: 0x00000000 R2: 0x00000000 R3: 0x20014D30 
R12: 0xFFFFFFFF LR: 0x00000000 PSR: 0x4000000E 
Cause: The processor has attempted to execute an instruction that makes illegal use of the EPSR.
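The PSR value in this log can be decoded by hand. The sketch below assumes the standard Cortex-M xPSR layout (Z flag in bit 30, Thumb bit in bit 24, exception number in bits 8:0). The type `xpsr_fields_t` and the function `xpsr_decode` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     thumb_bit;     /* EPSR.T, bit 24: must be 1 on Cortex-M */
    uint16_t exception_num; /* IPSR, bits 8:0: 0 means thread mode */
    bool     zero_flag;     /* APSR.Z, bit 30 */
} xpsr_fields_t;

/* Split a stacked xPSR word into the fields relevant to this fault. */
static xpsr_fields_t xpsr_decode(uint32_t xpsr)
{
    xpsr_fields_t f;
    f.thumb_bit     = ((xpsr >> 24) & 1u) != 0u;
    f.exception_num = (uint16_t)(xpsr & 0x1FFu);
    f.zero_flag     = ((xpsr >> 30) & 1u) != 0u;
    return f;
}
```

For 0x4000000E this gives exception number 14, which is PendSV on Cortex-M, and a cleared Thumb bit. A cleared Thumb bit is what produces the "illegal use of the EPSR" message. Together with PC and LR both reading 0, it is consistent with a fault inside the context switch. The stacked frame may itself be corrupted, though, so treat these values as a hint rather than proof.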

Disassembly window with 0x20014D30:

The value R0 = 0x20000464 lies in the stack address space.


So at the moment I can see the result (the program crashes) and the immediate cause (an error during the context switch), but I cannot understand why this error occurs.

1. If I understand correctly, the queue module uses critical sections internally for its write / read operations. How safe is it to use these functions under an RTOS?

2. Could a write to an nrf_queue coincide with a context switch? So far I have no idea how to catch that.

Below is a list of the interrupts I use. They all run at their default priority.


Used:

HW: Fanstel BT840 (nrf52840);

FW: SDK 15.3.0, project based on "ble_app_hrs_freertos". Added to the project: the nrf_queue module.

List of interrupts:

GPIOTE->EVENTS_IN (~250 Hz);

TWIM1_EVENT_ERROR / EVENT_STOPPED (~250 Hz);

NRF_TIMER4->CC (10Hz);

TWIM0_EVENT_ERROR / EVENT_STOPPED (10 Hz)

SW: SEGGER Embedded Studio 4.18

  • So, one more working day and some progress. It looks like I did not take into account the real interrupt priorities and the requirements of the RTOS.

    The file FreeRTOSConfig.h defines the lowest interrupt priority level and the highest level allowed for interrupts that call RTOS API functions (the ones with the FromISR suffix).


    I had also overlooked that Nordic recommends (or requires?) the following priority levels (file app_util_platform.h):

    #define _PRIO_SD_HIGH       0
    #define _PRIO_SD_MID        1
    #define _PRIO_APP_HIGH      2
    #define _PRIO_APP_MID       3
    #define _PRIO_SD_LOW        4
    #define _PRIO_SD_LOWEST     5
    #define _PRIO_APP_LOW       6
    #define _PRIO_APP_LOWEST    7
    #define _PRIO_THREAD        15

    Indeed, combining task priorities with MCU interrupt priorities is quite complicated.
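    The two constraints mentioned here can be written down as a small host-side check. This is a sketch under assumptions: it treats every level named _PRIO_SD_* in the list above as reserved (a conservative reading of the names), and the max-syscall level is passed in as a parameter because its actual value in FreeRTOSConfig.h is not quoted in this thread. All function names are hypothetical.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* True if the level carries a _PRIO_SD_* name in app_util_platform.h
     * (0, 1, 4, 5 in the excerpt quoted above). */
    static bool prio_reserved_for_softdevice(uint8_t prio)
    {
        return prio == 0 || prio == 1 || prio == 4 || prio == 5;
    }

    /* On Cortex-M a numerically lower value means a more urgent priority.
     * An ISR may call FreeRTOS ...FromISR() functions only if its priority
     * is numerically equal to or greater than the max-syscall level.
     * Both arguments are plain level numbers, as in the list above. */
    static bool isr_may_call_from_isr_api(uint8_t isr_prio, uint8_t max_syscall_prio)
    {
        return isr_prio >= max_syscall_prio;
    }

    /* Combined check for an application ISR that uses the RTOS API. */
    static bool isr_priority_ok(uint8_t isr_prio, uint8_t max_syscall_prio)
    {
        return !prio_reserved_for_softdevice(isr_prio) &&
               isr_may_call_from_isr_api(isr_prio, max_syscall_prio);
    }
    ```

    With a hypothetical max-syscall level of 2, levels 6 and 7 pass this check, while level 1 fails both conditions. On the target, the same comparison can be placed in an assert next to each NVIC priority setup.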

    But I would still like an answer to my question about using the queue module together with the RTOS. Are there any restrictions or recommendations?

    Thanks

  • Hi, good to hear that you have made progress. The SoftDevice reserves some of the interrupt priorities (if you're using Bluetooth), but I'm not sure how a wrong interrupt priority would cause the data corruption. Maybe it could be related to the issue Jakub mentions here: https://devzone.nordicsemi.com/f/nordic-q-a/46916/nrf_queue-will-not-fit-in-region-unplaced_sections-on-nrf52810/185496#185496

  • Hi,

    I saw this post earlier, when I did a preliminary search for possible problems before adding the queue module to the code, and I even applied the proposed fixes just in case.


    My question is related to this section of the code:

    /**@brief Macro for entering a critical region.
     *
     * @note Due to implementation details, there must exist one and only one call to
     *       CRITICAL_REGION_EXIT() for each call to CRITICAL_REGION_ENTER(), and they must be located
     *       in the same scope.
     */
    #ifdef SOFTDEVICE_PRESENT
    #define CRITICAL_REGION_ENTER()                                                             \
        {                                                                                       \
            uint8_t __CR_NESTED = 0;                                                            \
            app_util_critical_region_enter(&__CR_NESTED);
    #else
    #define CRITICAL_REGION_ENTER() app_util_critical_region_enter(NULL)
    #endif
    
    /**@brief Macro for leaving a critical region.
     *
     * @note Due to implementation details, there must exist one and only one call to
     *       CRITICAL_REGION_EXIT() for each call to CRITICAL_REGION_ENTER(), and they must be located
     *       in the same scope.
     */
    #ifdef SOFTDEVICE_PRESENT
    #define CRITICAL_REGION_EXIT()                                                              \
            app_util_critical_region_exit(__CR_NESTED);                                         \
        }
    #else
    #define CRITICAL_REGION_EXIT() app_util_critical_region_exit(0)
    #endif

    Interrupts are temporarily disabled and then re-enabled. If these macros are used in low-priority tasks, can this lead to the situation known as priority inversion (a low-priority thread blocking a high-priority one)?
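    The "one exit for each enter" rule from the macro comments can be illustrated with a small host-side model. This is not the SDK implementation: `irq_enabled` and `nesting` are hypothetical stand-ins for the hardware interrupt mask and a nesting counter, and the model covers only the nesting idea, not the SoftDevice-specific path.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Host-side stand-ins for the hardware state (hypothetical names). */
    static bool     irq_enabled = true;
    static uint32_t nesting     = 0;

    /* Enter a critical region: mask interrupts, then count the nesting level. */
    static void model_region_enter(void)
    {
        irq_enabled = false;
        nesting++;
    }

    /* Leave a critical region: unmask only when the outermost region ends. */
    static void model_region_exit(void)
    {
        if (nesting > 0 && --nesting == 0) {
            irq_enabled = true;
        }
    }
    ```

    In this model a missing exit leaves interrupts masked indefinitely, and an extra exit is ignored. While a region is active, a pending interrupt is only delayed, so the expected cost is latency for higher-priority work rather than corrupted data.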


    As far as I understand, this is what happened: a high-priority interrupt occurred at the same time as a context switch. The contents of the registers (R0 - R12) then seemed to overlap, and instead of being saved to the stack they were written to an arbitrary memory area.
    That would explain why the contents of the data buffers looked like something that belongs in the stack area.

    I could investigate this in more detail with a profiler, SEGGER SystemView for example, but I had difficulty integrating it into the project. The example you provide was created for the S132 stack and the nRF52832 device. Is it possible to adapt that code to the S140 stack (nRF52840), or does your team have a ready-made solution?

  • Hi, I don't think priority inversion should lead to corruption and writes to an arbitrary memory area, but I'm also not sure what's causing the memory corruption. Have you considered using the FreeRTOS queue module? https://www.freertos.org/a00018.html

  • Good morning,
    No, I use the queue module from the nRF SDK (nrf_queue.c / nrf_queue.h).

    It seemed to me that it is more flexible (and yes, I like its API) and better suited to my tasks (buffering data before and after processing).

    At the moment, after changing the priorities, the problem no longer reproduces. But the question of what exactly caused it remains unanswered :(

  • CheMax said:
    At the moment, after changing the priorities, the problem no longer reproduces. But the question of what exactly caused it remains unanswered

     Any task synchronization mechanism can cause a temporary priority inversion, and this is expected. It should not cause stack corruption, because it is permitted and, as I said, very much expected. It looks like the data pointers in your lower-priority thread are not sanity-checked before being dereferenced. There seems to be a design flaw in how the two tasks communicate. I would recommend that you at least sanity-check any data pointers that are accessed by more than one task.
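    One possible form of such a sanity check is sketched below. It assumes that the tasks exchange pointers into a fixed buffer pool. The pool layout and the names `pool`, `POOL_SLOTS`, `POOL_SLOT_SIZE` and `ptr_is_valid_slot` are hypothetical and not taken from the project.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical shared buffer pool that tasks exchange pointers into. */
    #define POOL_SLOTS     8u
    #define POOL_SLOT_SIZE 32u
    static uint8_t pool[POOL_SLOTS][POOL_SLOT_SIZE];

    /* True only if p points at the start of a slot inside the pool. */
    static bool ptr_is_valid_slot(const void *p)
    {
        uintptr_t addr  = (uintptr_t)p;
        uintptr_t start = (uintptr_t)&pool[0][0];
        uintptr_t end   = start + sizeof pool;

        if (p == NULL || addr < start || addr >= end) {
            return false;
        }
        /* Reject pointers that land in the middle of a slot. */
        return ((addr - start) % POOL_SLOT_SIZE) == 0u;
    }
    ```

    A consumer task would call this before dereferencing a pointer it received and drop or log the item on failure. The check catches wild or stale pointers, but it does not replace proper synchronization of the shared pointer itself.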

  • I apologize for my long absence and silence in this topic; I was on leave following the birth of my daughter.

    Yes, the problem did turn out to be incorrectly organized access to a shared resource (a pointer to data). In addition, thread and interrupt priorities must be set correctly.

    I believe the answer was found during the discussion. I have marked the answers that I consider correct.
