MPU FAULT, Stack Guard, and Hard Faults


Hello, I'm using the nRF Connect SDK v2.3.0 on a 5340 and 7200 platform. I'm trying to get hooks into some of the lower-level fault handling.


Reading through DevZone tickets, it looks like STACK_SENTINEL is not needed if the MPU is enabled.


Is it true that the MPU will detect stack overflows on each of my threads?


Is there an example of how to detect when this happens, i.e. a callback that I can override to detect it and store some information in NVS?

Related, is there an example of how to do the same thing generically with the hard fault handler?

There are a number of hits and documentation pages on DevZone, but many are out of date, for example:

https://devzone.nordicsemi.com/f/nordic-q-a/47040/stack-overflow-and-stack-guard
https://devzone.nordicsemi.com/f/nordic-q-a/90620/stack-guard-module-raises-hard-fault-not-mmu-fault
https://infocenter.nordicsemi.com/index.jsp?topic=%2Fsdk_nrf5_v17.1.0%2Flib_hardfault.html&cp=8_1_3_22

Thanks

W

  • Hello,

    Yes, the posts you linked were for the nRF5 SDK. You can read about how the stack guard is implemented for Cortex devices in the Zephyr documentation here: MPU-assisted stack overflow detection and Stack limit checking (Arm v8-M). The latter will be selected by default on the nRF5340 since the CPUs include this feature.

    With regard to storing crash information to flash, please see the discussion in this forum thread: Saving coredumps to external flash

    Best regards,

    Vidar

  • Is there an example of how to detect when this happens, i.e. a callback that I can override to detect it and store some information in NVS?

    I forgot to add that you are free to provide your own implementation of the common k_sys_fatal_error_handler if you want to process the faults from your application, e.g. to store error information to flash.

    https://developer.nordicsemi.com/nRF_Connect_SDK/doc/2.4.0/zephyr/kernel/services/other/fatal.html 
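
A minimal sketch of how the `reason` argument passed to k_sys_fatal_error_handler could be used to tell a stack overflow apart from other fatal errors. The enum below is a host-side mirror of Zephyr's `enum k_fatal_error_reason` so that the snippet builds outside the SDK; on target, include `<zephyr/fatal.h>` and drop the local enum. `fatal_reason_name()` is a hypothetical helper, not part of Zephyr. On Arm Cortex-M a stack guard violation is expected to be reported as `K_ERR_STACK_CHK_FAIL`, but this is worth verifying on your SDK version.

```c
/* Host-side mirror of Zephyr's enum k_fatal_error_reason (assumed to match
 * the SDK headers; on target, include <zephyr/fatal.h> instead). */
enum k_fatal_error_reason {
    K_ERR_CPU_EXCEPTION,   /* generic CPU exception */
    K_ERR_SPURIOUS_IRQ,    /* unhandled hardware interrupt */
    K_ERR_STACK_CHK_FAIL,  /* stack overflow detected */
    K_ERR_KERNEL_OOPS,     /* k_oops() */
    K_ERR_KERNEL_PANIC,    /* k_panic() */
};

/* Hypothetical helper: map a reason code to a short tag that can be stored
 * in a crash record or log line. */
const char *fatal_reason_name(unsigned int reason)
{
    switch (reason) {
    case K_ERR_CPU_EXCEPTION:
        return "cpu_exception";
    case K_ERR_SPURIOUS_IRQ:
        return "spurious_irq";
    case K_ERR_STACK_CHK_FAIL:
        return "stack_check_fail";
    case K_ERR_KERNEL_OOPS:
        return "kernel_oops";
    case K_ERR_KERNEL_PANIC:
        return "kernel_panic";
    default:
        /* Architecture-specific or otherwise unrecognised codes */
        return "arch_specific_or_unknown";
    }
}
```

Inside a custom k_sys_fatal_error_handler, comparing `reason` against `K_ERR_STACK_CHK_FAIL` would be the hook for stack-overflow-specific handling.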

  • Thank you. I am able to override the fatal handler if I set CONFIG_RESET_ON_FATAL_ERROR=n so that the NCS library does not override it.

    An issue that I can't seem to get around is that I am only able to save assert information to my serial flash if I assert during a shell command, and I can't save fatal error reasons at all.  Is this due to the context of a shell command being a more privileged mode?  
    Here is my assert_post_action function:
    void assert_post_action(const char *file, unsigned int line)
    {
        struct fs_file_t assert_file;
        char path[MAX_PATH_LEN];
        char assert_msg[ASSERT_STR_LEN_MAX];
    
        /* Disable interrupts, this is unrecoverable */
        (void)irq_lock();
    
        struct fs_mount_t *mp = storage_mount_point_get();
        snprintf(path, sizeof(path), "%s/%s", mp->mnt_point, "assert.json");
        fs_unlink(path); // delete old assert file if it exists
    
        // create JSON string with assert info, previous boot count, and system uptime
        snprintf(assert_msg, sizeof(assert_msg),
                 "{\"file\":\"%s\","
                 "\"line\":%u,"
                 "\"boot_count\":%d,"
                 "\"uptime\":%lld"
                 "}",
                 file,
                 line,
                 boot_count_get() - 1,
                 (long long)(k_uptime_get() / 1000));
    
        fs_file_t_init(&assert_file);
        int rc = fs_open(&assert_file, path, FS_O_CREATE | FS_O_RDWR);
        if (rc >= 0)
        {
            fs_write(&assert_file, assert_msg, strlen(assert_msg));
            fs_close(&assert_file);
        }
    
        /* User threads aren't allowed to induce kernel panics; generate
         * an oops instead.
         */
        if (k_is_user_context())
        {
            k_oops();
        }
        else
        {
            k_panic();
        }
    
        // sys_reboot(SYS_REBOOT_COLD);
        CODE_UNREACHABLE;
    }

    Removing the irq_lock() call has no effect; the attempt to write to the file system still fails.
    Note that the fatal error handler is called after this assert_post_action function.
    The fatal error handler I wrote tries to write to a different file (fatal.txt), also with no success, even if k_oops() or k_panic() is called from a shell function.
    static int cmd_fatal(const struct shell *shell, size_t argc, char **argv)
    {
        ARG_UNUSED(argc);
        ARG_UNUSED(argv);
        shell_print(shell, "Device will now throw a fatal error");
        k_oops();
    
        // Function should not return
    
        return 0;
    }
    
    
    
    // we are able to override the default weak handler with this if CONFIG_RESET_ON_FATAL_ERROR=n
    void k_sys_fatal_error_handler(unsigned int reason,
                                   const z_arch_esf_t *esf)
    {
    
        led0_init();
        for (int i = 0; i < 10; i++)
        {
            led0_set(true);
            k_busy_wait(100000);
            led0_set(false);
            k_busy_wait(100000);
        }
    
        /* Disable interrupts, this is unrecoverable */
        (void)irq_lock();
    
        // printk("FATAL ERROR\n");
    
        struct fs_file_t fatal_file;
        char path[MAX_PATH_LEN];
        char fatal_msg[128];
    
        struct fs_mount_t *mp = storage_mount_point_get();
        snprintf(path, sizeof(path), "%s/%s", mp->mnt_point, "fatal.txt");
        fs_unlink(path); // delete old fatal file if it exists
    
        snprintf(fatal_msg, sizeof(fatal_msg), "FATAL ERROR reason = %u\n", reason);
        fs_file_t_init(&fatal_file);
    
        int rc = fs_open(&fatal_file, path, FS_O_CREATE | FS_O_RDWR);
        if (rc >= 0)
        {
            fs_write(&fatal_file, fatal_msg, strlen(fatal_msg));
            fs_close(&fatal_file);
        }
    
        sys_reboot(SYS_REBOOT_COLD);
        CODE_UNREACHABLE;
    }
    
    Do you have an explanation of why my serial flash file system writes only seem to work in my assert_post_action function if I assert in a shell command?

    Best regards, thanks for the help.


    Wally
  • Hi Wally,

    Depending on the source of the error and where it occurred, the error handler may be invoked in an interrupt context, for instance from the hardfault IRQ, which will block all other interrupts in your system. There is also a note about not relying on RTOS primitives in the Memfault implementation, which can be found here: https://github.com/nrfconnect/sdk-nrf/blob/main/modules/memfault-firmware-sdk/memfault_flash_coredump_storage.c#L15. RTOS primitives are probably being used in your serial driver.

    I believe it's safest to store the error information in RAM if you have spare memory available, as I previously suggested in this post: RE: Saving coredumps to external flash

    Best regards,

    Vidar
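
A rough sketch of the "store it in RAM first" approach suggested above: the fatal handler only fills a small record, and the file system write happens after the reboot from normal thread context. All names here (`crash_record`, `crash_record_save`, `crash_record_take`, `CRASH_MAGIC`) are hypothetical. The snippet is plain C so it builds on a host; on target the record would have to live in RAM that is not zeroed at startup (for example a variable tagged with Zephyr's `__noinit`), and it is an assumption that the contents survive the reset type you use.

```c
#include <stdbool.h>
#include <stdint.h>

#define CRASH_MAGIC 0xC0DEFA17u /* arbitrary marker for a populated record */

struct crash_record {
    uint32_t magic;    /* CRASH_MAGIC when the record has been filled in */
    uint32_t reason;   /* reason code passed to the fatal handler */
    uint32_t uptime_s; /* uptime in seconds at the time of the fault */
    uint32_t check;    /* magic ^ reason ^ uptime_s, guards against stale RAM */
};

/* On target: place in a non-initialised RAM section (assumption, see above). */
static struct crash_record crash_rec;

/* Intended to be called from the fatal error handler: no file system,
 * no locks, no RTOS primitives, only plain memory writes. */
void crash_record_save(uint32_t reason, uint32_t uptime_s)
{
    crash_rec.reason = reason;
    crash_rec.uptime_s = uptime_s;
    crash_rec.check = CRASH_MAGIC ^ reason ^ uptime_s;
    crash_rec.magic = CRASH_MAGIC;
}

/* Intended to be called early after boot from a normal thread. Copies the
 * record to *out and returns true if a valid one is pending; the stored
 * record is invalidated either way so it is reported at most once. */
bool crash_record_take(struct crash_record *out)
{
    bool valid = (crash_rec.magic == CRASH_MAGIC) &&
                 (crash_rec.check ==
                  (CRASH_MAGIC ^ crash_rec.reason ^ crash_rec.uptime_s));

    if (valid) {
        *out = crash_rec;
    }
    crash_rec.magic = 0;
    return valid;
}
```

With something like this, k_sys_fatal_error_handler would call `crash_record_save()` and then reboot, and the application would call `crash_record_take()` at startup and write fatal.txt from there, where the flash driver and file system can use RTOS primitives normally.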
