Saving coredumps to external flash

Hi,

Chip: nRF52840

OS: nRF Connect / Zephyr: v2.3.0

Problem: Saving coredumps to external flash

We're trying to add a bunch of debugging features to our firmware before field trials.

We've been trying to save coredumps to external flash, so that on the next reboot we can upload them to the cloud.

It seems that Zephyr does support this using "CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION". However, this doesn't seem to be supported on Nordic chips, for a few reasons.

I've seen another question on this subject posted about a year ago. The recommendation was to use Memfault, but it seems odd to need a commercial cloud solution simply to save coredumps to flash.

What we've tried

Unsupported fields

nordic,qspi-nor.yaml does not include "soc-nv-flash.yaml", which means the properties "erase-block-size" and "write-block-size" don't exist. These are required by the flash backend implementation.

Using QSPI in fatal handler

After adding the above fields and configuring our pm_static.yml file, everything compiles. However, when triggering a crash, coredumps are no longer created. It seems like the fatal handler itself is crashing (no CLI output).

If I comment out all the flash operations in "coredump_backend_flash_partition.c", such as "flash_area_erase" and "flash_area_write", the fatal handler does run (CLI output appears); obviously the coredump is then not saved.

I suppose my question is: can you use QSPI while in the fatal handler? I'm not sure whether the QSPI driver needs re-initialising, interrupts need re-enabling, etc.

Summary

Are there any examples where coredumps are saved to external flash using the nRF Connect SDK?

Cheers.

  • Hello,

    I don't think it's a good idea to rely on Zephyr drivers such as QSPI NOR when you are in the fault handler, because you don't know what state the system will be in. Would it be an option to store the coredump in a "no init" RAM section and have it written to flash on the subsequent reboot?

    I've used this approach when debugging WDT timeouts in the past (the WDT does not give you enough time to write anything to flash before resetting):

    __noinit static z_arch_esf_t esf;
    __noinit static uint32_t esf_crc;
    void dump_stack(uint32_t *p_msp)
    {
        /* Store stack frame along with a checksum value to the __noinit section in RAM  */
        memcpy(&esf, p_msp, sizeof(esf));
        esf_crc = crc32_ieee((uint8_t *)&esf, sizeof(esf));
        /* Wait for the impending Watchdog reset */
        while (1);
    }
    #if WDT_ALLOW_CALLBACK
    static void wdt_callback(const struct device *wdt_dev, int channel_id)
    {
        /* Get current stack frame from the process stack. 
         * 
         * TODO: implement logic to determine if the application
         * was running in thread or handler mode prior to the WDT interrupt.
         * For handler mode we would have to use the main stack pointer instead. 
         */
        __ASM(" mrs r0, psp          \n"
              " ldr r3, = dump_stack \n"
              " bx r3                \n");
    }
    #endif /* WDT_ALLOW_CALLBACK */
    /* To be called on startup to check if a new exception frame has been stored by our wdt_callback() */
    void wdt_startup_check(void)
    {
        uint32_t computed_crc = crc32_ieee((uint8_t *)&esf, sizeof(esf));
        if (computed_crc == esf_crc) {
            printk("Exception stack frame:\n\r");
            printk("r0/a1:  0x%08x  r1/a2:  0x%08x  r2/a3:  0x%08x\n\r", esf.basic.a1,
                esf.basic.a2, esf.basic.a3);
            printk("r3/a4:  0x%08x r12/ip:  0x%08x r14/lr:  0x%08x\n\r", esf.basic.a4,
                esf.basic.ip, esf.basic.lr);
            printk("xpsr:  0x%08x  pc: 0x%08x\n\r", esf.basic.xpsr, esf.basic.pc);
            esf_crc = 0; // Invalidate CRC
        } else {
            printk("CRC mismatch\n\r");
        }
    }
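
The TODO in wdt_callback() is about deciding which stack holds the stacked exception frame. On Cortex-M this is encoded in the EXC_RETURN value the core loads into LR on exception entry: bit 2 selects MSP (0) or PSP (1), and bit 3 tells whether the interrupted code ran in handler mode (0) or thread mode (1). Below is a minimal, host-testable sketch of that decision. The helper names are hypothetical and not part of Zephyr or the nRF Connect SDK, and it assumes EXC_RETURN has been captured at exception entry, which the callback above does not do (by the time a driver callback runs, LR no longer holds EXC_RETURN).

```c
#include <stdbool.h>
#include <stdint.h>

/* EXC_RETURN bit 2: 0 = frame stacked on MSP, 1 = frame stacked on PSP. */
#define EXC_RETURN_SPSEL_BIT (1u << 2)
/* EXC_RETURN bit 3: 0 = return to handler mode, 1 = return to thread mode. */
#define EXC_RETURN_MODE_BIT  (1u << 3)

/* Hypothetical helper: true if the exception frame was pushed to the process stack. */
static bool frame_is_on_psp(uint32_t exc_return)
{
    return (exc_return & EXC_RETURN_SPSEL_BIT) != 0u;
}

/* Hypothetical helper: true if the interrupted code was running in thread mode. */
static bool interrupted_thread_mode(uint32_t exc_return)
{
    return (exc_return & EXC_RETURN_MODE_BIT) != 0u;
}

/* Pick the stack pointer value that points at the stacked frame. */
static uint32_t select_frame_sp(uint32_t exc_return, uint32_t msp, uint32_t psp)
{
    return frame_is_on_psp(exc_return) ? psp : msp;
}
```

With the common values 0xFFFFFFFD (thread mode, PSP) and 0xFFFFFFF1 (handler mode, MSP), select_frame_sp() returns psp and msp respectively.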

    Cheers,

    Vidar

  • I've got this working. Personally, I think this area is lacking in the nRF SDK. Everyone shipping thousands of IoT devices with no remote debugging capability seems silly, and surely we shouldn't all keep re-inventing the wheel. I'll see if I get time to write a module that implements everything.

    It would still be good to be able to write directly to flash/QSPI, as that way we could take a full memory dump.

    Anyway, a few pointers for anyone else interested:

    If you're using an nRF52, there's a really good example of how to use retained memory here: https://github.com/nrfconnect/sdk-zephyr/blob/main/samples/boards/nrf/system_off/src/retained.c

    To get the ESF you can override Zephyr's fatal function and simply store the values in retained memory (as the above example shows). 

    void k_sys_fatal_error_handler(unsigned int reason, const z_arch_esf_t *esf_input)
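
As a rough illustration of what the overridden handler could store, here is a minimal sketch that assumes a small record guarded by a magic value and a CRC, so that random RAM contents after a cold boot are not mistaken for a crash. The struct, the function names, and the magic value are all hypothetical. The CRC is implemented inline only to keep the sketch self-contained and host-testable; on target, crc32_ieee() as used earlier in this thread would do. Inside k_sys_fatal_error_handler() one would pass `reason`, `esf_input->basic.pc` and `esf_input->basic.lr`, and the record itself would live in retained or __noinit RAM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAULT_RECORD_MAGIC 0xC0DEFA17u /* arbitrary marker (assumption) */

/* Hypothetical record layout; place it in retained / __noinit RAM on target. */
struct fault_record {
    uint32_t magic;
    uint32_t reason;
    uint32_t pc;
    uint32_t lr;
    uint32_t crc; /* CRC-32 over all fields above */
};

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Fill in the record and seal it with a CRC. */
static void fault_record_store(struct fault_record *rec, uint32_t reason,
                               uint32_t pc, uint32_t lr)
{
    rec->magic = FAULT_RECORD_MAGIC;
    rec->reason = reason;
    rec->pc = pc;
    rec->lr = lr;
    rec->crc = crc32_calc((const uint8_t *)rec, offsetof(struct fault_record, crc));
}

/* To be called on boot: true only if a sealed record is present. */
static bool fault_record_valid(const struct fault_record *rec)
{
    return rec->magic == FAULT_RECORD_MAGIC &&
           rec->crc == crc32_calc((const uint8_t *)rec, offsetof(struct fault_record, crc));
}

/* Invalidate the record once it has been reported or uploaded. */
static void fault_record_clear(struct fault_record *rec)
{
    rec->magic = 0u;
    rec->crc = 0u;
}
```

On boot, the application would check fault_record_valid(), report the record, and then call fault_record_clear() so the same crash is not reported twice.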

    To save the core dump you'll have to create a file called "coredump_backend_empty.c" using this as a template https://github.com/zephyrproject-rtos/zephyr/blob/main/tests/subsys/debug/coredump_backends/src/coredump_backend_empty.c

    You can then simply "memcpy" the coredump into retained memory in this function:

    static void coredump_empty_backend_buffer_output(uint8_t *c, size_t buflen)
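
The backend's output function may be called several times per dump, so the copy has to append rather than overwrite, and it should not run past the retained buffer. The following is a minimal sketch of such a bounded append, under stated assumptions: the struct, the function names, and the 1024-byte capacity are hypothetical and would need to match the retained RAM actually available. The output function above would then call retained_dump_append(&dump, c, buflen).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RETAINED_DUMP_CAPACITY 1024u /* assumed size; tune to the retained RAM available */

/* Hypothetical container; place it in retained / __noinit RAM on target. */
struct retained_dump {
    uint32_t length;  /* bytes stored so far */
    bool truncated;   /* set when some output did not fit */
    uint8_t data[RETAINED_DUMP_CAPACITY];
};

/* Start a new dump (e.g. from the backend's start hook). */
static void retained_dump_reset(struct retained_dump *d)
{
    d->length = 0u;
    d->truncated = false;
}

/* Append up to buflen bytes; returns the number of bytes actually stored. */
static size_t retained_dump_append(struct retained_dump *d, const uint8_t *buf, size_t buflen)
{
    size_t room = RETAINED_DUMP_CAPACITY - d->length;
    size_t n = (buflen < room) ? buflen : room;

    if (n > 0u) {
        memcpy(&d->data[d->length], buf, n);
        d->length += (uint32_t)n;
    }
    if (n < buflen) {
        d->truncated = true; /* remember that the dump is incomplete */
    }
    return n;
}
```

Recording truncation explicitly lets the upload code on the next boot tell a complete dump from a clipped one.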

    I use the following settings to keep the coredump small enough for RAM retention:

    CONFIG_DEBUG_COREDUMP=y
    CONFIG_DEBUG_COREDUMP_BACKEND_OTHER=y
    CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_LINKER_RAM=n
    CONFIG_DEBUG_COREDUMP_SHELL=y
    CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_MIN=y

    Hope that helps someone.

  • Thank you for that. I will look into it. For some reason there is very little documentation on coredump, and Zephyr's coredump flash partition backend does not seem to dump anything when I force a hard fault. I dislike that Nordic pushes you towards Memfault instead of trying to find a solution for offline debugging.

  • Memfault is not intended to replace the debug support in the SDK. Instead, it serves as an additional option for those who want remote device management without setting up and managing their own cloud solution. I've updated my internal feature request to highlight the need for a way to save core dumps to flash.

    Documentation for the Core dump module can be found here: https://developer.nordicsemi.com/nRF_Connect_SDK/doc/2.4.2/zephyr/services/debugging/coredump.html. There is currently no flash backend to save the output to a flash partition.

  • Do you plan on implementing it, or do we have to write our own code for that for now?

  • Sorry, there is a flash backend; it's just that our documentation has not been updated to include it yet. It is currently covered in the upstream Zephyr documentation at https://docs.zephyrproject.org/latest/services/debugging/coredump.html 

    I tried testing the coredump module with the Bluetooth: Peripheral LBS sample in nRF Connect SDK v2.4.2 and was able to get it to work after I created this workaround in the flash driver:

    diff --git a/drivers/mpsl/flash_sync/flash_sync_mpsl.c b/drivers/mpsl/flash_sync/flash_sync_mpsl.c
    index eea9cadaa..c37ebd78c 100644
    --- a/drivers/mpsl/flash_sync/flash_sync_mpsl.c
    +++ b/drivers/mpsl/flash_sync/flash_sync_mpsl.c
    @@ -140,9 +140,15 @@ void nrf_flash_sync_set_context(uint32_t duration)
     	_context.request_length_us = duration;
     }
     
    +bool is_in_fault_isr(void)
    +{
    +	uint32_t isr = __get_IPSR();
    +	return (isr >= 3 && isr <= 6);
    +}
    +
     bool nrf_flash_sync_is_required(void)
     {
    -	return mpsl_is_initialized();
    +	return mpsl_is_initialized() && !is_in_fault_isr();
     }
     
     int nrf_flash_sync_exe(struct flash_op_desc *op_desc)
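
For readers wondering about the range check in the patch: IPSR holds the number of the exception currently being serviced, and on Cortex-M the values 3 to 6 are HardFault, MemManage, BusFault and UsageFault. A small host-testable sketch of the same check with the numbers spelled out; the macro and function names here are hypothetical, only the exception numbers come from the architecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cortex-M exception numbers as reported in IPSR (0 means thread mode). */
#define EXC_NUM_HARDFAULT  3u
#define EXC_NUM_MEMMANAGE  4u
#define EXC_NUM_BUSFAULT   5u
#define EXC_NUM_USAGEFAULT 6u

/* The exception number occupies the low 9 bits of IPSR. */
#define IPSR_EXC_MASK 0x1FFu

/* Hypothetical helper: true if the given IPSR value denotes a fault handler. */
static bool ipsr_is_fault(uint32_t ipsr)
{
    uint32_t exc = ipsr & IPSR_EXC_MASK;

    return exc >= EXC_NUM_HARDFAULT && exc <= EXC_NUM_USAGEFAULT;
}
```

Thread mode (0), NMI (2) and external interrupts (16 and up) fall outside the range, so flash operations issued from those contexts still go through the normal MPSL synchronisation.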

    Summary of changes made to the peripheral LBS sample to enable and test the Coredump module

    Added the following lines to the prj.conf file:

    # Enable coredump with flash backend
    CONFIG_SHELL=y
    CONFIG_DEBUG_COREDUMP_SHELL=y
    CONFIG_DEBUG_COREDUMP=y
    CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y
    CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_MIN=y
    CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_LINKER_RAM=n
    # Must be disabled to allow flash write from exception handler.
    # https://github.com/zephyrproject-rtos/zephyr/issues/59116
    CONFIG_ASSERT=n

    And the following code in main.c to raise a fault when the button is pressed (Button 1 on the DK):

    static void do_fault(void)
    {
    	*(uint32_t *) 0xFFFFFFFF = 1;
    }
    
    static void button_changed(uint32_t button_state, uint32_t has_changed)
    {
    	if (has_changed & USER_BUTTON) {
    		uint32_t user_button_state = button_state & USER_BUTTON;
    
    		bt_lbs_send_button_state(user_button_state);
    		app_button_state = user_button_state ? true : false;
    		do_fault();
    		
    	}
    }

    And lastly, the Devicetree overlay to allocate the coredump partition in flash (this is for the nRF52840):

    &flash0 {
        partitions {
            storage_partition: partition@f8000 {
                reg = <0x000f8000 0x00004000>;
            };
            coredump_partition: partition@fc000 {
                label = "coredump_partition";
                reg = <0x000fc000 0x00004000>;
            };
    
        };
    };
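
When adapting the overlay to another layout, it helps to sanity-check the numbers: the two partitions above are 16 KB each (four 4 KB flash pages), sit back to back, and end exactly at the top of the nRF52840's 1 MB internal flash. A small host-side sketch of those checks; the struct and helper names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define NRF52840_FLASH_SIZE 0x00100000u /* 1 MB internal flash */
#define NRF52840_PAGE_SIZE  0x00001000u /* 4 KB erase page */

/* Hypothetical description of one fixed partition (offset and size from `reg`). */
struct partition {
    uint32_t offset;
    uint32_t size;
};

/* True if the partition is non-empty and lies entirely within the flash. */
static bool partition_fits(const struct partition *p, uint32_t flash_size)
{
    return p->size != 0u && p->offset <= flash_size && p->size <= flash_size - p->offset;
}

/* True if offset and size are both multiples of the erase page size. */
static bool partition_page_aligned(const struct partition *p, uint32_t page_size)
{
    return (p->offset % page_size) == 0u && (p->size % page_size) == 0u;
}

/* True if the two partitions share at least one byte (assumes both fit in flash). */
static bool partitions_overlap(const struct partition *a, const struct partition *b)
{
    return a->offset < b->offset + b->size && b->offset < a->offset + a->size;
}
```

Filling in { 0x000f8000, 0x00004000 } for the storage partition and { 0x000fc000, 0x00004000 } for the coredump partition passes all three checks.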

    Testing

    1. After the fault has been triggered, connect a serial terminal to the board to access the stored coredump via the shell

    2. Copy the coredump data to a text file, e.g. coredump.log. Then perform steps 1 to 4 here: https://docs.zephyrproject.org/latest/services/debugging/coredump.html#example

    Result

    peripheral_lbs_coredump_flash$ arm-zephyr-eabi-gdb build/zephyr/zephyr.elf 
    GNU gdb (Zephyr SDK 0.16.0) 12.1
    Copyright (C) 2022 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    Type "show copying" and "show warranty" for details.
    This GDB was configured as "--host=x86_64-build_pc-linux-gnu --target=arm-zephyr-eabi".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <https://github.com/zephyrproject-rtos/sdk-ng/issues>.
    Find the GDB manual and other documentation resources online at:
        <http://www.gnu.org/software/gdb/documentation/>.
    
    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from build/zephyr/zephyr.elf...
    (gdb) target remote localhost:1234
    Remote debugging using localhost:1234
    0x000113b0 in button_changed (button_state=<optimized out>, has_changed=<optimized out>) at ../src/main.c:177
    177     }
    (gdb) bt
    #0  0x000113b0 in button_changed (button_state=<optimized out>, has_changed=<optimized out>) at ../src/main.c:177
    #1  0x00000000 in ?? ()
    (gdb) info registers 
    r0             0xfffffff3          -13
    r1             0x1                 1
    r2             0x1                 1
    r3             0xfffff000          -4096
    r4             0x0                 0
    r5             0x0                 0
    r6             0x0                 0
    r7             0x0                 0
    r8             0x0                 0
    r9             0x0                 0
    r10            0x0                 0
    r11            0x0                 0
    r12            0xffffffff          -1
    sp             0x2000ab18          0x2000ab18 <sys_work_q_stack+2008>
    lr             0x113b1             70577
    pc             0x113b0             0x113b0 <button_changed+28>
    xpsr           0x1000000           16777216
    (gdb) 

    Project

    peripheral_lbs_coredump_flash.zip

  • Hi,

    I did explore this, but I had issues with writing to external flash.

    I think the Zephyr method only supports writing to the microcontroller's internal flash.

    Cheers,

    Sam
