Having issues saving a coredump to flash, or at all

Hi Nordic

I am working with the nRF52840 and nRF52832 using NCS v2.8.0.

I am trying to save a coredump to flash following the instructions at https://docs.nordicsemi.com/bundle/ncs-2.8.0/page/zephyr/services/debugging/coredump.html

I added this to my pm_static_my_board.yml:

coredump_partition:
  address: 0xCF000
  size: 0x8000
  region: flash_primary

And this to my_board.overlay:

&flash0 {
    /*
     * For more information, see:
     * http://docs.zephyrproject.org/latest/guides/dts/index.html#flash-partitions
     */
    partitions {
        compatible = "fixed-partitions";
        #address-cells = <1>;
        #size-cells = <1>;

      ...
        coredump_partition: partition@000080000 { /* This is not a legitimate address (end of flash), but it is not taken into account because pm_static takes precedence */
            label = "coredump-partition";
            reg = <0x000080000 DT_SIZE_K(4)>;
        };
    };
};

As a side note, it seems strange that I need to define the partition in the overlay at all, since it is basically ignored: the pm_static partitions are the ones that actually matter (unless I got something wrong?).

And these configs to my prj.conf:

# Coredump 
CONFIG_DEBUG_COREDUMP=y
CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y
CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_THREADS=y

In my my_board/my_app/zephyr/.config I see these coredump-related configs:

CONFIG_ARCH_SUPPORTS_COREDUMP=y
CONFIG_ARCH_SUPPORTS_COREDUMP_THREADS=y

# CONFIG_COREDUMP_DEVICE is not set

CONFIG_DEBUG_THREAD_INFO=y
CONFIG_DEBUG_COREDUMP=y
# CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING is not set
CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y
# CONFIG_DEBUG_COREDUMP_BACKEND_OTHER is not set
# CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_MIN is not set
CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_THREADS=y
# CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_LINKER_RAM is not set
CONFIG_DEBUG_COREDUMP_FLASH_CHUNK_SIZE=64
CONFIG_DEBUG_COREDUMP_THREADS_METADATA=y

I am generating a coredump using this implementation:

void trigger_coredump(void)
{
    __ASSERT(0, "Forcing coredump");
}

When I read the flash area after generating the coredump with nrfjprog --memrd 0xCF000 --w 32 --n 0x8000,
I get all 0xFF.

What am I missing?

I also tried to check myself by replacing CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y

with CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING=y,

hoping to see the coredump on my open RTT session, but nothing: when the coredump is triggered, the prints just stop.

  1. What am I missing? Why can't I find a coredump in the flash partition or in the RTT log?
  2. Could it be that the device does not have time to write the coredump before the actual crash? If so, how can I handle that?
  3. Is the flash partition holding the coredump erased automatically so that new coredumps can be saved, or is that something I have to manage myself after reading the coredump from flash?

Hope to hear from you soon

Best regards

Ziv

  • Hi Vidar

    What is relevant here is whether you have CONFIG_ASSERT enabled in your build or not

    We use asserts a lot in our code, and Zephyr also uses them internally, which is why this is weird to me and why I am trying to avoid it. Plus, I don't seem to be getting to the point of writing the coredump to flash at the moment. Anyway, I'll try running with it disabled and see.

    I cannot use RAM for saving logs or the coredump, since the devices in the field are configured with logs disabled and I need to know what happened there in retrospect, after they reset and reconnect with a node.

    I don't mind using the internal flash or the external one (external would obviously be more convenient, but internal will work for now as well).

    You can simply redefine the weakly defined k_sys_fatal_error_handler() function in your application

    The API seems to take a fatal error reason and an exception context. Are those filled in automatically?

    Just to be sure: if I override this API, then I do not have to set CONFIG_ASSERT=n?

    P.S. This API seems to be called from inside z_fatal_error(), which I don't think I am getting into, since I don't see its prints and a breakpoint at the beginning of it was never hit when debugging.

  • We use asserts a lot in our code, and Zephyr also uses them internally, which is why this is weird to me and why I am trying to avoid it. Plus, I don't seem to be getting to

    Please read my previous comments where I try to explain what the issue is. If you are going to have ASSERTs enabled you must use the flash storage backend introduced by the commit I linked.

    I cannot use RAM for saving logs or the coredump, since the devices in the field are configured with logs disabled and I need

    My suggestion is to not use the CD functionality at all but rather store relevant information to RAM from the k_sys_fatal_error_handler(). What you do with the RAM content on subsequent reboot is up to you. You can store it to flash, transfer it over BLE, etc.

  • store relevant information to RAM from the k_sys_fatal_error_handler(). What you do with the RAM content on subsequent reboot is up to you. You can store it to flash, transfer it over BLE, etc.

    I think there is something fundamental I am missing here. As far as I know, whatever I save in RAM, in a variable or otherwise, is gone after a reset, right? So if I do not use logs and can only get information about the crash from the device via OTA after it resets back to normal, how does saving things to RAM help me?

  • Hey Ziv.

    There is a special no_init area of the RAM which persists through a hot (soft) reset. Hot reset = reset command to the MCU; cold reset = complete power cycle. There is a CONFIG_ option for the system to do a hot reset in case of stack overflows/hard faults, so the no_init area of the RAM is kept.

  • Please read my previous comments where I try to explain what the issue is

    I read the issue here https://github.com/zephyrproject-rtos/zephyr/issues/59116 and the solution here https://github.com/nrfconnect/sdk-nrf/pull/21418 (which I think I may try to integrate into my NCS v2.8.0 until we update to NCS v3.1.1).

    What I still don't understand is the order of things:

    If I disable asserts, I see that I enter z_arm_fault() and after that z_fatal_error(), which calls coredump() and then k_sys_fatal_error_handler(). So how can it be overridden if, when I enable asserts, I don't get into z_arm_fault() or z_fatal_error() at all?

    If I enable asserts and try to use coredump just for logs (CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING or CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_LINKER_RAM) and not for flash, I just don't see anything. Does the assertion trigger some hardware mechanism before the Zephyr APIs can even act? If so, why do I get into z_arm_fault() and z_fatal_error() on an assert when coredump is not configured and asserts are enabled?

    Plus, why, when asserts are enabled, don't I get to the Zephyr APIs even if I crash it with k_panic(), for example?

    Sorry for being a nag about this, but I am really trying to understand the order in which things happen when a crash occurs.

    P.S. Zephyr and NCS also make extensive use of asserts internally. What happens to those asserts when asserts are disabled: are the checks just skipped? Is it valid practice to have asserts enabled just for "on the table" development and ship a build with no asserts for deployed devices?

    I tried to override the fatal error handler with my own implementation:

    void k_sys_fatal_error_handler(unsigned int reason, const struct arch_esf *esf)
    {
        LOG_ERR(">>> HardFault trapped in app override!");

        /* Option 1: loop forever (easy for breakpoints) */
        while (1) {
            __NOP();
        }

        /* Option 2: chain to Zephyr's internal handler */
        // extern void z_arm_fault(uint32_t reason, const z_arch_esf_t *esf);
        // z_arm_fault(K_ERR_CPU_EXCEPTION, NULL);
    }

    But I don't see that it actually overrides anything: I don't see its prints, and I do see the z_fatal_error() prints.

    There is a special no_init area of the RAM

    Thanks, but our application is already using most of the available RAM, so giving some of it up just for saving crashes is not a viable option.

    Best regards

    Ziv

  • Well, what Vidar suggests is NOT using Coredump with RAM, but rather using some variables declared in the no_init area to save the crash log data and then checking them when you start up again.

    Thanks, but our application is already using most of the available RAM, so giving some of it up just for saving crashes is not a viable option.

    I get it. We're also REALLY tight on RAM. We haven't hit the limit yet, but I expect we will in the next few months. But that's also the point! Enabling and using the Coredump feature costs you WAAAAY more ROM and RAM than simply following his suggestion, which takes up ~70-80 bytes.

    Note: this whole "RAM implementation" is based on the two variables that he mentioned in that two-year-old ticket:

    __noinit static z_arch_esf_t esf;
    __noinit static uint32_t esf_crc;

    I tried checking out "z_arch_esf_t" to see its size so I can get a rough estimate of the amount of RAM used. This led me down a small rabbit hole, which took me to: https://docs.zephyrproject.org/latest/releases/migration-guide-3.7.html

    Where we find this phrase:

    Migration guide to Zephyr v3.7.0
    This document describes the changes required when migrating your application from Zephyr v3.6.0 to Zephyr v3.7.0.
    
    ...
    
    Kernel
    All architectures are now required to define the new struct arch_esf, which describes the members of a stack frame. This new struct replaces the named struct z_arch_esf_t. (GitHub #73593)
    
    The named struct z_arch_esf_t is now deprecated. Use struct arch_esf instead. (GitHub #73593)
    
    The header file include/zephyr/arch/arch_interface.h has been moved from include/zephyr/sys/ into include/zephyr/arch/. Out-of-tree source files will need to update the include path. (GitHub #64987)

    So those previous two lines turn into:

    __noinit static struct arch_esf esf;
    __noinit static uint32_t esf_crc;

    Keep in mind that we have NCS version 3.0.0, in which Zephyr has version v4.0.99-a0e545cb437a. You mentioned you're using v2.8.0 and I'm not sure which Zephyr version comes with that version of the NCS. So you gotta check if your Zephyr is >= v3.7.0 for the above to apply.
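
    To make the idea concrete, here is a minimal host-side sketch of the "frame plus CRC" pattern. It is not Vidar's code and not a Zephyr API: struct crash_frame, crash_record_save() and crash_record_take() are hypothetical names, and the frame is a stand-in that assumes only the eight basic Cortex-M stacked registers. On target the two statics would carry __noinit, as in the lines above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the basic Cortex-M exception frame (assumption:
 * no FPU context or extra exception info). */
struct crash_frame {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

/* On target these two would be declared with __noinit so that they
 * survive a warm reset; plain statics are used here for a host build. */
static struct crash_frame saved_frame;
static uint32_t saved_frame_crc;

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32_calc(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Intended to be called from the fatal error handler, before rebooting. */
void crash_record_save(const struct crash_frame *frame)
{
    saved_frame = *frame;
    saved_frame_crc = crc32_calc(&saved_frame, sizeof(saved_frame));
}

/* Intended to be called early at boot. Returns true and copies the frame
 * out only if the CRC matches, then invalidates the record so that it is
 * reported once. */
bool crash_record_take(struct crash_frame *out)
{
    if (crc32_calc(&saved_frame, sizeof(saved_frame)) != saved_frame_crc) {
        return false; /* cold boot or corrupted RAM: nothing to report */
    }
    *out = saved_frame;
    saved_frame_crc = ~saved_frame_crc; /* break the match on purpose */
    return true;
}
```

    With this eight-word frame the record costs 36 bytes of RAM (32 for the frame, 4 for the CRC). On target, crash_record_save() would presumably be called from your own k_sys_fatal_error_handler() right before a warm reboot, and crash_record_take() early in main(), where you can store the result to flash or send it out. Zephyr's crc32_ieee() from <zephyr/sys/crc.h> could replace the local CRC routine.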

    I doubt Coredump beats this, given that the serial CLI logs it generates fill up a file with ~35 kB of text.
