Having issues saving a coredump to flash, or at all

ziv123 9 months ago

Hi Nordic

I am working with the nRF52840 and nRF52832 using NCS v2.8.0.

I am trying to save a coredump to flash according to the instructions at this link: https://docs.nordicsemi.com/bundle/ncs-2.8.0/page/zephyr/services/debugging/coredump.html

I added this to my pm_static_my_board.yml:

coredump_partition:
  address: 0xCF000
  size: 0x8000
  region: flash_primary

And this to my_board.overlay

&flash0 {
    /*
     * For more information, see:
     * http://docs.zephyrproject.org/latest/guides/dts/index.html#flash-partitions
     */
    partitions {
        compatible = "fixed-partitions";
        #address-cells = <1>;
        #size-cells = <1>;

      ...
        coredump_partition: partition@80000 { // THIS IS NOT A LEGITIMATE ADDRESS (END OF FLASH), BUT IT IS NOT TAKEN INTO ACCOUNT BECAUSE PM_STATIC IS
            label = "coredump-partition";
            reg = <0x00080000 DT_SIZE_K(4)>;
        };
    };
};

A side note: it seems strange that I need to set this in the overlay at all, since the overlay partition is basically ignored and the pm_static partitions are the ones that actually matter (unless I got something wrong?).
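As a rough cross-check of the two partition definitions above, a small host-side helper can test whether a region is page-aligned and lies inside the flash. This is only a sketch: `partition_fits` is a hypothetical name, and the flash sizes (1 MB for the nRF52840, 512 KB for the nRF52832 QFAA variant) and the 4 KB page size are assumptions that should be verified against the product specification of the exact part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flash geometry; check the product specification of your part. */
#define NRF52_FLASH_PAGE_SIZE 0x1000u   /* 4 KB erase page */
#define NRF52840_FLASH_SIZE   0x100000u /* 1 MB */
#define NRF52832_FLASH_SIZE   0x80000u  /* 512 KB (QFAA variant) */

/* Hypothetical helper: true if [addr, addr + size) is non-empty,
 * page-aligned, and lies entirely inside a flash of flash_size bytes. */
static bool partition_fits(uint32_t addr, uint32_t size, uint32_t flash_size)
{
    if (size == 0u || size > flash_size) {
        return false;
    }
    if ((addr % NRF52_FLASH_PAGE_SIZE) != 0u ||
        (size % NRF52_FLASH_PAGE_SIZE) != 0u) {
        return false;
    }
    /* Compared this way so that addr + size cannot overflow. */
    return addr <= flash_size - size;
}
```

With these assumed sizes, `partition_fits(0xCF000, 0x8000, NRF52840_FLASH_SIZE)` is true, while the same call with `NRF52832_FLASH_SIZE` is false, so on a 512 KB part the pm_static entry would lie past the end of flash.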

And these configs to my prj.conf:

# Coredump 
CONFIG_DEBUG_COREDUMP=y
CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y
CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_THREADS=y

In my my_board/my_app/zephyr/.config I see these coredump-related configs:

CONFIG_ARCH_SUPPORTS_COREDUMP=y
CONFIG_ARCH_SUPPORTS_COREDUMP_THREADS=y

# CONFIG_COREDUMP_DEVICE is not set

CONFIG_DEBUG_THREAD_INFO=y
CONFIG_DEBUG_COREDUMP=y
# CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING is not set
CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y
# CONFIG_DEBUG_COREDUMP_BACKEND_OTHER is not set
# CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_MIN is not set
CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_THREADS=y
# CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_LINKER_RAM is not set
CONFIG_DEBUG_COREDUMP_FLASH_CHUNK_SIZE=64
CONFIG_DEBUG_COREDUMP_THREADS_METADATA=y

I am generating a coredump using this implementation

void trigger_coredump(void)
{
    __ASSERT(0, "Forcing coredump");
}

When I try to read the flash area after generating the coredump with nrfjprog --memrd 0xCF000 --w 32 --n 0x8000, I get all 0xFF.

What am I missing?

I also tried to check myself by replacing CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y

with CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING=y

hoping to see the coredump on my open RTT, but nothing: when the coredump is triggered, the prints just stop.

What am I missing? Why can't I find a coredump on the flash partition or in the RTT log?
Can it be that the device does not have time to write the coredump before the actual crash? If so, how can I manage that?
Is there some automatic deletion of the coredump in the flash partition so that new coredumps can be saved, or is it something I have to manage myself after I read the coredump from flash?

Hope to hear from you soon

Best regards

Ziv

runsiv 9 months ago

Hi

I will look into your case. Just a quick question to start with: are you also using MCUboot?

Regards

Runar

ziv123 9 months ago in reply to runsiv

Any news on this?

Vidar Berg 9 months ago in reply to Tudor B.

The r14/lr register will tell you where the call was made from. But yes, there might be cases where even a minimal core dump (CD) will provide relevant debug information that's not shown in the crash log.
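To make the point about r14/lr concrete, here is a minimal sketch of how a logged `lr` value could be turned into an address that a symbolizer can resolve. The function names are hypothetical, and it assumes an ARMv7-M core such as the Cortex-M4 in the nRF52 series, where code addresses held in `lr` have the Thumb bit (bit 0) set and EXC_RETURN codes have 0xFF in the top byte.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the value is an EXC_RETURN code rather than a code address
 * (assumes ARMv7-M, where the top byte of EXC_RETURN is 0xFF). */
static bool lr_is_exc_return(uint32_t lr)
{
    return (lr & 0xFF000000u) == 0xFF000000u;
}

/* Clear the Thumb bit to get the return address, i.e. the address of
 * the instruction right after the call that set lr. */
static uint32_t lr_return_address(uint32_t lr)
{
    return lr & ~UINT32_C(1);
}
```

For example, a logged `lr` of 0x0002A4C1 (a made-up value) maps to 0x0002A4C0, which could then be passed to `addr2line` together with the application's ELF file to find the calling function.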

Tudor B. 9 months ago in reply to Vidar Berg

Thank you for the quick reply, Vidar.

Well, given your answer, could you provide us with a minimal sample project with the necessary changes to be able to save the CD to the internal/external flash? For me, external is preferred, since we'll also want to save the reset reason and have it all in one place, and I saw that the logging CD generates ~34 kB of data. Dunno what Ziv prefers.

Vidar Berg 9 months ago in reply to Tudor B.

As mentioned earlier, writing to external flash requires a custom bare metal SPI flash driver that does not rely on any Zephyr primitives and can operate from the hardfault interrupt context. For internal flash, we already have a driver, see the PR linked earlier.

Tudor B. said:
For me, external is preferred since we'll also want to save the reset reason and have it all in one place

I'm not sure what you have in mind here. How would you include the reset reason in the CD?

ziv123 9 months ago in reply to Tudor B.

I suggested this to you before. We use it in main when the app restarts:

uint32_t reset_reason = nrf_power_resetreas_get(NRF_POWER);
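The raw value returned above is a bit mask, so it usually needs decoding before it is logged or sent out. Below is a minimal host-testable sketch of such a decoder. The helper name `resetreas_to_str` is hypothetical, and the bit positions are assumptions taken from the nRF52 RESETREAS register description, to be verified against the product specification of the exact device (device-specific bits, such as VBUS on the nRF52840, are not listed and are ignored).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed RESETREAS bit positions (nRF52 series); verify for your part. */
struct resetreas_bit {
    uint32_t mask;
    const char *name;
};

static const struct resetreas_bit resetreas_bits[] = {
    { 1u << 0,  "RESETPIN" }, /* reset pin */
    { 1u << 1,  "DOG" },      /* watchdog */
    { 1u << 2,  "SREQ" },     /* software reset request */
    { 1u << 3,  "LOCKUP" },   /* CPU lockup */
    { 1u << 16, "OFF" },      /* wake from System OFF via GPIO */
    { 1u << 17, "LPCOMP" },
    { 1u << 18, "DIF" },      /* debug interface */
    { 1u << 19, "NFC" },
};

/* Write a space-separated list of the known set flags into buf.
 * A value of 0 is reported as "POWER_ON_OR_BROWNOUT". Returns buf. */
static char *resetreas_to_str(uint32_t reas, char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0u) {
        return buf;
    }
    buf[0] = '\0';
    if (reas == 0u) {
        snprintf(buf, len, "POWER_ON_OR_BROWNOUT");
        return buf;
    }
    for (size_t i = 0; i < sizeof(resetreas_bits) / sizeof(resetreas_bits[0]); i++) {
        if ((reas & resetreas_bits[i].mask) == 0u || used >= len) {
            continue;
        }
        int n = snprintf(buf + used, len - used, "%s%s",
                         used > 0u ? " " : "", resetreas_bits[i].name);
        if (n > 0) {
            used += (size_t)n;
        }
    }
    return buf;
}
```

The register is cumulative, so flags from earlier resets stay set until they are cleared. Clearing it after reading (for example with `nrf_power_resetreas_clear(NRF_POWER, reset_reason)` from the same HAL) keeps the next boot's value unambiguous.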

ziv123 9 months ago in reply to Vidar Berg

Hi Vidar

Vidar Berg said:
What is relevant here is whether you have CONFIG_ASSERT enabled in your build or not

We use asserts a lot in our code, and Zephyr also uses them internally, so this is why it is weird to me and also why I try to avoid it. Plus, I don't seem to be getting as far as writing the CD to flash at the moment. Anyway, I'll try to run with it disabled and see.

I cannot use RAM for saving logs or the CD, since the devices in the field are configured with logs disabled and I need to know what happened there in retrospect, after they have reset and reconnected with a node.

I don't mind using the internal flash or the external (external would obviously be more comfortable, but internal will work for now as well).

Vidar Berg said:
You can simply redefine the weakly defined k_sys_fatal_error_handler() function in your application

The API seems to take a fatal error reason and an exception context. Are those filled in automatically?

Just to be sure: if I override this API, then I do not have to set CONFIG_ASSERT=n?

P.S. This API seems to be called from inside z_fatal_error, which I don't think I am getting into, since I don't see its prints and a breakpoint at the beginning of it did not stop there when debugging.

Vidar Berg 9 months ago in reply to ziv123

ziv123 said:
we use assert a lot in our code and also zephyr uses it internally so this is why it is weird for me and also why i try to avoid it plus i don't seem to be getting to

Please read my previous comments where I try to explain what the issue is. If you are going to have ASSERTs enabled you must use the flash storage backend introduced by the commit I linked.

ziv123 said:
i can not use RAM for saving logs or CD since the devices in the field are configured with logs disabled and i need

My suggestion is to not use the CD functionality at all but rather store relevant information to RAM from the k_sys_fatal_error_handler(). What you do with the RAM content on subsequent reboot is up to you. You can store it to flash, transfer it over BLE, etc.
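A minimal sketch of this approach, written so that it can be compiled on a host, might look as follows. All names are hypothetical, and the sketch assumes the record lives in RAM that survives a soft reset (in a Zephyr build, a variable marked `__noinit`). A magic word marks a stored record, and reading the record invalidates it so that the same crash is not reported twice.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC 0xC0DEDEADu /* arbitrary marker value (hypothetical) */

struct crash_record {
    uint32_t magic;  /* CRASH_MAGIC while a record is stored */
    uint32_t reason; /* 'reason' argument of the fatal error handler */
    uint32_t pc;     /* program counter at the fault */
    uint32_t lr;     /* link register: where the call came from */
};

/* In a Zephyr build this would be marked __noinit so that startup code
 * does not zero it; here it is a plain static for host testing. */
static struct crash_record crash_rec;

/* Intended to be called from the fatal error handler: plain RAM writes,
 * no locks and no drivers. */
static void crash_record_save(uint32_t reason, uint32_t pc, uint32_t lr)
{
    crash_rec.reason = reason;
    crash_rec.pc = pc;
    crash_rec.lr = lr;
    crash_rec.magic = CRASH_MAGIC;
}

/* Intended to be called early after boot. Returns true once per saved
 * crash, copies the record to 'out', and invalidates the stored copy. */
static bool crash_record_take(struct crash_record *out)
{
    if (crash_rec.magic != CRASH_MAGIC) {
        return false;
    }
    memcpy(out, &crash_rec, sizeof(*out));
    crash_rec.magic = 0u;
    return true;
}
```

After a reboot, the application would call `crash_record_take()` once and, if it returns true, write the 16-byte record to flash or send it over BLE from normal thread context.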

ziv123 9 months ago in reply to Vidar Berg

Vidar Berg said:
store relevant information to RAM from the k_sys_fatal_error_handler(). What you do with the RAM content on subsequent reboot is up to you. You can store it to flash, transfer it over BLE, etc.

I think there is something fundamental I am missing here. As far as I know, whatever I save in RAM to some variable is gone after reset, right? So, if I do not use logs and can only get info on the crash from the device via OTA after it resets back to normal, then how does saving things to RAM help me?

Tudor B. 9 months ago in reply to ziv123

Hey Ziv.

There is a special no_init area of the RAM which is persistent through a hot (soft) reset. Hot reset = reset command to the MCU; cold reset = complete power cycle. There is a CONFIG_ option for the system to do a hot reset in case of stack overflows or hard faults, so the no_init area of the RAM is kept.

ziv123 9 months ago in reply to Vidar Berg

Vidar Berg said:
Please read my previous comments where I try to explain what the issue is

I read the issue here https://github.com/zephyrproject-rtos/zephyr/issues/59116 and the solution here https://github.com/nrfconnect/sdk-nrf/pull/21418 (which I think I may try to integrate into my NCS v2.8.0 until we update to NCS v3.1.1).

What I still don't understand is the order of things:

If I disable asserts, I see that I enter 'z_arm_fault()' and after that 'z_fatal_error()', which internally calls coredump() and then k_sys_fatal_error_handler(). So how can it be overridden if, when I enable asserts, I am not getting into z_arm_fault() or z_fatal_error()?

If I enable asserts and try to use coredump even just for logs (CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING or CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_LINKER_RAM) and not for flash, I just don't see anything. Does the assertion trigger some hardware before the Zephyr APIs can even act? If so, why do I get into 'z_arm_fault()' and 'z_fatal_error()' when coredump is not configured, I hit an assert, and asserts are enabled?

Plus, why don't I get to the Zephyr APIs when asserts are enabled, even if I crash it with k_panic(), for example?

Sorry for being a nag about this, but I am really trying to understand the order in which things happen when a crash occurs.

P.S. Inside Zephyr and NCS there is also extensive use of asserts. What happens in those asserts when asserts are disabled: are the checks just skipped? Is it valid practice to have asserts enabled just for "on table" development and have a build with no asserts for deployed devices?

I tried to override the crash handling with my own implementation of this:

void k_sys_fatal_error_handler(unsigned int reason, const struct arch_esf *esf)
{
    LOG_ERR(">>> HardFault trapped in app override!\n");

    /* Option 1: loop forever (easy for breakpoints) */
    while (1) {
        __NOP();
    }

    /* Option 2: chain to Zephyr’s internal handler */
    // extern void z_arm_fault(uint32_t reason, const z_arch_esf_t *esf);
    // z_arm_fault(K_ERR_CPU_EXCEPTION, NULL);
}

But I did not see that it actually overrides anything: I don't see its prints, and I do see the z_fatal_error prints.

Tudor B. said:
There is a special no_init area of the RAM

Thanks, but our application is already using most of the available RAM, so giving some of it up just for saving crashes is not a valid option.

Best regards

Ziv

Tudor B. 9 months ago in reply to ziv123

Well, what Vidar suggests is NOT using Coredump with RAM, but rather using some variables declared in the no_init area to save the crash log data in them and then to check them when you start up again.

ziv123 said:
thanks, but our application is already using most of available RAM so giving some of it up just for saving crashes is not a valid option

I get it. We're also REALLY tight on RAM. We didn't hit the limit yet, but I expect we will in the next few months/ soon. But that's also the point! Enabling and using the Coredump feature costs you WAAAAY more ROM and RAM than simply doing his suggestion, which takes up ~70-80 bytes.

Note: this whole "RAM implementation" is based on the two variables that he mentioned in that 2 year old ticket:

__noinit static z_arch_esf_t esf;
__noinit static uint32_t esf_crc;

I tried checking out "z_arch_esf_t" to see its size so I can get a rough estimate of the amount of RAM used. This led me down a small rabbit hole, which took me to: https://docs.zephyrproject.org/latest/releases/migration-guide-3.7.html

Where we find this phrase:

Migration guide to Zephyr v3.7.0

This document describes the changes required when migrating your application from Zephyr v3.6.0 to Zephyr v3.7.0.

...

Kernel

All architectures are now required to define the new struct arch_esf, which describes the members of a stack frame. This new struct replaces the named struct z_arch_esf_t. (GitHub #73593)
The named struct z_arch_esf_t is now deprecated. Use struct arch_esf instead. (GitHub #73593)
The header file include/zephyr/arch/arch_interface.h has been moved from include/zephyr/sys/ into include/zephyr/arch/. Out-of-tree source files will need to update the include path. (GitHub #64987)

So those previous two lines turn into:

__noinit static struct arch_esf esf;
__noinit static uint32_t esf_crc;

Keep in mind that we have NCS version 3.0.0, in which Zephyr has version v4.0.99-a0e545cb437a. You mentioned you're using v2.8.0, and I'm not sure which Zephyr version comes with that version of the NCS, so you'll have to check that your Zephyr is >= v3.7.0 for the above to apply.
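To give a feel for how the `esf` / `esf_crc` pair works and how much RAM it needs, here is a host-testable sketch. It is not the code from the ticket: `struct saved_frame` is a stand-in for the basic Cortex-M exception frame (eight 32-bit registers), the function names are hypothetical, and the real `struct arch_esf` may be larger depending on the Kconfig options in use. The CRC lets the boot code tell a saved frame from random RAM content after a cold start.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the basic Cortex-M exception frame (8 x 32 bits). */
struct saved_frame {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); small, not fast. */
static uint32_t crc32_calc(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Both would be marked __noinit in a Zephyr build. */
static struct saved_frame frame;
static uint32_t frame_crc;

/* Called from the fatal error handler: copy the frame and seal it. */
static void frame_save(const struct saved_frame *src)
{
    frame = *src;
    frame_crc = crc32_calc(&frame, sizeof(frame));
}

/* Called after boot: true only if the stored CRC matches the frame. */
static bool frame_is_valid(void)
{
    return frame_crc == crc32_calc(&frame, sizeof(frame));
}
```

In this sketch the frame takes 32 bytes and the CRC 4 more. In a Zephyr build, `crc32_ieee()` from `<zephyr/sys/crc.h>` could likely replace the hand-rolled CRC.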

I doubt Coredump beats this, given that the serial CLI logs it generates fill up a file with ~35 kB of text.