having issues with saving coredump to flash or at all

ziv123 8 months ago

Hi Nordic

I am working with nrf52840 and nrf52832 using ncs v2.8.0

I am trying to save coredump to flash according to instructions on this link - https://docs.nordicsemi.com/bundle/ncs-2.8.0/page/zephyr/services/debugging/coredump.html

I added this to my pm_static_my_board.yml

coredump_partition:
  address: 0xCF000
  size: 0x8000
  region: flash_primary

And this to my_board.overlay

&flash0 {
    /*
     * For more information, see:
     * http://docs.zephyrproject.org/latest/guides/dts/index.html#flash-partitions
     */
    partitions {
        compatible = "fixed-partitions";
        #address-cells = <1>;
        #size-cells = <1>;

      ...
        // This is not a legitimate address (end of flash), but it is not taken into account because pm_static takes precedence
        coredump_partition: partition@000080000 {
            label = "coredump-partition";
            reg = <0x000080000 DT_SIZE_K(4)>;
        };
    };
};

A side note: it seems strange that I need to set this in the overlay, which is basically ignored because the pm_static partitions are the ones that actually matter (unless I got something wrong?).

And these configs to my prj.conf

# Coredump 
CONFIG_DEBUG_COREDUMP=y
CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y
CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_THREADS=y

In my my_board/my_app/zephyr/.config I see these coredump-related configs

CONFIG_ARCH_SUPPORTS_COREDUMP=y
CONFIG_ARCH_SUPPORTS_COREDUMP_THREADS=y

# CONFIG_COREDUMP_DEVICE is not set

CONFIG_DEBUG_THREAD_INFO=y
CONFIG_DEBUG_COREDUMP=y
# CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING is not set
CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y
# CONFIG_DEBUG_COREDUMP_BACKEND_OTHER is not set
# CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_MIN is not set
CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_THREADS=y
# CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_LINKER_RAM is not set
CONFIG_DEBUG_COREDUMP_FLASH_CHUNK_SIZE=64
CONFIG_DEBUG_COREDUMP_THREADS_METADATA=y

I am generating a coredump using this implementation

void trigger_coredump(void)
{
    __ASSERT(0, "Forcing coredump");
}

When I try to read the flash area after generating the coredump with nrfjprog --memrd 0xCF000 --w 32 --n 0x8000
I get all 0xFF.

What am I missing?

I also tried to check myself by replacing CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y

With CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING=y

Hoping to see the coredump on my open RTT, but nothing. When the coredump is triggered, the prints just stop.

What am I missing? Why can't I find a coredump on the flash partition or in the rtt log ?
Can it be that the device does not have the time to write the coredump before the actual crash ? If so, how can I manage that ?
Is there some auto deletion of the flash partition with the coredump so new coredumps can be saved or is it something i have to manage myself after i read the coredump from flash ?

Hope to read you soon

Best regards

Ziv

runsiv 8 months ago

Hi

I will look into your case. Just a quick question to start with. Are you using MCUBOOT also?
Regards

Runar

ziv123 8 months ago in reply to runsiv

any news on that ?

Tudor B. 7 months ago in reply to ziv123

Hello Ziv.

I'm not a Nordic employee but am currently working on the same thing as you. I've got it working with the serial CLI and am currently struggling to get it working with internal or external flash.

Will keep you updated if I have any breakthroughs.

ziv123 7 months ago in reply to Tudor B.

thanks Tudor

p.s. do you know if there is a way to debug what is happening after an assert

because I have a branch that works with Memfault, and there I see all the relevant prints plus the write to flash. Maybe if I can debug the two routes I can find out what is missing in my branch.

Tudor B. 7 months ago in reply to ziv123

It's a bit hard since I assume a hard fault/ stack overflow blocks all further instructions from running. What I found from my experience is that you can inject your own message near your point of interest and see which branch it goes through, other adjacent branches and what conditions you need to trigger them, etc.

For example, given this usage fault:

**** Using Zephyr OS v4.0.99-a0e545cb437a ***
[00:00:00.297,698] <inf> flashdisk: Initialize device NAND
[00:00:00.297,729] <inf> flashdisk: offset 300000, sector size 512, page size 4096, volume size 4194304
[00:00:14.148,559] <err> os: ***** USAGE FAULT *****
[00:00:14.148,559] <err> os: Attempt to execute undefined instruction
[00:00:14.148,590] <err> os: r0/a1: 0x0bad0000 r1/a2: 0x00000000 r2/a3: 0x00000000
[00:00:14.148,590] <err> os: r3/a4: 0xffffffff r12/ip: 0x0004e4bb r14/lr: 0x0001b203
[00:00:14.148,620] <err> os: xpsr: 0x49100000
[00:00:14.148,620] <err> os: s[ 0]: 0x200099e4 s[ 1]: 0x00000000 s[ 2]: 0x00000009 s[ 3]: 0x00021cc7
[00:00:14.148,651] <err> os: s[ 4]: 0x00000001 s[ 5]: 0x00000030 s[ 6]: 0x0005ac30 s[ 7]: 0x0004dd57
[00:00:14.148,651] <err> os: s[ 8]: 0x00000000 s[ 9]: 0x200099e0 s[10]: 0x20009ae0 s[11]: 0x00001972
[00:00:14.148,681] <err> os: s[12]: 0x20005a94 s[13]: 0x0001df3b s[14]: 0x20005a94 s[15]: 0x00001000
[00:00:14.148,681] <err> os: fpscr: 0xffffffff
[00:00:14.148,681] <err> os: Faulting instruction address (r15/pc): 0x00017a88
[00:00:14.148,712] <err> os: >>> ZEPHYR FATAL ERROR 36: Unknown error on CPU 0
[00:00:14.148,742] <err> os: Current thread: 0x20002758 (mp_main)
[00:00:14.273,651] <err> os: Halting system

I wanted to see what triggers it, so I searched for " Attempt to execute undefined instruction" and found it here:

/opt/nordic/ncs/v3.0.0/zephyr/arch/arm/core/cortex_m/fault.c:550:

PR_FAULT_INFO(" Attempt to execute undefined instruction");

inside the function "static uint32_t usage_fault(const struct arch_esf *esf)".

It might seem very basic/ rudimentary, but it helped me overcome various hurdles when working with different Zephyr/ Nordic features.

==============================

Also worth mentioning is this post:
RE: Saving coredumps to external flash

Where they mention that:
"To get the ESF you can override Zephyr's fatal function and simply store the values in retained memory (as the above example shows).

void k_sys_fatal_error_handler(unsigned int reason, const z_arch_esf_t *esf_input)"

Which in my interpretation means that "k_sys_fatal_error_handler()" is the function that you're looking to debug.

==============================

I looked into Memfault conceptually, but it's a much bigger feature (from a ROM and RAM consumption perspective) than simply having the Coredump being saved to flash/ external flash.

I can provide you a working sample for printing it to the serial CLI, using:

CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING=y

Would that be of any use?

ziv123 7 months ago in reply to Tudor B.

Tudor B. said:
For example, given this usage fault:

This does not help in my case. First, I don't even get this built-in log on my current branch (by branch I mean a git branch off my main development branch, which currently works with Memfault). I am generating the assertion with __ASSERT(0, ..), so I know where it happens. I just want to confirm that I can save the coredump to flash, and currently it is not saved. As I mentioned, it does not even print the logs you showed, so unfortunately this is of no use at the moment.

Tudor B. said:
Also worth mentioning is this post:
RE: Saving coredumps to external flash

also tried what Vidar Berg did in his attempt and it did not work for me

and i am not sure why he put CONFIG_ASSERT=n .. i want to be able to catch those as well

runsiv 7 months ago in reply to ziv123

Hi Ziv, and sorry for the delay. From my testing it seems that I'm currently not able to catch asserts with coredump. Triggering another fatal error works like a charm, so I suspect there is something with the configuration of the error fault handler.

Regards

Runar

ziv123 7 months ago in reply to runsiv

hi Runsiv

runsiv said:
catch asserts with coredump

i wonder why there is no problem to save coredump with assertion when using memfault instead of only trying to capture coredump ???

also, tried to generate a crash another way with

void trigger_coredump(void)
{
    *(uint32_t *) 0xFFFFFFFF = 1;
    // __ASSERT(0, "Forcing coredump");
}

instead of using assert

same results

None of the logs that are usually built into Zephyr to tell you where the crash happened appear, and of course no coredump is saved to memory.

If there is any further data I can share to get some direction on this, please let me know.

I am kind of stuck on this feature, which is supposed to be a built-in, supported feature in both Nordic and Zephyr.

[ https://docs.nordicsemi.com/bundle/ncs-2.4.2/page/zephyr/services/debugging/coredump.html

docs.zephyrproject.org/.../coredump.html ]

hope to read you soon

best regards

Ziv

Vidar Berg 7 months ago in reply to ziv123

ziv123 said:
and i am not sure why he put CONFIG_ASSERT=n .. i want to be able to catch those as well

Just above this line in the config I included a link to the Zephyr issue explaining why asserts need to be disabled: https://github.com/zephyrproject-rtos/zephyr/issues/59116. The short answer is that flash drivers are using semaphores/mutexes that can't be acquired while in an ISR (which is true when you need to store a coredump after a fault exception) and if you have CONFIG_ASSERT enabled it will trigger an assert before the CD is stored to flash.
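
The interaction described above can be illustrated with a small host-side model. This is only a sketch under stated assumptions: `model_sem_take`, `model_store_chunk`, `model_fault_stores_dump`, and the flags are hypothetical stand-ins, not Zephyr APIs, and the real kernel and driver code is more involved. It shows a blocking lock request made from fault context tripping an assertion check before anything is written, and the same request going through when the check is compiled out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for this model; none of these are Zephyr APIs. */
static bool in_fault_context; /* true while the simulated fault handler runs */
static bool asserts_enabled;  /* models building with CONFIG_ASSERT=y */
static int assert_hits;       /* assertion failures raised inside the handler */

/* Models a lock request as a flash driver might issue before writing. */
static int model_sem_take(bool may_block)
{
    if (in_fault_context && may_block && asserts_enabled) {
        assert_hits++; /* a second fatal error, raised before any write */
        return -1;
    }
    return 0; /* check compiled out or not applicable; the lock is free here */
}

/* Models the coredump backend storing one chunk through that driver. */
static int model_store_chunk(void)
{
    if (model_sem_take(true) != 0) {
        return -1; /* nothing reached the flash */
    }
    return 0; /* chunk written */
}

/* Runs one simulated fault and reports whether the dump was stored. */
static bool model_fault_stores_dump(bool with_asserts)
{
    assert(!in_fault_context); /* the model does not nest faults */
    in_fault_context = true;
    asserts_enabled = with_asserts;
    assert_hits = 0;
    bool stored = (model_store_chunk() == 0);
    in_fault_context = false;
    return stored;
}
```

In this model, `model_fault_stores_dump(true)` reports that nothing was stored, which matches the symptom of an empty partition when asserts are enabled.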

If you look at the memfault implementation you will see that they are taking a different approach to this and instead of using the zephyr drivers use the NVMC nrfx HAL directly.

https://github.com/nrfconnect/sdk-nrf/blob/v3.0.1/modules/memfault-firmware-sdk/memfault_flash_coredump_storage.c#L15

You can also redefine the weakly declared error handler and store the data yourself. Have you considered just saving the CPU registers from the last stack frame like in the crash log message? Or do you feel you get more information from a coredump?
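
As a rough illustration of storing the data yourself, the sketch below keeps only the registers from the crash log in a small record guarded by a marker and a checksum. Everything here is an assumption made for the example: `struct fault_frame`, `struct crash_record`, `CRASH_MAGIC`, and the helper names are hypothetical, and placing the record in RAM that survives a reset, as well as calling `crash_record_save()` from the overridden error handler, is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC 0xC0DEDEADu /* arbitrary marker chosen for this sketch */

/* Hypothetical copy of the registers printed in the crash log. */
struct fault_frame {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

/* Hypothetical record; on target it would sit in RAM that survives a reset. */
struct crash_record {
    uint32_t magic;
    uint32_t reason;
    struct fault_frame frame;
    uint32_t checksum; /* sum of every 32-bit word before this field */
};

static_assert(sizeof(struct crash_record) == 44, "record stays at 44 bytes");

/* Adds up all 32-bit words that precede the checksum field. */
static uint32_t crash_checksum(const struct crash_record *rec)
{
    const unsigned char *bytes = (const unsigned char *)rec;
    uint32_t sum = 0;

    for (size_t off = 0; off < offsetof(struct crash_record, checksum);
         off += sizeof(uint32_t)) {
        uint32_t word;
        memcpy(&word, bytes + off, sizeof(word));
        sum += word;
    }
    return sum;
}

/* Meant to be called from a fatal error hook: plain stores only, no locks. */
static void crash_record_save(struct crash_record *rec, uint32_t reason,
                              const struct fault_frame *frame)
{
    rec->magic = CRASH_MAGIC;
    rec->reason = reason;
    rec->frame = *frame;
    rec->checksum = crash_checksum(rec);
}

/* True only if the marker and the checksum both match. */
static bool crash_record_valid(const struct crash_record *rec)
{
    return rec->magic == CRASH_MAGIC && rec->checksum == crash_checksum(rec);
}

/* Call after the record has been reported so it is not reported twice. */
static void crash_record_clear(struct crash_record *rec)
{
    memset(rec, 0, sizeof(*rec));
}
```

On the next boot the application would check `crash_record_valid()`, report the contents, and then call `crash_record_clear()`. The record is 44 bytes, compared with the several kilobytes a coredump partition needs.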

Tudor B. 7 months ago in reply to Vidar Berg

Hey Vidar.

Sorry if it seems I'm "hijacking" this thread, but actually I'm just piggybacking.

I am in exactly the same boat as Ziv, this being my current task: get the coredump working with external flash. I managed to get it working with the serial CLI log but that's pretty much where I got stuck.

Could you please provide both me and Ziv an archive with a working sample code that saves the Coredump to external flash? From the previous features which I had trouble with, I found this approach to be by far the fastest way of getting them to work.

Note: he seems to be on NCS v2.8.0 and I'm on NCS v3.0.0.

Vidar Berg 7 months ago in reply to Tudor B.

Hi,

To write to external flash, you would need a QSPI/SPI and flash driver tailored for the coredump that is capable of operating in the hardfault interrupt context, which we don’t have. But as I asked OP, what’s the motivation for storing a minimal coredump when you could instead store just the information that would normally be included in the crash logs? That would only require a few tens of bytes, not several kilobytes. Have you tested a minimal coredump using another backend to see what you actually get?

Tudor B. 7 months ago in reply to Vidar Berg

I'm not sure what's the difference between "a minimal coredump" and "information included in the crash logs"?

For my case in particular, we want the coredump since we want to see what causes stack overflows/ hard faults/ etc.:

Additionally, it'd be great to store the reset reason also and on top of that to have "custom reset reasons". I.e.: let's say I wanna do an OTA, the file is downloaded on the device and we're gonna reset to boot from the new image -> store as reset reason "successful OTA, reboot to new image"; or let's say a device is pushing LoRa data to a server but it didn't get any confirmation (ACK) of reception from the server/ gateway for the last few cycles, so the module decides to reset -> store as reset reason "failed several consecutive uplinks to server, reboot to reconnect". Does such a feature exist?

But mainly: coredump to external flash with stack trace for stack overflows/ hard faults, just like in the image above.
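
The custom reset reasons described above could be sketched roughly as follows. This is not an existing SDK feature; `enum reset_note`, `struct reset_note_slot`, `RESET_NOTE_MAGIC`, and the helper names are all hypothetical. It assumes the slot sits in memory that is not cleared across a reset, which is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RESET_NOTE_MAGIC 0x4E4F5445u /* ASCII "NOTE"; arbitrary marker for this sketch */

/* Application-defined reasons, taken from the two examples in the post. */
enum reset_note {
    RESET_NOTE_NONE = 0,
    RESET_NOTE_OTA_DONE,
    RESET_NOTE_UPLINK_FAILED,
};

/* Hypothetical slot; on target it must not be zeroed at startup. */
struct reset_note_slot {
    uint32_t magic;
    uint32_t magic_inv; /* inverted marker, guards against random RAM content */
    uint32_t note;
};

static_assert(sizeof(struct reset_note_slot) == 12, "slot is three words");

/* Call right before requesting the reset. */
static void reset_note_set(struct reset_note_slot *slot, enum reset_note note)
{
    slot->magic = RESET_NOTE_MAGIC;
    slot->magic_inv = ~RESET_NOTE_MAGIC;
    slot->note = (uint32_t)note;
}

/* Call once early after boot: returns the stored note and clears the slot. */
static enum reset_note reset_note_take(struct reset_note_slot *slot)
{
    enum reset_note note = RESET_NOTE_NONE;

    if (slot->magic == RESET_NOTE_MAGIC && slot->magic_inv == ~RESET_NOTE_MAGIC &&
        slot->note <= (uint32_t)RESET_NOTE_UPLINK_FAILED) {
        note = (enum reset_note)slot->note;
    }
    memset(slot, 0, sizeof(*slot));
    return note;
}

/* Human-readable text for logging or uploading. */
static const char *reset_note_str(enum reset_note note)
{
    switch (note) {
    case RESET_NOTE_OTA_DONE:
        return "successful OTA, reboot to new image";
    case RESET_NOTE_UPLINK_FAILED:
        return "failed several consecutive uplinks to server, reboot to reconnect";
    default:
        return "no note stored";
    }
}
```

Because `reset_note_take()` clears the slot, a later unexpected reset would read back as "no note stored" and so would not be confused with an intentional one.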