Having issues saving a coredump to flash, or at all

ziv123 9 months ago

Hi Nordic

I am working with the nRF52840 and nRF52832 using NCS v2.8.0.

I am trying to save a coredump to flash according to the instructions at this link: https://docs.nordicsemi.com/bundle/ncs-2.8.0/page/zephyr/services/debugging/coredump.html

I added this to my pm_static_my_board.yml

coredump_partition:
  address: 0xCF000
  size: 0x8000
  region: flash_primary
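
A quick sanity check on these numbers may be worth doing, since the same static partition file is being used for two chips with different flash sizes. The following is only a sketch: `partition_fits` is a hypothetical helper, and the flash sizes are the nominal product-specification values (1 MB for the nRF52840, 512 KB for the nRF52832 QFAA/CIAA variants), not values read from the build.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Nominal device constants (assumed from the product specifications). */
#define NRF52_FLASH_PAGE_SIZE 0x1000u    /* 4 KB erase page */
#define NRF52840_FLASH_SIZE   0x100000u  /* 1 MB */
#define NRF52832_FLASH_SIZE   0x80000u   /* 512 KB (QFAA/CIAA variants) */

/* Values copied from the coredump_partition entry above. */
#define COREDUMP_ADDRESS 0xCF000u
#define COREDUMP_SIZE    0x8000u

/* The partition should start and end on a flash page boundary. */
static_assert(COREDUMP_ADDRESS % NRF52_FLASH_PAGE_SIZE == 0,
              "partition start is not page aligned");
static_assert(COREDUMP_SIZE % NRF52_FLASH_PAGE_SIZE == 0,
              "partition size is not a whole number of pages");

/* Hypothetical helper: true if [address, address + size) lies inside flash. */
bool partition_fits(uint32_t address, uint32_t size, uint32_t flash_size)
{
    return size != 0u && address < flash_size && size <= flash_size - address;
}
```

With these values, `partition_fits(COREDUMP_ADDRESS, COREDUMP_SIZE, NRF52840_FLASH_SIZE)` is true, while the same call with `NRF52832_FLASH_SIZE` is false, because 0xCF000 lies beyond 0x80000. On the nRF52832 the partition would therefore need a different address.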

And this to my_board.overlay

&flash0 {
    /*
     * For more information, see:
     * http://docs.zephyrproject.org/latest/guides/dts/index.html#flash-partitions
     */
    partitions {
        compatible = "fixed-partitions";
        #address-cells = <1>;
        #size-cells = <1>;

      ...
        /* Not a valid address (end of flash), but it is not taken into account because pm_static is used instead. */
        coredump_partition: partition@80000 {
            label = "coredump-partition";
            reg = <0x000080000 DT_SIZE_K(4)>;
        };
    };
};

A side note: it seems strange that I need to define the partition in the overlay, which is basically ignored, because the pm_static partitions are the ones that actually matter (unless I got something wrong?).

And these configs to my prj.conf:

# Coredump 
CONFIG_DEBUG_COREDUMP=y
CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y
CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_THREADS=y

In my my_board/my_app/zephyr/.config I see these coredump-related configs:

CONFIG_ARCH_SUPPORTS_COREDUMP=y
CONFIG_ARCH_SUPPORTS_COREDUMP_THREADS=y

# CONFIG_COREDUMP_DEVICE is not set

CONFIG_DEBUG_THREAD_INFO=y
CONFIG_DEBUG_COREDUMP=y
# CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING is not set
CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y
# CONFIG_DEBUG_COREDUMP_BACKEND_OTHER is not set
# CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_MIN is not set
CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_THREADS=y
# CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_LINKER_RAM is not set
CONFIG_DEBUG_COREDUMP_FLASH_CHUNK_SIZE=64
CONFIG_DEBUG_COREDUMP_THREADS_METADATA=y

I am generating a coredump using this implementation

void trigger_coredump(void)
{
    __ASSERT(0, "Forcing coredump");
}

When I try to read the flash area after generating the coredump with nrfjprog --memrd 0xCF000 --w 32 --n 0x8000, I get all 0xFF.

What am I missing?

I also tried to check myself by replacing CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y with CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING=y, hoping to see the coredump on my open RTT, but nothing appears. When the coredump is triggered, the prints just stop.

Why can't I find a coredump on the flash partition or in the RTT log?
Can it be that the device does not have time to write the coredump before the actual crash? If so, how can I manage that?
Is there some automatic deletion of the flash partition holding the coredump so that new coredumps can be saved, or is that something I have to manage myself after I read the coredump from flash?

Hope to read you soon

Best regards

Ziv

Top Replies

Vidar Berg 9 months ago in reply to Tudor B.

The "missing" 64k is allocated to the shared RAM region and is used for communication with the network core.

Tudor B. 9 months ago in reply to Vidar Berg
Thank you for the explanation.

Beyond the configuration:

&sram0 { reg = <0x20000000 (DT_SIZE_K(448) - 150)>; };

How would I set the variables in those 150 bytes that are reserved?

Vidar Berg 9 months ago in reply to Tudor B.

You can find the end address of RAM programmatically using symbols generated by the partition manager in pm_config.h, for example, PM_SRAM_PRIMARY_END_ADDRESS. Then use this as the start address when copying in the crash data.
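
As a rough sketch of that idea: the record layout, the magic value, and the `crash_record_store` / `crash_record_load` helpers below are all hypothetical and not part of NCS. The functions take a base pointer so they can be tried on a host buffer; on target, the base would be `(void *)PM_SRAM_PRIMARY_END_ADDRESS` from pm_config.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CRASH_REGION_SIZE 150u        /* bytes carved off the end of sram0 */
#define CRASH_MAGIC       0xC0DEDEADu /* hypothetical marker for a valid record */

/* Hypothetical record layout stored at the start of the reserved region. */
struct crash_record {
    uint32_t magic;
    uint32_t reason;
    uint32_t pc;
};

static_assert(sizeof(struct crash_record) <= CRASH_REGION_SIZE,
              "crash record does not fit in the reserved region");

/* Write a record at 'base' (the start of the reserved region). */
void crash_record_store(void *base, uint32_t reason, uint32_t pc)
{
    struct crash_record rec = { .magic = CRASH_MAGIC, .reason = reason, .pc = pc };

    memcpy(base, &rec, sizeof(rec));
}

/* Copy the record out of 'base'; returns true only if the magic matches. */
bool crash_record_load(const void *base, struct crash_record *out)
{
    memcpy(out, base, sizeof(*out));
    return out->magic == CRASH_MAGIC;
}
```

The magic value is there so that uninitialized RAM is not mistaken for a record. Whether the contents actually survive a reset depends on the RAM being retained and not cleared at startup, which is worth verifying on the target.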

ziv123 8 months ago in reply to Vidar Berg
Hi Vidar

Vidar Berg said:
You need to use LOG_PANIC() to be able to flush the log buffer from the fault handler.

Thanks for the LOG_PANIC() tip. It helped me follow what goes on in a crash scenario, but it raises these questions:

1. According to the log below, it seems that z_arm_fault() precedes z_fatal_error(), which calls coredump() and only after that calls k_sys_fatal_error_handler(). So I don't understand: if Memfault's implementation overrides this function to handle the crash data collection, it runs after coredump(), so why does it not get stuck even when asserts are enabled? Is it related to CONFIG_DEBUG_COREDUMP_BACKEND_OTHER, though I don't see it used in coredump_backend_flash_partition.c? Or, if I want to reach that handler, do I need to set CONFIG_DEBUG_COREDUMP=n, so that only then I get to the handler before the system halts and can call the coredump save from it?

2. When working with asserts enabled, according to the log at the crash I am entering z_arm_fault() twice. Any idea why?

These are the logs:

With asserts enabled, crash triggered by a bus fault:
00> [00000006] <inf> AUGU_PUSH_B[00000007] <inf> os: In z_arm_fault
00> [00000007] <err> os: ***** BUS FAULT *****
00> [00000007] <err> os: Imprecise data bus error ....
00> [00000007] <err> os: Faulting instruction address (r15/pc): 0x000xxxx
00> [00000008] <err> os: in z_fatal_error
00> [00000008] <err> os: >>> ZEPHYR FATAL ERROR 26: Unknown error on CPU 0
00> [00000008] <err> os: Current thread: 0xFFFFFFFF (augu_work_q)
00> ASSERTION FAIL [((arch_is_in_isr() == 0) || ((timeout).ticks == (((k_timeout_t) {0})).ticks))] @ WEST_TOPDIR/zephyr/kernel/sem.c:136
00>
00> [00000008] <inf> os: In z_arm_fault
00> [00000008] <err> os: ***** HARD FAULT *****
00> [00000008] <err> os: Fault escalation (see below)
00> [00000008] <err> os: ARCH_EXCEPT with reason 4
00>
00> [00000008] <err> os: ....
00> [00000008] <err> os: Faulting instruction address (r15/pc): 0x000XXXX
00> [00000008] <err> os: in z_fatal_error
00> [00000008] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
00> [00000008] <err> os: Fault during interrupt handling
00>
00> [00000008] <err> os: Current thread: 0xFFFFFFFF (augu_work_q)
00> ASSERTION FAIL [((arch_is_in_isr() == 0) || ((timeout).ticks == (((k_timeout_t) {0})).ticks))] @ WEST_TOPDIR/zephyr/kernel/sem.c:136

3. When using coredump_backend_nrf_flash_partition.c (I added some prints to it), I see that in the middle of the coredump work I get the print for the button push event (I put the crash at the start of the button push handler). This seems strange to me. Any idea why it suddenly pops up?

4. When I use the default coredump flow (coredump_backend_flash_partition.c) with CONFIG_ASSERT=n and try to crash with k_panic() or k_oops(), I still fail to save the coredump because of a timeout error at stream_flash_init(..). So it seems the default coredump is only supported for BUS FAULT errors (ZEPHYR FATAL ERROR 26: Unknown error on CPU). This is an issue: when an end device in the field gets stuck in some scenario, I want to reset it immediately, and not lose the device or waste energy and time on a watchdog feed mechanism. Is there a built-in solution for handling dead-end scenarios on the end device? (Obviously I want to know that the device failed, what happened before, and how it got to this point, which is why I need the coredump. Usually those scenarios are a device failing to bond or some indication of memory corruption.)

5. I still haven't figured out why data isn't saved to flash properly when I use coredump_backend_nrf_flash_partition.c. With prints added to it I do see valid coredump values, but for some reason they do not seem to be written to the flash partition properly. I will update on that. (I also moved the file into my app instead of using it from within NCS; this way I can keep NCS clean, and it is easier if I need to modify the module.)

hope to read you soon

best regards

Ziv

Vidar Berg 8 months ago in reply to ziv123

ziv123 said:
1. When I try to implement my k_sys_fatal_error_handler() in my main, it does not override anything; I still get to the Zephyr error handling and my error handler is never called. What am I missing?

Please post a code snippet showing how you have implemented this function. You can also check the zephyr.map file for your build to confirm if you were able to redefine the function.

ziv123 said:
2. When working with asserts enabled, according to the log at the crash I am entering z_arm_fault() twice. Any idea why?

As mentioned earlier, the coredump flash backend uses semaphores (with timeouts), which will raise an assert when invoked from an interrupt context.
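
The assertion text printed in the log (sem.c:136) can be read as a simple predicate. The snippet below is a host-side model for illustration only, not Zephyr code: `sem_take_permitted` is a hypothetical name, and timeouts are modeled as plain tick counts, with 0 standing in for the `((k_timeout_t) {0})` value shown in the log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors ((k_timeout_t) {0}).ticks from the assertion text in the log. */
#define NO_WAIT_TICKS 0

/*
 * Hypothetical model of the asserted condition: taking a semaphore is
 * accepted outside an ISR with any timeout, and inside an ISR only when
 * the timeout is "no wait".
 */
bool sem_take_permitted(bool in_isr, int64_t timeout_ticks)
{
    return !in_isr || timeout_ticks == NO_WAIT_TICKS;
}
```

A call from the fault handler (interrupt context) with a non-zero timeout gives `sem_take_permitted(true, 100) == false`. With asserts enabled, that would fire the assert inside the first fault, which is consistent with the second z_arm_fault entry and the "ARCH_EXCEPT with reason 4" line in the log.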

ziv123 said:
3. When using coredump_backend_nrf_flash_partition.c (I added some prints to it), I see that in the middle of the coredump work I get the print for the button push event (I put the crash at the start of the button push handler). This seems strange to me. Any idea why it suddenly pops up?

I expect all but zero latency interrupts to be masked at this point.