Having issues with saving a coredump to flash (or at all)

Hi Nordic

I am working with the nRF52840 and nRF52832 using NCS v2.8.0.

I am trying to save a coredump to flash according to the instructions at this link: https://docs.nordicsemi.com/bundle/ncs-2.8.0/page/zephyr/services/debugging/coredump.html

I added this to my pm_static_my_board.yml

coredump_partition:
  address: 0xCF000
  size: 0x8000
  region: flash_primary

And this to my_board.overlay

&flash0 {
    /*
     * For more information, see:
     * http://docs.zephyrproject.org/latest/guides/dts/index.html#flash-partitions
     */
    partitions {
        compatible = "fixed-partitions";
        #address-cells = <1>;
        #size-cells = <1>;

      ...
        coredump_partition: partition@000080000 { // not a valid address (end of flash), but it is ignored because the pm_static partition is what is actually used
            label = "coredump-partition";
            reg = <0x000080000 DT_SIZE_K(4)>;
        };
    };
};

A side note: I find it strange that I need to set this in the overlay, which is basically ignored, since the pm_static partition is the one that actually matters (unless I got something wrong?).

And these configs to my prj.conf:

# Coredump 
CONFIG_DEBUG_COREDUMP=y
CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y
CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_THREADS=y

In my my_board/my_app/zephyr/.config I see these coredump-related configs:

CONFIG_ARCH_SUPPORTS_COREDUMP=y
CONFIG_ARCH_SUPPORTS_COREDUMP_THREADS=y

# CONFIG_COREDUMP_DEVICE is not set

CONFIG_DEBUG_THREAD_INFO=y
CONFIG_DEBUG_COREDUMP=y
# CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING is not set
CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y
# CONFIG_DEBUG_COREDUMP_BACKEND_OTHER is not set
# CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_MIN is not set
CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_THREADS=y
# CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_LINKER_RAM is not set
CONFIG_DEBUG_COREDUMP_FLASH_CHUNK_SIZE=64
CONFIG_DEBUG_COREDUMP_THREADS_METADATA=y

I am generating a coredump using this implementation:

void trigger_coredump(void)
{
    __ASSERT(0, "Forcing coredump");
}
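
For reference, __ASSERT() only raises a fatal error when CONFIG_ASSERT=y. An alternative trigger from thread context would be something like the sketch below (just a sketch, not verified against the flash backend):

#include <zephyr/kernel.h>

/* Sketch only: k_panic() raises K_ERR_KERNEL_PANIC and goes through the
 * same z_fatal_error() path as a failed __ASSERT(). */
void trigger_coredump_alt(void)
{
    k_panic();
}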

When I try to read the flash area after generating the coredump with nrfjprog --memrd 0xCF000 --w 32 --n 0x8000,
I get all 0xFF.

What am I missing?

I also tried to check myself by replacing CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y

with CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING=y,

hoping to see the coredump in my open RTT log, but nothing: when the coredump is triggered, the prints just stop.

  1. What am I missing? Why can't I find a coredump on the flash partition or in the RTT log?
  2. Can it be that the device does not have time to write the coredump before the actual crash? If so, how can I manage that?
  3. Is there some auto-deletion of the flash partition holding the coredump so that new coredumps can be saved, or is that something I have to manage myself after I read the coredump from flash? (See the sketch below.)
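
On question 3, looking at <zephyr/debug/coredump.h>, there seems to be a query/command API that could be used for this. Something like this sketch is what I had in mind (not verified on my side, error handling trimmed):

#include <zephyr/kernel.h>
#include <zephyr/debug/coredump.h>

/* Sketch: check for a stored dump after boot and erase it once it has
 * been read out, so the partition is free for the next crash. */
void coredump_partition_housekeeping(void)
{
    int ret = coredump_query(COREDUMP_QUERY_HAS_STORED_DUMP, NULL);

    if (ret == 1) {
        printk("Stored coredump found\n");
        /* ... read it out / upload it here ... */
        coredump_cmd(COREDUMP_CMD_ERASE_STORED_DUMP, NULL);
    } else if (ret == 0) {
        printk("No stored coredump\n");
    } else {
        printk("coredump_query failed: %d\n", ret);
    }
}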

Hope to read you soon

Best regards

Ziv

Parents
  • Hi

    I will look into your case. Just a quick question to start with: are you also using MCUboot?
    Regards

    Runar

  • hi Vidar

    You need to use LOG_PANIC() to be able to flush the log buffer from the fault handler. 

    Thanks for the LOG_PANIC(); it helped me follow what goes on in a crash scenario, but it brings up these questions:

    1. According to the log below, it seems that z_arm_fault() precedes z_fatal_error(), which calls coredump() and only after that calls k_sys_fatal_error_handler(). So I don't understand: if Memfault's implementation overrides this function to handle the crash data collection, and it runs after coredump(), why does it not get stuck even when asserts are enabled? Is it related to CONFIG_DEBUG_COREDUMP_BACKEND_OTHER, even though I don't see it used in coredump_backend_flash_partition.c? Or, if I want to reach that handler, do I need to set CONFIG_DEBUG_COREDUMP=n, and only then will I get to this handler before the system halts and be able to call the coredump save from it?

    2. When working with asserts enabled, according to the log at crash time I am entering z_arm_fault() twice. Any idea why that is?

    These are the logs:

    With asserts enabled, crash triggered by a bus fault:
    00> [00000006] <inf> AUGU_PUSH_B[00000007] <inf> os: In z_arm_fault
    00> [00000007] <err> os: ***** BUS FAULT *****
    00> [00000007] <err> os:   Imprecise data bus error
    ....
    00> [00000007] <err> os: Faulting instruction address (r15/pc): 0x000xxxx
    00> [00000008] <err> os: in z_fatal_error
    00> [00000008] <err> os: >>> ZEPHYR FATAL ERROR 26: Unknown error on CPU 0
    00> [00000008] <err> os: Current thread: 0xFFFFFFFF (augu_work_q)
    00> ASSERTION FAIL [((arch_is_in_isr() == 0) || ((timeout).ticks == (((k_timeout_t) {0})).ticks))] @ WEST_TOPDIR/zephyr/kernel/sem.c:136
    00>   
    00> [00000008] <inf> os: In z_arm_fault
    00> [00000008] <err> os: ***** HARD FAULT *****
    00> [00000008] <err> os:   Fault escalation (see below)
    00> [00000008] <err> os: ARCH_EXCEPT with reason 4
    00> 
    00> [00000008] <err> os: ....
    00> [00000008] <err> os: Faulting instruction address (r15/pc): 0x000XXXX
    00> [00000008] <err> os: in z_fatal_error
    00> [00000008] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
    00> [00000008] <err> os: Fault during interrupt handling
    00> 
    00> [00000008] <err> os: Current thread: 0xFFFFFFFF (augu_work_q)
    00> ASSERTION FAIL [((arch_is_in_isr() == 0) || ((timeout).ticks == (((k_timeout_t) {0})).ticks))] @ WEST_TOPDIR/zephyr/kernel/sem.c:136
    

    3. When using coredump_backend_nrf_flash_partition.c (I added some prints in it) I see that in the middle of the coredump work I get the print for the button-push event (I put the crash at the start of the button push handler). This seems strange to me; any idea why it suddenly pops up?

    4. When I use the default coredump flow (coredump_backend_flash_partition.c) with CONFIG_ASSERT=n and I try to crash with k_panic or k_oops, I still fail to save the coredump because of a timeout error in stream_flash_init(..). So it seems like the default coredump only works for bus fault errors (ZEPHYR FATAL ERROR 26: Unknown error on CPU). This is an issue, because when an end device in the field gets stuck in some scenario I want to reset it immediately, not lose the device or waste energy and time on a watchdog feed mechanism. Is there a built-in solution for handling dead-end scenarios on the end device? (Obviously I want to know that the device failed, what happened before, and how it got to that point, which is why I need the coredump; usually those scenarios are the device failing to bond or some indication of memory corruption.)

    5. I still haven't figured out why data isn't saved to flash properly in my use of coredump_backend_nrf_flash_partition.c. By adding prints in it I do see valid coredump values, but for some reason they do not seem to be written to the flash partition properly. I will update on that. (I also copied the file into my app instead of using it from within NCS; this way I can keep NCS clean, and it is easier if I need to modify the module.)

    hope to read you soon

    best regards

    Ziv

  • 1. When I try to implement my own k_sys_fatal_error_handler() in main, it does not override anything; I still get to the Zephyr error handling and my error handler is never called. What am I missing?

    Please post a code snippet showing how you have implemented this function. You can also check the zephyr.map file for your build to confirm whether you were able to redefine the function.
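
    For reference, a minimal sketch of how such an override typically looks (assuming a recent Zephyr tree where the second parameter is struct arch_esf; older trees use z_arch_esf_t; the default handler in kernel/fatal.c is weak, so an application definition takes precedence):

    #include <zephyr/kernel.h>
    #include <zephyr/fatal.h>
    #include <zephyr/logging/log.h>
    #include <zephyr/logging/log_ctrl.h>
    #include <zephyr/sys/reboot.h>

    LOG_MODULE_REGISTER(app_fatal);

    void k_sys_fatal_error_handler(unsigned int reason, const struct arch_esf *esf)
    {
        ARG_UNUSED(esf);

        /* Switch the logger to panic (synchronous) mode so buffered
         * messages are flushed before the reset. */
        LOG_PANIC();
        LOG_ERR("Fatal error, reason %u", reason);

        sys_reboot(SYS_REBOOT_COLD);
        CODE_UNREACHABLE;
    }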

    2. When working with asserts enabled, according to the log at crash time I am entering z_arm_fault() twice. Any idea why that is?

    As mentioned earlier, the CD flash backend is using semaphores (with timeouts) which will raise an assert when invoked from an interrupt context.

    3. When using coredump_backend_nrf_flash_partition.c (I added some prints in it) I see that in the middle of the coredump work I get the print for the button-push event (I put the crash at the start of the button push handler). This seems strange to me; any idea why it suddenly pops up?

    I expect all but zero latency interrupts to be masked at this point.

  • Please post a code snippet showing how you have implemented this function.

    I have managed to get it to override; it basically overrode before as well, it just seems that I don't always get to that function being called, and this is what I tried to ask here:

    1. According to the log below, it seems that z_arm_fault() precedes z_fatal_error(), which calls coredump() and only after that calls k_sys_fatal_error_handler(). So I don't understand: if Memfault's implementation overrides this function to handle the crash data collection, and it runs after coredump(), why does it not get stuck even when asserts are enabled? Is it related to CONFIG_DEBUG_COREDUMP_BACKEND_OTHER, even though I don't see it used in coredump_backend_flash_partition.c? Or, if I want to reach that handler, do I need to set CONFIG_DEBUG_COREDUMP=n, and only then will I get to this handler before the system halts and be able to call the coredump save from it?

    2. I tried to add logs in nrf/modules/memfault-firmware-sdk/memfault_flash_coredump_storage.c to follow how Memfault saves the coredump, but when Memfault is enabled I do not get any logs. Even when I add LOG_PANIC() I just get the first call to z_arm_fault, and after that the device resets, so I am not even sure it gets to the call of k_sys_fatal_error_handler(). If it does not, which crash API does Memfault override to take control of the flow?

    p.s. 

    I expect all but zero latency interrupts to be masked at this point.

    It seems like it is log latency and not an actual interrupt being called while the coredump operation is running.

    hope to read you soon

    best regards

    Ziv

  • Yes, if the coredump backend triggers a fault or an assertion, the program will crash again before reaching your error handler. You can imagine what happens if the error handler calls something that triggers another error and keeps invoking itself. But you should be able to catch faults occurring in the error handler if you debug your code in VS Code; a CPU lockup in debug mode will not trigger a reset but instead halt the program.

    ziv123 said:
    2. I tried to add logs in nrf/modules/memfault-firmware-sdk/memfault_flash_coredump_storage.c to follow how Memfault saves the coredump, but when Memfault is enabled I do not get any logs. Even when I add LOG_PANIC() I just get the first call to z_arm_fault, and after that the device resets, so I am not even sure it gets to the call of k_sys_fatal_error_handler(). If it does not, which crash API does Memfault override to take control of the flow?

    The logger will not work in the case of a CPU lockup, so it is not possible to rely on logging in this case.

    ziv123 said:
    It seems like it is log latency and not an actual interrupt being called while the coredump operation is running.

    When using LOG_MODE_DEFERRED, logs are buffered and processed in the logger thread, or when you call LOG_PANIC().

    EDIT: adding a test sample I created that stores the stack frame and error reason (you can add other things here too) from the error handler to RAM, in case you are interested. This is without the coredump subsystem and will therefore have a lower memory footprint.

    crash_save.zip
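
    Just to illustrate the general idea of the attached sample (the zip itself is the reference; this sketch and all names in it are illustrative only): keep the fault reason and a copy of the exception stack frame in a __noinit RAM block from the error handler, reboot, and inspect the data after the restart:

    #include <zephyr/kernel.h>
    #include <zephyr/fatal.h>
    #include <zephyr/sys/reboot.h>
    #include <string.h>

    /* Illustrative layout; the real crash_save sample may differ. */
    struct crash_info {
        uint32_t magic;        /* marks the record as valid after reboot */
        unsigned int reason;   /* k_fatal_error_reason value */
        struct arch_esf esf;   /* copy of the exception stack frame */
    };

    #define CRASH_INFO_MAGIC 0xC0FFEE42

    static __noinit struct crash_info crash_info;

    void k_sys_fatal_error_handler(unsigned int reason, const struct arch_esf *esf)
    {
        crash_info.magic = CRASH_INFO_MAGIC;
        crash_info.reason = reason;
        if (esf != NULL) {
            memcpy(&crash_info.esf, esf, sizeof(crash_info.esf));
        }

        sys_reboot(SYS_REBOOT_COLD);
        CODE_UNREACHABLE;
    }

    /* Called early in main() after the reboot. */
    void check_previous_crash(void)
    {
        if (crash_info.magic == CRASH_INFO_MAGIC) {
            printk("Previous crash, reason %u\n", crash_info.reason);
            crash_info.magic = 0;
        }
    }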

  • hi Vidar

    Well, thanks for the patience and the sample.

    I found the issue in the version of coredump_backend_nrf_flash_partition.c from the PR that caused the data to be saved only partially, and managed to save the coredump to flash. However, according to git I see this bug is already fixed (so I'll keep searching for other opportunities to contribute).

    Basically it was these lines:

    for (uint32_t i = 0; i < size; i += sizeof(uint32_t)) {
        nrfx_nvmc_word_write(PARTITION_OFFSET + offset + i, UNALIGNED_GET(&data[i]));
    }

    data is of type uint8_t *, and since the loop advances 4 bytes per iteration while UNALIGNED_GET(&data[i]) only reads a single byte, 3 bytes of data are lost for every word written.
    So my solution, which also matches the fix in git, was to cast it like so:

    nrfx_nvmc_word_write(PARTITION_OFFSET + offset + i, UNALIGNED_GET((const uint32_t *)&data[i]));
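
    An equivalent, cast-free variant (just a sketch, using the same context and variables as the backend loop above; needs <string.h> and MIN() from <zephyr/sys/util.h>) is to assemble each word with memcpy, which also handles a tail that is not a multiple of 4 bytes:

    for (uint32_t i = 0; i < size; i += sizeof(uint32_t)) {
        uint32_t word = 0xFFFFFFFF;                 /* erased-flash padding for a short tail */
        size_t chunk = MIN(sizeof(word), size - i);

        memcpy(&word, &data[i], chunk);             /* byte copy, safe for any alignment */
        nrfx_nvmc_word_write(PARTITION_OFFSET + offset + i, word);
    }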
     

    Now I am struggling a bit with decoding the coredump. There are some Python scripts in Zephyr for that, but they don't seem to work so well; for example, I get this error when running coredump_gdbserver.py:

    python3 coredump_gdbserver.py /../app/build_b/app/zephyr/zephyr.elf /../coredump/coredump_flash.bin 
    Traceback (most recent call last):
      File "/.../ncs/v2.8.0/zephyr/scripts/coredump/coredump_gdbserver.py", line 14, in <module>
        from coredump_parser.elf_parser import CoredumpElfFile
      File "/.../ncs/v2.8.0/zephyr/scripts/coredump/coredump_parser/elf_parser.py", line 10, in <module>
        import elftools
    ModuleNotFoundError: No module named 'elftools'

    Update: the above can be solved by running "pip install pyelftools" (there are still errors, because I am not feeding it a log file but a file of bytes saved as a hex string).

    Is there a script like coredump_serial_log_parser.py that accepts the flash-saved bytes instead of a log file and turns them into a binary, or is there a way to save the coredump as a binary in the first place? (This could probably save space on flash and mean less data to transfer OTA.)

    Another thing is big/little endianness. I notice that in the log the bytes appear in this order, '5a450100' (taken from the log example in https://docs.zephyrproject.org/latest/services/debugging/coredump.html#:~:text=5a4501000100050000000000),

    but in the file I read from flash they appear like this: 0002455A. What is the correct byte order for it to be parsed?

    I would actually like to have one script that does it all (give it the zephyr.elf, the coredump, and an output file, and it produces a human-readable parsing of the coredump). I wonder if there is something like that?

    P.S. If you like, I can open the coredump parsing questions in a new thread and close this one.

    hope to read you soon 

    best regards

    Ziv

Children
  • Well, feeling a little neglected here.

    Anyway, here is what I have figured out so far:

    1. What is saved to flash is actually a binary, but when using "nrfjprog --memrd" to read it and save it to a file, it is converted to ASCII hex (this can also be verified by the size of the saved file, which is larger than the actual flash partition). To save the binary from flash, these commands can be used:

    JLinkExe -device nRF52840_xxAA -if swd -speed 4000 -autoconnect 1

    followed by: savebin coredump.bin,0xED000,0x8000 from within the J-Link console.

    2. Following the instructions for coredump debugging, here https://docs.zephyrproject.org/latest/services/debugging/coredump.html#:~:text=Run%20the%20core%20dump%20serial%20log%20converter%3A or here https://docs.nordicsemi.com/bundle/ncs-2.8.0/page/zephyr/services/debugging/coredump.html#:~:text=Run%20the%20core%20dump%20serial%20log%20converter%3A:

    The Python script coredump_serial_log_parser.py is designed to take a .log file (not sure where I get that from; maybe by copying my RTT output, removing the timestamps, log module names and so on up to '#CD', and saving it to a file). Anyway, since I know how to get a binary I don't need this converter, which does not really work with a flash coredump anyway, because the flash coredump is missing the "BEGIN" bytes and the "#CD" marker, and it also contains a backend header before the coredump's initial 'ZE'. So I can take my binary coredump, parse the first 4 words as the header, get the size of the coredump, and then save that many bytes starting right after the header (see the sketch below). The reason the bin file should not contain the header is that coredump_gdbserver.py calls log_parser.py, which expects to find 'ZE' at the very start.
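
    For illustration only, a rough sketch of that header-stripping step. It assumes, as described above, that the flash backend prepends a 16-byte header (the "first 4 words"); the exact layout, including where the dump length is stored, should be confirmed against coredump_backend_flash_partition.c, which is why this sketch simply copies everything after the header instead of trimming to the stored length:

    #include <stdint.h>
    #include <stdio.h>

    /* Assumption to verify against coredump_backend_flash_partition.c. */
    #define BACKEND_HDR_SIZE 16u

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <flash_dump.bin> <coredump.bin>\n", argv[0]);
            return 1;
        }

        FILE *in = fopen(argv[1], "rb");
        FILE *out = fopen(argv[2], "wb");

        if (in == NULL || out == NULL) {
            perror("fopen");
            return 1;
        }

        /* Skip the backend header so the output starts with the 'ZE'
         * magic that the Zephyr coredump parser scripts expect. */
        if (fseek(in, BACKEND_HDR_SIZE, SEEK_SET) != 0) {
            perror("fseek");
            return 1;
        }

        uint8_t buf[512];
        size_t n;

        while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
            fwrite(buf, 1, n, out);
        }

        fclose(in);
        fclose(out);
        return 0;
    }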

    P.S. In the coredump saved to flash the bytes seem to be in reverse order compared to how they appear in the log prints over RTT. I think it is just a matter of how the tools present the ASCII, because the bin file, without any reverse operation, is read by log_parser.py and 'ZE' is found (see the small example below).
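
    A small standalone sketch of why the same bytes can look "reversed": dumping a buffer byte by byte (as the logging backend does) versus reading it back as little-endian 32-bit words (as a word-oriented memory read does); the data itself is unchanged:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* First bytes of a coredump as they sit in flash: 'Z', 'E', then a version field. */
        const uint8_t buf[4] = { 0x5A, 0x45, 0x01, 0x00 };
        uint32_t word;

        /* Byte by byte, like the serial/RTT log: prints 5a450100 */
        for (size_t i = 0; i < sizeof(buf); i++) {
            printf("%02x", buf[i]);
        }
        printf("\n");

        /* As one little-endian 32-bit word, like a word-oriented memory
         * dump shows it: prints 0001455a */
        memcpy(&word, buf, sizeof(word));
        printf("%08x\n", word);

        return 0;
    }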

    So far this is what I have got working and understood. Please correct me if I am wrong somewhere.

    The issue: I am running "/opt/zephyr-sdk/arm-zephyr-eabi/bin/arm-zephyr-eabi-gdb build/zephyr/zephyr.elf" (instead of the non-existing <path to SDK>/x86_64-zephyr-elf/bin/x86_64-zephyr-elf-gdb) and trying to connect, and then I get this error:

    (gdb)  target remote localhost:1234
    localhost:1234
    : cannot resolve name: Servname not supported for ai_socktype
    localhost:1234
    : No such file or directory.
    

    even though coredump_gdbserver.py is running in another terminal.

    Any idea why? Update: it seems it was related to the order of things; the server must be running before the GDB client is opened, even if the attempt to connect from inside GDB happens after the server has started.

    Are there more debug options, other than "info registers" and "bt", to get more information on the running threads and stacks?

    I'll mention that I am working with a coredump bin generated with the normal build mechanism (

    CONFIG_DEBUG_COREDUMP=y, 
    CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION=y, 
    CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_THREADS=y, 
    CONFIG_ASSERT=n )
     
    And I'll also mention that it is very strange to me that there is no clear tool or support for parsing/debugging a flash-saved coredump.
    hope to read you soon
    best regards
    Ziv
