Probability of HARD FAULT occurring during the BLE connection process

666 3 months ago

Hi all,

During the Bluetooth connection process, one of the two faults below occasionally occurs. Both come from the same firmware, and in both cases I2C is reading data every 50 ms at the time (I don't know whether that has an impact). Each fault goes through an assert, but no log showing the file and line number of the assert is printed (CONFIG_RESET_ON_FATAL_ERROR=n, CONFIG_ASSERT=y). Do you have any suggestions for eliminating these faults?

Error phenomenon 1:

13> [00:08:29.757,840] <err> os: ***** HARD FAULT *****
13> [00:08:29.757,979] <err> os: Fault escalation (see below)
13> [00:08:29.758,123] <err> os: ARCH_EXCEPT with reason 4
13>
13> [00:08:29.758,272] <err> os: r0/a1: 0x00000004 r1/a2: 0x0000013b r2/a3: 0x00000026
13> [00:08:29.758,465] <err> os: r3/a4: 0x00000004 r12/ip: 0x00027961 r14/lr: 0x00056bd7
13> [00:08:29.758,651] <err> os: xpsr: 0x290000f5
13> [00:08:29.758,787] <err> os: Faulting instruction address (r15/pc): 0x0006b164
13> [00:08:29.758,963] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
13> [00:08:29.759,129] <err> os: Fault during interrupt handling
13>
13> [00:08:29.759,271] <err> os: Current thread: 0x200053e0 (main)
13> [00:08:29.759,417] <err> os: Halting system

Note:

addr2line -e zephyr.elf -f 0x0006b164
assert_post_action
zephyr/lib/os/assert.c:44
addr2line -e zephyr.elf -f 0x00056bd7
sdc_assertion_handler
nrf/subsys/bluetooth/controller/hci_driver.c:315

Error phenomenon 2:

13> [00:03:55.820,388] <err> os: ***** HARD FAULT *****
13> [00:03:55.820,527] <err> os: Fault escalation (see below)
13> [00:03:55.820,671] <err> os: ARCH_EXCEPT with reason 4
13>
13> [00:03:55.820,820] <err> os: r0/a1: 0x00000004 r1/a2: 0x00000133 r2/a3: 0x00000015
13> [00:03:55.821,014] <err> os: r3/a4: 0x00000004 r12/ip: 0x007072a0 r14/lr: 0x0005785f
13> [00:03:55.821,202] <err> os: xpsr: 0x210000f5
13> [00:03:55.821,339] <err> os: Faulting instruction address (r15/pc): 0x0006b164
13> [00:03:55.821,519] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
13> [00:03:55.821,695] <err> os: Fault during interrupt handling
13>
13> [00:03:55.821,846] <err> os: Current thread: 0x20005318 (idle)
13> [00:03:55.822,000] <err> os: Halting system

Note:

addr2line -e zephyr.elf -f 0x0006b164
assert_post_action
zephyr/lib/os/assert.c:44
addr2line -e zephyr.elf -f 0x0005785f
m_assert_handler
nrf/subsys/mpsl/init/mpsl_init.c:307
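
A few fields in the two register dumps above can be decoded by hand. The sketch below is not from the SDK; the helper names are made up for illustration. It relies only on the Arm Cortex-M convention that the low 9 bits of xPSR hold the active exception number, that exception numbers 16 and above are external interrupts, and that bit 0 of a stacked LR only marks Thumb state.

```c
#include <stdint.h>

/* Low 9 bits of xPSR (the IPSR field) hold the active exception number. */
static inline uint32_t xpsr_exception_number(uint32_t xpsr)
{
	return xpsr & 0x1FFu;
}

/*
 * Exception numbers 16 and up are external interrupts (IRQ 0 is exception 16).
 * Returns -1 for thread mode and for system exceptions.
 */
static inline int32_t xpsr_irq_number(uint32_t xpsr)
{
	uint32_t exc = xpsr_exception_number(xpsr);

	return (exc >= 16u) ? (int32_t)(exc - 16u) : -1;
}

/* Bit 0 of LR only marks Thumb state; clear it to get the return address. */
static inline uint32_t thumb_return_address(uint32_t lr)
{
	return lr & ~1u;
}
```

Applied to the logs: both xpsr values (0x290000f5 and 0x210000f5) end in 0xf5, so exception 245, which is IRQ 229, was active in both crashes. That agrees with "Fault during interrupt handling" and suggests the same interrupt context each time; which peripheral owns that IRQ has to be looked up in the nRF54L15 interrupt table. Also, r1 is 0x13b = 315 in the first dump and 0x133 = 307 in the second, which matches the line numbers addr2line reports (hci_driver.c:315 and mpsl_init.c:307). That is probably the line argument still sitting in the register, but treat it as an observation, not a guarantee.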

nRF Connect SDK v3.1.0, nRF54L15

Looking forward to your reply!


0 Susheel Nuguru 2 months ago

I do not think using such a low priority for TWIM transactions directly causes the MPSL assert. The hard fault logs are a bit too generic to pinpoint the problem.

If we suspect that this is in fact an MPSL error, enable CONFIG_MPSL_ASSERT_HANDLER=y in your prj.conf and implement your own MPSL assert handler, something like the one below:

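#include <zephyr/kernel.h>
#include <zephyr/logging/log.h>
#include <mpsl/mpsl_assert.h>

/* Assumes this file already registers or declares a log module
 * (LOG_MODULE_REGISTER or LOG_MODULE_DECLARE), otherwise LOG_ERR will not build.
 */

/* Called by MPSL on an unrecoverable error when CONFIG_MPSL_ASSERT_HANDLER=y. */
void mpsl_assert_handle(char *file, uint32_t line)
{
	LOG_ERR("MPSL ASSERT: %s, %u", file ? file : "<null>", line);
	k_panic();
}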

This is also explained in NCS/nrf/doc/nrf/libraries/mpsl/mpsl_assert.rst

The Multiprotocol Service Layer assert library makes it possible to add a custom assert handler to the :ref:`nrfxlib:mpsl` library.
You can then use this assert handler to print custom error messages or log assert information.

:kconfig:option:`CONFIG_MPSL_ASSERT_HANDLER` enables the custom assert handler.
If enabled, the application must provide the definition of :c:func:`mpsl_assert_handle`.
The :c:func:`mpsl_assert_handle` function is invoked whenever the MPSL code encounters an unrecoverable error.

API documentation
*****************

| Header file: :file:`include/mpsl/mpsl_assert.h`

0 666 2 months ago in reply to Susheel Nuguru

Hi,

Sorry for the late reply. The issue is difficult to reproduce, so it took a long time. During this period I observed two cases.
First case: the log above had not been added yet, but the watchdog was disabled so that I could inspect the halted state in VS Code. This time there were no error logs when the problem occurred. I then connected the board to a J-Link on another computer and used the "Attach Debugger to Target" function in VS Code. I saw the output below, but I cannot analyze it and am not sure whether this way of capturing the state is reliable:

JLinkGDBServerCL: =thread-group-added,id="i1"
=cmd-param-changed,param="pagination",value="off"
arch_system_halt (reason=reason@entry=31) at E:/ncs/v3.1.0/zephyr/kernel/fatal.c:30
30 for (;;) {

Program received signal SIGTRAP, Trace/breakpoint trap.
arch_system_halt (reason=reason@entry=31) at E:/ncs/v3.1.0/zephyr/kernel/fatal.c:30
30 for (;;) {

Second case: the log above was added to main and the watchdog was still disabled, but the device restarted without any error logs. The reported reset reason was "Reset from CPU lockout detected".
Can you help explain these observations, or do you have any suggestions? I have run out of ideas.
Looking forward to your reply!
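
For reading the `reason` values that appear in the logs and in the GDB output, a small lookup can help. This is only a sketch: the function names are hypothetical, and the numeric values mirror Zephyr's generic fatal error reasons (0 to 4) and the start of the architecture-specific range (16) as I understand them, so they should be checked against the headers in the v3.1.0 tree.

```c
/* Assumed start of the architecture-specific reason range (K_ERR_ARCH_START). */
#define ARCH_REASON_START 16u

/* Returns 1 if the reason code lies in the architecture-specific range. */
int fatal_reason_is_arch_specific(unsigned int reason)
{
	return reason >= ARCH_REASON_START;
}

/* Maps a fatal error reason code to a short description. */
const char *fatal_reason_str(unsigned int reason)
{
	switch (reason) {
	case 0:
		return "CPU exception";
	case 1:
		return "Unhandled interrupt";
	case 2:
		return "Stack overflow";
	case 3:
		return "Kernel oops";
	case 4:
		return "Kernel panic";
	default:
		return fatal_reason_is_arch_specific(reason) ?
			"Architecture-specific fault" : "Unknown error";
	}
}
```

Under these assumptions, the reason 4 in the original logs is a kernel panic (what an assert handler calling k_panic() produces), while the reason 31 seen in arch_system_halt falls in the architecture-specific range. That would make it a different class of failure from the first two logs; its exact meaning has to be looked up in the Arm fault reason enum of the SDK version in use.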
