MPU fault with NCS Zephyr on nRF52840s

I'm getting the following intermittent MPU fault when reading data over BLE.

Both the BLE transmitter and the receiver use an nRF52840, though the receiver is a Laird MG100.

[01:39:56.263,153] <err> os: mem_manage_fault: ***** MPU FAULT *****
[01:39:56.270,446] <err> os: mem_manage_fault:   Data Access Violation
[01:39:56.277,923] <err> os: mem_manage_fault:   MMFAR Address: 0x0
[01:39:56.285,156] <err> os: esf_dump: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00000040
[01:39:56.295,074] <err> os: esf_dump: r3/a4:  0x20000b14 r12/ip:  0x200010a0 r14/lr:  0x00061195
[01:39:56.304,992] <err> os: esf_dump:  xpsr:  0x610f0000
[01:39:56.311,401] <err> os: esf_dump: Faulting instruction address (r15/pc): 0x00015924
[01:39:56.320,495] <err> os: z_fatal_error: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[01:39:56.329,925] <err> os: z_fatal_error: Current thread: 0x20003cd0 (main)
[01:39:56.338,073] <err> fatal_error: k_sys_fatal_error_handler: Resetting system

I'm using Zephyr OS build v3.2.99-ncs2.

I've traced r14/lr to `/nordic/v2.3.0/zephyr/kernel/sched.c:884`:

struct k_thread *z_unpend_first_thread(_wait_q_t *wait_q)
{
	struct k_thread *thread = NULL;

	LOCKED(&sched_spinlock) {
		thread = _priq_wait_best(&wait_q->waitq);

		if (thread != NULL) {                  <---------- THIS IS THE OFFENDING LINE
			unpend_thread_no_timeout(thread);
			(void)z_abort_thread_timeout(thread);
		}
	}

	return thread;
}

I've traced the faulting instruction address (r15/pc) to `/nordic\toolchains\v2.3.0\opt\zephyr-sdk\arm-zephyr-eabi\arm-zephyr-eabi\sys-include\ssp/string.h:86`:

...
__ssp_bos_icheck3(memset, void *, int)
...
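
One possible reading of the dump, offered as a hypothesis rather than a diagnosis: under the ARM procedure call standard the first three arguments are passed in r0, r1 and r2, so a pc inside memset together with r0 = 0, r2 = 0x40 and MMFAR = 0x0 is consistent with memset being asked to clear 64 bytes at a NULL destination. Registers may have been reused by the time the fault is taken, so this is only suggestive. A minimal sketch of a defensive wrapper that would turn such a call into an error code instead of a fault (`checked_clear` is a hypothetical name, not part of Zephyr or the NCS):

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: zero a buffer, but refuse a NULL destination
 * instead of letting memset dereference address 0x0.
 * Returns 0 on success, -EINVAL if buf is NULL. */
int checked_clear(void *buf, size_t len)
{
	if (buf == NULL) {
		return -EINVAL;
	}

	memset(buf, 0, len);
	return 0;
}
```

Logging the caller whenever this returns -EINVAL would narrow down which code path passes the bad pointer.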

I'm unsure how to further debug this, as this appears to pertain to code working at a much lower level than I'm really used to.

Can anyone provide guidance?

0 Håkon Alseth over 2 years ago

Hi,

Looks like the current thread is out of memory. Have you tried increasing CONFIG_MAIN_STACK_SIZE to see if this helps the scenario?

Kind regards,

Håkon

0 scath over 2 years ago in reply to Håkon Alseth

Apologies for the late reply.

Have doubled the available memory and begun testing. Will report back asap.

Thanks!

0 Håkon Alseth over 2 years ago in reply to scath

Hi,

scath said:
Apologies for the late reply.

No worries.

scath said:
Have doubled the available memory and begun testing.

Let me know if the issue still occurs.

Kind regards,

Håkon

0 scath over 2 years ago in reply to Håkon Alseth

Hi Hakon,

This hasn't helped - I still get the same error at the same rate (roughly once per hour).

0 Håkon Alseth over 2 years ago in reply to scath
Hi,

Q1: Is the fault output still similar? Does addr2line still point to a memset() call?

Q2: Have you tried to see what the register values that hold RAM addresses (i.e. those of the form 0x2000xxxx) point to? Unfortunately this is a manual operation: you have to look them up in the build-folder/zephyr/zephyr.map file.

Q3: Could you set this configuration?

CONFIG_RESET_ON_FATAL_ERROR=n

This keeps the device from resetting when a fault occurs, so that you can connect the debugger and inspect the call stack (in addition to the fault output). Please share it, so we can see whether anything there helps us backtrack to where the fault occurs.
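
For the manual lookup in Q2, the idea is to find the map entry whose address range contains the register value. A minimal host-side sketch of that lookup, assuming the relevant entries have already been copied by hand from zephyr.map into a table sorted by address (the `map_sym` struct and `map_lookup` function are hypothetical names for illustration, not part of any Zephyr tooling):

```c
#include <stddef.h>
#include <stdint.h>

/* One entry transcribed from the map file (hypothetical layout). */
struct map_sym {
	uint32_t addr;    /* start address of the symbol */
	uint32_t size;    /* size of the symbol in bytes */
	const char *name; /* symbol name as listed in the map file */
};

/* Binary search for the symbol whose [addr, addr + size) range
 * contains target. syms must be sorted by ascending addr and must
 * not overlap. Returns NULL if no entry contains the address. */
const struct map_sym *map_lookup(const struct map_sym *syms, size_t n,
				 uint32_t target)
{
	size_t lo = 0;
	size_t hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (target < syms[mid].addr) {
			hi = mid;
		} else if (target - syms[mid].addr >= syms[mid].size) {
			lo = mid + 1;
		} else {
			return &syms[mid];
		}
	}

	return NULL;
}
```

Feeding it the values from the dump (0x20000b14, 0x200010a0, 0x20003cd0) would show which objects r3, r12 and the current thread pointer refer to.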

Kind regards,

Håkon

0 scath over 1 year ago in reply to Håkon Alseth

Hi,

We recently changed settings related to TLS connections, and this seems to have changed the nature of the problem.

I no longer get MPU faults pointing to sched.c. Instead, the issue now appears to emanate from libc-hooks.c and uart_nrfx_uart.c. This new issue is regular: it repeats at roughly the same interval as the original issue, and the offending line has been the same across multiple tests.

I'm thinking that the original issue can be put on hold while I investigate the new issue.

I'm wondering if I should create a new ticket?

With thanks,

S.

0 Håkon Alseth over 1 year ago in reply to scath

Hi,

Faults tend to move when the firmware changes, because any change also alters the timing, so this is to be expected.

Feel free to share information related to any new faults that occur.

Could you try to look at the thread stack usage and see if any of your threads are "close to the limit" at any point? This can give an indication on which thread could be the culprit.

https://docs.zephyrproject.org/latest/services/debugging/thread-analyzer.html
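
Stack usage of this kind is typically measured with a watermark: the stack is filled with a known byte before the thread runs, and the bytes that still hold the pattern later are counted as never used. The sketch below illustrates that idea on a plain buffer. It is not Zephyr's implementation, and the names `stack_paint`, `stack_unused` and `STACK_FILL` as well as the fill value are assumptions made for the example:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xAAu /* assumed fill byte for this example */

/* Fill the whole stack buffer with the pattern before first use. */
void stack_paint(uint8_t *stack, size_t size)
{
	memset(stack, STACK_FILL, size);
}

/* Count bytes that still hold the pattern, starting at the low end.
 * The stack grows downward on Cortex-M, so the low addresses are the
 * last to be written; the count is the headroom that was never used. */
size_t stack_unused(const uint8_t *stack, size_t size)
{
	size_t n = 0;

	while (n < size && stack[n] == STACK_FILL) {
		n++;
	}

	return n;
}
```

A thread whose unused count approaches zero is the one "close to the limit". A caveat of the technique is that it undercounts if the thread happens to write the fill byte itself.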

Kind regards,

Håkon