MPU Fault with NCS Zephyr on nRF52840s

I'm getting the following intermittent MPU fault when reading data over BLE. 

Both the BLE transmitter and the receiver use an nRF52840; the receiver is a Laird MG100.

[01:39:56.263,153] <err> os: mem_manage_fault: ***** MPU FAULT *****
[01:39:56.270,446] <err> os: mem_manage_fault:   Data Access Violation
[01:39:56.277,923] <err> os: mem_manage_fault:   MMFAR Address: 0x0
[01:39:56.285,156] <err> os: esf_dump: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00000040
[01:39:56.295,074] <err> os: esf_dump: r3/a4:  0x20000b14 r12/ip:  0x200010a0 r14/lr:  0x00061195
[01:39:56.304,992] <err> os: esf_dump:  xpsr:  0x610f0000
[01:39:56.311,401] <err> os: esf_dump: Faulting instruction address (r15/pc): 0x00015924
[01:39:56.320,495] <err> os: z_fatal_error: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[01:39:56.329,925] <err> os: z_fatal_error: Current thread: 0x20003cd0 (main)
[01:39:56.338,073] <err> fatal_error: k_sys_fatal_error_handler: Resetting system

I'm using Zephyr OS build v3.2.99-ncs2.

I've traced r14/lr to `/nordic/v2.3.0/zephyr/kernel/sched.c:884`:

struct k_thread *z_unpend_first_thread(_wait_q_t *wait_q)
{
	struct k_thread *thread = NULL;

	LOCKED(&sched_spinlock) {
		thread = _priq_wait_best(&wait_q->waitq);

		if (thread != NULL) {                  /* <---------- THIS IS THE OFFENDING LINE */
			unpend_thread_no_timeout(thread);
			(void)z_abort_thread_timeout(thread);
		}
	}

	return thread;
}
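
A note on the lr lookup above: on Cortex-M the low bit of a code address held in lr marks Thumb state, so 0x00061195 refers to the instruction at 0x00061194. Also, lr holds a return address, so it shows where execution would resume rather than the faulting instruction itself. A minimal sketch of the masking step (the helper name `thumb_to_addr` is made up for illustration):

```c
#include <stdint.h>

/* Hypothetical helper: clear bit 0 (the Thumb state bit) of a code
 * address taken from the exception stack frame. */
static uint32_t thumb_to_addr(uint32_t reg_value)
{
	return reg_value & ~UINT32_C(1);
}
```

With the values from the log, r14/lr 0x00061195 maps to 0x00061194, while r15/pc 0x00015924 is already even and stays unchanged.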


I've traced the faulting instruction address (r15/pc) to `/nordic\toolchains\v2.3.0\opt\zephyr-sdk\arm-zephyr-eabi\arm-zephyr-eabi\sys-include\ssp/string.h:86`:
...
__ssp_bos_icheck3(memset, void *, int)
...
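
Putting the pieces together: the pc is inside memset and MMFAR is 0x0. If the stacked r0-r2 still hold memset's arguments (an assumption, since memset may have changed them before faulting), the call would look like memset(NULL, 0, 0x40), i.e. a NULL destination pointer. One way to turn that into a catchable error while hunting for the caller is a guarded wrapper. This is only a sketch; `checked_memset` is a made-up name, not a Zephyr or libc function:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: refuse a NULL destination instead of letting
 * memset write to address 0x0 and trigger the MPU fault. */
static int checked_memset(void *dst, int value, size_t len)
{
	if (dst == NULL) {
		return -EINVAL;
	}
	memset(dst, value, len);
	return 0;
}
```

Logging the caller whenever this returns -EINVAL would show which buffer pointer was never set.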


I'm unsure how to further debug this, as this appears to pertain to code working at a much lower level than I'm really used to.

Can anyone provide guidance?

  • Hi,

     

    Q1: Is the assert message still similar? Does addr2line point to a memset() call?

    Q2: Have you tried to see where the RAM-mapped register values (i.e. those with 0x2000xxxx) point to? Unfortunately, this is a manual operation: you have to look them up in the build-folder/zephyr/zephyr.map file.

    Q3: Could you set this configuration?

    CONFIG_RESET_ON_FATAL_ERROR=n

    This will ensure that you do not reset when a fault occurs, so that you can connect the debugger and see the callstack (in addition to the assert output). Please share this, to see if there's anything there to help us backtrack where this occurs.
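
The manual map lookup from Q2 boils down to finding the entry whose address range contains the register value (for example r3/a4 = 0x20000b14 or r12/ip = 0x200010a0 from the log). A rough sketch of that range check is below. The table entries and symbol names are placeholders for illustration only and are NOT taken from the real zephyr.map; `map_entry` and `find_symbol` are likewise made-up names:

```c
#include <stddef.h>
#include <stdint.h>

struct map_entry {
	uint32_t addr;    /* start address as listed in zephyr.map */
	uint32_t size;    /* size in bytes as listed in zephyr.map */
	const char *name; /* symbol name */
};

/* Return the entry whose [addr, addr + size) range contains target,
 * or NULL if no entry covers it. */
static const struct map_entry *find_symbol(const struct map_entry *tab,
					   size_t count, uint32_t target)
{
	for (size_t i = 0; i < count; i++) {
		if (target >= tab[i].addr &&
		    target - tab[i].addr < tab[i].size) {
			return &tab[i];
		}
	}
	return NULL;
}

/* Placeholder entries only: replace with rows copied from zephyr.map. */
static const struct map_entry example_map[] = {
	{ 0x20000b00, 0x40, "hypothetical_symbol_a" },
	{ 0x20001000, 0x100, "hypothetical_symbol_b" },
};
```

An address that falls inside a thread stack or a buffer owned by a specific module narrows down which code was running when the fault hit.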

     

    Kind regards,

    Håkon

  • Hi,

    We recently changed settings related to TLS connections, and this seems to have changed the nature of the problem. 

    I no longer get MPU Faults pointing to sched.c. Instead, the issue now appears to emanate from libc-hooks.c and uart_nrfx_uart.c. This new issue is regular: it repeats at roughly the same interval as the original issue, and the offending line has been consistently the same over multiple tests.

    I'm thinking that the original issue can be put on hold while I investigate the new issue.

    I'm wondering if I should create a new ticket?

    With thanks,

    S.

  • Hi,

     

    Faults tend to move when the firmware changes, because the timing changes as well, so this is to be expected.

    Feel free to share information related to any new faults that occur.

     

    Could you try to look at the thread stack usage and see if any of your threads are "close to the limit" at any point? This can give an indication of which thread could be the culprit.

    https://docs.zephyrproject.org/latest/services/debugging/thread-analyzer.html
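
For background on what "close to the limit" means: stack usage is typically measured by painting the stack with a known byte before the thread runs and later counting how much of the pattern is still intact. The sketch below shows that idea in plain C on an ordinary buffer. It is not the thread analyzer's implementation; the 0xAA fill byte is assumed to match what Zephyr uses with CONFIG_INIT_STACKS, and the function names are made up:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill byte; assumed to match Zephyr's CONFIG_INIT_STACKS pattern. */
#define STACK_FILL 0xAA

/* Paint the whole stack buffer before the thread starts running. */
static void stack_paint(uint8_t *stack, size_t size)
{
	memset(stack, STACK_FILL, size);
}

/* Count untouched bytes from the low end of the buffer. A descending
 * stack grows toward index 0, so these bytes were never used. */
static size_t stack_unused(const uint8_t *stack, size_t size)
{
	size_t unused = 0;

	while (unused < size && stack[unused] == STACK_FILL) {
		unused++;
	}
	return unused;
}
```

A thread whose unused count gets close to zero is a likely overflow candidate, and an overflow can corrupt neighbouring data in ways that show up as faults elsewhere.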

      

    Kind regards,

    Håkon
