Beware that this post is related to an SDK in maintenance mode
More Info: Consider nRF Connect SDK for new designs

This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

softdevice: Hard Fault / Advertising flash contents on extended advertising

On the latest master of https://github.com/nrfconnect/sdk-nrf (v1.7.0-rc1), swapping between extended and standard advertising will result in reads/ attempted writes to address 0x00.

Depending on the application configuration, this will result in either a hard fault, or advertising the contents of flash at 0x00 instead of the expected AD structures.

I have attached a modification of the standard scan_adv Zephyr sample which demonstrates this issue consistently on a Particle Argon board within the first few seconds.

With CONFIG_MPU_ALLOW_FLASH_WRITE enabled, the sample will advertise flash contents, if it is disabled it will hard fault inside the softdevice with:

[00:00:02.017,028] <err> os: ***** MPU FAULT *****
[00:00:02.017,059] <err> os:   Data Access Violation
[00:00:02.017,059] <err> os:   MMFAR Address: 0x0
[00:00:02.017,059] <err> os: r0/a1:  0x00000000  r1/a2:  0x20008ec1  r2/a3:  0x0000000d
[00:00:02.017,059] <err> os: r3/a4:  0xffffffff r12/ip:  0x00000002 r14/lr:  0x0001a955
[00:00:02.017,059] <err> os:  xpsr:  0x01000000
[00:00:02.017,059] <err> os: Faulting instruction address (r15/pc): 0x000166f2
[00:00:02.017,089] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:02.017,089] <err> os: Current thread: 0x20001f38 (unknown)

Softdevice version information:

[00:00:00.006,347] <inf> sdc_hci_driver: SoftDevice Controller build revision:
                                         3f 47 70 8e 81 95 4e 86  9d d3 a2 95 88 f6 30 0a |?Gp...N. ......0.
                                         7f 53 49 fd                                      |.SI.
[00:00:00.007,751] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.007,751] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.007,781] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 63.28743 Build 1318420878
[00:00:00.008,483] <inf> bt_hci_core: Identity: DA:04:3D:69:EA:9B (random)
[00:00:00.008,483] <inf> bt_hci_core: HCI: version 5.2 (0x0b) revision 0x125b, manufacturer 0x0059
[00:00:00.008,483] <inf> bt_hci_core: LMP: version 5.2 (0x0b) subver 0x125b

This issue can be worked around by destroying and recreating the advertising set each time advertising is started, however I cannot validate if this simply makes the error less frequent.

As an example of received packets on a remote device of the sample application, where the 05 FF packets are valid and 28 64 00 20 packets are the flash contents at 0x00 (Stack pointer):

[02:48:22.000,061] <inf> app: 01/01/2020 02:48:59.00002
[02:48:23.000,061] <inf> app: 01/01/2020 02:49:00.00002
[02:48:23.717,254] <wrn> uc_bt_adv: ADV PKT
                                    28 64 00 20 11 ac                                |(d. ..
[02:48:24.000,061] <inf> app: 01/01/2020 02:49:01.00000
[02:48:24.722,106] <wrn> uc_bt_adv: ADV PKT
                                    05 ff ff ff d5 00                                |......
[02:48:25.000,030] <inf> app: 01/01/2020 02:49:02.00000
[02:48:25.720,336] <wrn> uc_bt_adv: ADV PKT
                                    05 ff ff ff d7 00                                |......
[02:48:26.000,030] <inf> app: 01/01/2020 02:49:03.00000
[02:48:26.723,693] <wrn> uc_bt_adv: ADV PKT
                                    28 64 00 20 11 ac                                |(d. ..

scan_adv.zip

Top Replies

helsing over 3 years ago in reply to JordanYates +1 verified

We can confirm there is a bug in SDC in v1.7.0-rc1 (and v1.7.1 as well). The module that allocates adv buffers seems to have a bug so it returns NULL pointer when it should have not. That module was completely…

0 helsing over 3 years ago

Hi Jordan,

Thank you for the detailed description and sorry for the delayed reply.

Have you tried this with nRF Connect SDK v1.6.1?

Best regards,

Håkon
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 JordanYates over 3 years ago in reply to helsing

Not specifically, but the issue only appeared when updating the softdevice from https://github.com/nrfconnect/sdk-nrfxlib/commit/1a894638c438ce738c101b2c35894c9ffa2beb95 to the v1.7.0-rc1 commit.

If I had to guess I would think its related to the fix for:

* Fixed an issue where an assert may occur when legacy advertiser is used after "HCI LE Clear Advertising Sets" (DRGN-15993).

In https://github.com/nrfconnect/sdk-nrfxlib/commit/f8c4cbb42dbacd2a8b2ac62d7a26b42d0dc32b7c
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 helsing over 3 years ago in reply to JordanYates

Hi Jordan, thank you very much for reporting this. I have notified the software team who will look into this issue.

Best regards,

Håkon
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

0 helsing over 3 years ago in reply to JordanYates

This seems to be an application thread overflow, it is not a SoftDevice assert. It is an MPU fault.

Please check the thread with address 0x20001f38 as indicated in the log below. You can check the map file (build/zephyr/zephyr.map) or alternatively enable CONFIG_THREAD_NAME=y

[00:00:02.017,028] <err> os: ***** MPU FAULT *****

[00:00:02.017,059] <err> os:   Data Access Violation

[00:00:02.017,059] <err> os:   MMFAR Address: 0x0

[00:00:02.017,059] <err> os: r0/a1:  0x00000000  r1/a2:  0x20008ec1  r2/a3:  0x0000000d

[00:00:02.017,059] <err> os: r3/a4:  0xffffffff r12/ip:  0x00000002 r14/lr:  0x0001a955

[00:00:02.017,059] <err> os:  xpsr:  0x01000000

[00:00:02.017,059] <err> os: Faulting instruction address (r15/pc): 0x000166f2

[00:00:02.017,089] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0

[00:00:02.017,089] <err> os: Current thread: 0x20001f38 (unknown)

The reason for the assert to occur after updating NCS (SoftDevice controller) may have to do with a change in stack usage between versions.

0 JordanYates over 3 years ago in reply to helsing

I don't believe it's a stack overflow. I can manually trigger a stack overflow with the following function and `CONFIG_STACK_SENTINEL=y`, which outputs logs like:

void stack_overflow(void)
{
	uint8_t *stack_pointer = (uint8_t *)(_current->stack_info.start + _current->stack_info.size / 2);

	/* Write over complete stack */
	while ((uintptr_t)stack_pointer > _current->stack_info.start) {
		(*stack_pointer)++;
		stack_pointer--;
	}
	/* Sleep to force context switch, triggering exception */
	k_sleep(K_MSEC(100));
}

Starting Fault Demo
Bluetooth initialized
[00:00:01.011,932] <err> os: r0/a1:  0x00000002  r1/a2:  0x00008e54  r2/a3:  0xf0f0f0f0
[00:00:01.011,932] <err> os: r3/a4:  0x20004da0 r12/ip:  0x20001af8 r14/lr:  0x0001df5d
[00:00:01.011,932] <err> os:  xpsr:  0x61000000
[00:00:01.011,932] <err> os: Faulting instruction address (r15/pc): 0x0001c886
[00:00:01.011,932] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
[00:00:01.011,962] <err> os: Current thread: 0x20001e98 (unknown)
[00:00:01.056,732] <err> os: Halting system

Removing the manual stock overflow still outputs the "MPU FAULT" error.

Unless I misunderstand how the SoftDevice asserts are generated, attempting to write to 0x00 in the SoftDevice would still generate an MPU FAULT, not a SoftDevice assert. Unlike the binary blob variant of the Softdevice, I don't think the static library version intercepts the vector table.

Furthermore, more than doubling the stack size of the thread which the assert occurs in with `CONFIG_BT_HCI_TX_STACK_SIZE=4096` does not change the behaviour.