Beware that this post is related to an SDK in maintenance mode
More Info: Consider nRF Connect SDK for new designs
This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

softdevice: Hard Fault / Advertising flash contents on extended advertising

On the latest master of https://github.com/nrfconnect/sdk-nrf (v1.7.0-rc1), swapping between extended and standard advertising will result in reads/ attempted writes to address 0x00.

Depending on the application configuration, this will result in either a hard fault, or advertising the contents of flash at 0x00 instead of the expected AD structures.

I have attached a modification of the standard scan_adv Zephyr sample which demonstrates this issue consistently on a Particle Argon board within the first few seconds.

With CONFIG_MPU_ALLOW_FLASH_WRITE enabled, the sample will advertise flash contents, if it is disabled it will hard fault inside the softdevice with:

[00:00:02.017,028] <err> os: ***** MPU FAULT *****
[00:00:02.017,059] <err> os:   Data Access Violation
[00:00:02.017,059] <err> os:   MMFAR Address: 0x0
[00:00:02.017,059] <err> os: r0/a1:  0x00000000  r1/a2:  0x20008ec1  r2/a3:  0x0000000d
[00:00:02.017,059] <err> os: r3/a4:  0xffffffff r12/ip:  0x00000002 r14/lr:  0x0001a955
[00:00:02.017,059] <err> os:  xpsr:  0x01000000
[00:00:02.017,059] <err> os: Faulting instruction address (r15/pc): 0x000166f2
[00:00:02.017,089] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:02.017,089] <err> os: Current thread: 0x20001f38 (unknown)

Softdevice version information:

[00:00:00.006,347] <inf> sdc_hci_driver: SoftDevice Controller build revision:
                                         3f 47 70 8e 81 95 4e 86  9d d3 a2 95 88 f6 30 0a |?Gp...N. ......0.
                                         7f 53 49 fd                                      |.SI.
[00:00:00.007,751] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.007,751] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.007,781] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 63.28743 Build 1318420878
[00:00:00.008,483] <inf> bt_hci_core: Identity: DA:04:3D:69:EA:9B (random)
[00:00:00.008,483] <inf> bt_hci_core: HCI: version 5.2 (0x0b) revision 0x125b, manufacturer 0x0059
[00:00:00.008,483] <inf> bt_hci_core: LMP: version 5.2 (0x0b) subver 0x125b

This issue can be worked around by destroying and recreating the advertising set each time advertising is started, however I cannot validate if this simply makes the error less frequent.

As an example of received packets on a remote device of the sample application, where the 05 FF packets are valid and 28 64 00 20 packets are the flash contents at 0x00 (Stack pointer):

[02:48:22.000,061] <inf> app: 01/01/2020 02:48:59.00002
[02:48:23.000,061] <inf> app: 01/01/2020 02:49:00.00002
[02:48:23.717,254] <wrn> uc_bt_adv: ADV PKT
                                    28 64 00 20 11 ac                                |(d. ..
[02:48:24.000,061] <inf> app: 01/01/2020 02:49:01.00000
[02:48:24.722,106] <wrn> uc_bt_adv: ADV PKT
                                    05 ff ff ff d5 00                                |......
[02:48:25.000,030] <inf> app: 01/01/2020 02:49:02.00000
[02:48:25.720,336] <wrn> uc_bt_adv: ADV PKT
                                    05 ff ff ff d7 00                                |......
[02:48:26.000,030] <inf> app: 01/01/2020 02:49:03.00000
[02:48:26.723,693] <wrn> uc_bt_adv: ADV PKT
                                    28 64 00 20 11 ac                                |(d. ..

scan_adv.zip

  • Hi Jordan, 

    Thank you for the detailed description and sorry for the delayed reply.

    Have you tried this with nRF Connect SDK v1.6.1?

    Best regards, 

    Håkon

  • Not specifically, but the issue only appeared when updating the softdevice from https://github.com/nrfconnect/sdk-nrfxlib/commit/1a894638c438ce738c101b2c35894c9ffa2beb95 to the v1.7.0-rc1 commit.

    If I had to guess I would think its related to the fix for:

    * Fixed an issue where an assert may occur when legacy advertiser is used after "HCI LE Clear Advertising Sets" (DRGN-15993).

    In https://github.com/nrfconnect/sdk-nrfxlib/commit/f8c4cbb42dbacd2a8b2ac62d7a26b42d0dc32b7c

  • Hi Jordan, thank you very much for reporting this. I have notified the software team who will look into this issue. 

    Best regards, 

    Håkon

  • This seems to be an application thread overflow, it is not a SoftDevice assert. It is an MPU fault.

    Please check the thread with address 0x20001f38 as indicated in the log below. You can check the map file (build/zephyr/zephyr.map) or alternatively enable CONFIG_THREAD_NAME=y

    [00:00:02.017,028] <err> os: ***** MPU FAULT *****
    [00:00:02.017,059] <err> os:   Data Access Violation
    [00:00:02.017,059] <err> os:   MMFAR Address: 0x0
    [00:00:02.017,059] <err> os: r0/a1:  0x00000000  r1/a2:  0x20008ec1  r2/a3:  0x0000000d
    [00:00:02.017,059] <err> os: r3/a4:  0xffffffff r12/ip:  0x00000002 r14/lr:  0x0001a955
    [00:00:02.017,059] <err> os:  xpsr:  0x01000000
    [00:00:02.017,059] <err> os: Faulting instruction address (r15/pc): 0x000166f2
    [00:00:02.017,089] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
    [00:00:02.017,089] <err> os: Current thread: 0x20001f38 (unknown) 

    The reason for the assert to occur after updating NCS (SoftDevice controller) may have to do with a change in stack usage between versions.

  • I don't believe it's a stack overflow. I can manually trigger a stack overflow with the following function and `CONFIG_STACK_SENTINEL=y`, which outputs logs like:

    void stack_overflow(void)
    {
    	uint8_t *stack_pointer = (uint8_t *)(_current->stack_info.start + _current->stack_info.size / 2);
    
    	/* Write over complete stack */
    	while ((uintptr_t)stack_pointer > _current->stack_info.start) {
    		(*stack_pointer)++;
    		stack_pointer--;
    	}
    	/* Sleep to force context switch, triggering exception */
    	k_sleep(K_MSEC(100));
    }

    Starting Fault Demo
    Bluetooth initialized
    [00:00:01.011,932] <err> os: r0/a1:  0x00000002  r1/a2:  0x00008e54  r2/a3:  0xf0f0f0f0
    [00:00:01.011,932] <err> os: r3/a4:  0x20004da0 r12/ip:  0x20001af8 r14/lr:  0x0001df5d
    [00:00:01.011,932] <err> os:  xpsr:  0x61000000
    [00:00:01.011,932] <err> os: Faulting instruction address (r15/pc): 0x0001c886
    [00:00:01.011,932] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
    [00:00:01.011,962] <err> os: Current thread: 0x20001e98 (unknown)
    [00:00:01.056,732] <err> os: Halting system

    Removing the manual stock overflow still outputs the "MPU FAULT" error.

    Unless I misunderstand how the SoftDevice asserts are generated, attempting to write to 0x00 in the SoftDevice would still generate an MPU FAULT, not a SoftDevice assert. Unlike the binary blob variant of the Softdevice, I don't think the static library version intercepts the vector table.

    Furthermore, more than doubling the stack size of the thread which the assert occurs in with `CONFIG_BT_HCI_TX_STACK_SIZE=4096` does not change the behaviour.

Related