Long running bt_conn_le_create_auto() breaks Bluetooth controller (nRF5340)

Hi,

Background

For some time now I have been troubled by a bug in one of our products acting as a BLE central device. The bug manifests itself as the central reporting that it is connected to several peripherals, but it receives no data from the peripherals and no disconnect events are generated when the peripherals are power-cycled. The peripherals themselves advertise, indicating that they are in fact not connected to any central.

The central works by receiving a list of assigned peripheral devices to which it maintains connections at all times. All peripherals are added to the filter accept list (FAL) upon assignment, and a connection supervision routine calls `bt_conn_le_create_auto()` when needed to connect/reconnect to the assigned peripherals.

Our naive use of the FAL and `bt_conn_le_create_auto()` means that if a device is assigned, but physically unavailable, we will forever have an outstanding `bt_conn_le_create_auto()` attempting to establish a connection to that unavailable node. This long running `bt_conn_le_create_auto()` attempt is what is causing the failure.

Description and How to Reproduce

I have found a small set of steps that can reliably reproduce the bug:

1. Set up the FAL to contain 3 addresses:
   - two devices running the peripheral_hr Zephyr sample on nRF5340 hardware
   - one address corresponding to a physically unavailable device
2. Continually call `bt_conn_le_create_auto()` whenever a device disconnects or reconnects.
3. Wait 50 minutes then power-cycle one of the peripherals.
4. The central will notice a disconnection event for the power-cycled device, but the second device will be silently dropped and the outstanding `bt_conn_le_create_auto()` initiation attempt will not be able to connect to any more devices.
   
In the resulting state, our experience is that the BT controller no longer functions as expected and many BT operations result in assertion errors. This part is not demonstrated by the included sample.

Attached is a sample application for the nRF5340 DK using the latest NCS release (v3.2.4). See the attached patch file (apply to the Zephyr repo at tag `ncs-v3.2.4`).

Additional Information

Our production application differs from the sample in significant ways. It is built on an older version of NCS (2.5.1), uses MPSL to run a proprietary protocol in addition to BLE, and uses different connection and scan intervals. The common denominator is the long outstanding `bt_conn_le_create_auto()`. Still, it takes about 50 minutes (the exact figure is unknown, but 40 minutes is too short) before the bug can be triggered.

It seems the problem can be worked around by avoiding long running `bt_conn_le_create_auto()` requests. By stopping with `bt_conn_create_auto_stop()` and restarting, the bug disappears. Calling `bt_conn_create_auto_stop()` and resuming works to avoid the bug even if 50 minutes have already passed.
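To make the workaround concrete, here is a minimal sketch of the restart policy, not our production code. It is written so it compiles on a host without Zephyr: the `start` and `stop` hooks are hypothetical wrappers that, on target, would call `bt_conn_le_create_auto(BT_CONN_LE_CREATE_CONN_AUTO, BT_LE_CONN_PARAM_DEFAULT)` and `bt_conn_create_auto_stop()` respectively. The struct, the function names, and the 10 minute limit used in the usage note are all assumptions chosen for illustration; the only thing taken from our observations is that the limit must stay well below the roughly 40 to 50 minutes at which the failure appears.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hook type. On Zephyr, `start` would wrap
 * bt_conn_le_create_auto() and `stop` would wrap
 * bt_conn_create_auto_stop(); both return 0 on success. */
typedef int (*initiator_fn)(void);

struct auto_conn_supervisor {
    initiator_fn start;
    initiator_fn stop;
    uint32_t max_age_s;    /* restart once an attempt is this old (assumed value) */
    uint32_t started_at_s; /* time the current attempt was started */
    bool running;          /* true while an attempt is outstanding */
};

/* Start a new auto-connect attempt and record when it began. */
static int supervisor_start(struct auto_conn_supervisor *s, uint32_t now_s)
{
    int err = s->start();

    if (err == 0) {
        s->running = true;
        s->started_at_s = now_s;
    }
    return err;
}

/* Call periodically (for example from delayable work). Starts an attempt
 * if none is outstanding, and stops and restarts one that has been
 * outstanding for max_age_s or longer. Returns 0 or the hook's error. */
static int supervisor_poll(struct auto_conn_supervisor *s, uint32_t now_s)
{
    if (!s->running) {
        return supervisor_start(s, now_s);
    }
    /* Unsigned subtraction stays correct across counter wrap-around. */
    if ((uint32_t)(now_s - s->started_at_s) < s->max_age_s) {
        return 0;
    }

    int err = s->stop();

    if (err != 0) {
        return err;
    }
    s->running = false;
    return supervisor_start(s, now_s);
}

/* Call from the connected callback: the attempt ends once a connection
 * is established, so the next poll has to start a fresh one. */
static void supervisor_on_connected(struct auto_conn_supervisor *s)
{
    s->running = false;
}

/* Host-side stand-ins so the sketch can be exercised without Zephyr. */
static int start_calls, stop_calls;
static int fake_start(void) { start_calls++; return 0; }
static int fake_stop(void) { stop_calls++; return 0; }
```

With `max_age_s` set to 600, polling once a second never leaves a single attempt outstanding for more than about 10 minutes. A real implementation would also skip the restart when all assigned peripherals are already connected.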

Request

I would like a Nordic engineer to investigate this issue and determine whether this is a bug in the SoftDevice Controller (SDC) that requires a fix, or if the observed behavior is expected. If it is expected or if a fix will not be provided, I request confirmation that the workaround (periodically stopping and restarting `bt_conn_le_create_auto()`) is expected to work reliably given your insights into the SDC internals.

Thank you for your support.

0001-Sample-to-reproduce-error-with-long-running-bt_conn_.patch

  • Hi,

    I made an error when building the sample in the NCS environment which I set up for this purpose. Previously I have been working in an old version of NCS which does not use sysbuild. So I am used to hci_rpmsg being built as a child image by default when Bluetooth is enabled. I didn't notice that my tests did not build and flash a netcore image, so I have been running with a hci_rpmsg app built from NCS 2.5.1 in the netcore.

    I copied `Kconfig.sysbuild` from one of the projects under nrf/samples/bluetooth and rebuilt. Now my HCI and LMP versions match yours.

    I will repeat the experiment and report back. I expect it will work without error, same as it did for you.

    My apologies.
