Long running bt_conn_le_create_auto() breaks Bluetooth controller (nRF5340)

Hi,

Background

For some time now I have been troubled by a bug in one of our products acting as a BLE central device. The bug manifests itself as the central reporting that it is connected to several peripherals, but it receives no data from the peripherals and no disconnect events are generated when the peripherals are power-cycled. The peripherals themselves advertise, indicating that they are in fact not connected to any central.

The central works by receiving a list of assigned peripheral devices to which it maintains connections at all times. All peripherals are added to the filter accept list (FAL) upon assignment, and a connection supervision routine calls `bt_conn_le_create_auto()` when needed to connect/reconnect to the assigned peripherals.

Our naive use of the FAL and `bt_conn_le_create_auto()` means that if a device is assigned but physically unavailable, we will forever have an outstanding `bt_conn_le_create_auto()` attempting to establish a connection to that unavailable node. This long-running `bt_conn_le_create_auto()` attempt is what causes the failure.
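The supervision decision described above comes down to one question: is any assigned peripheral currently without a connection? Here is a minimal, host-testable sketch of that check. `struct assigned_peer` and `initiator_needed()` are hypothetical names used for illustration only; they are not taken from our application or from the Zephyr API. In firmware, a `true` result is what would lead to a call to `bt_conn_le_create_auto()`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping entry for one assigned peripheral. */
struct assigned_peer {
    bool in_fal;     /* address has been added to the filter accept list */
    bool connected;  /* a connection to this peer is currently up */
};

/* Returns true if at least one peer in the FAL has no connection, i.e. the
 * central wants a bt_conn_le_create_auto() attempt to be outstanding. */
bool initiator_needed(const struct assigned_peer *peers, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (peers[i].in_fal && !peers[i].connected) {
            return true;
        }
    }
    return false;
}
```

With one assigned device that is physically unavailable, this check never becomes false, which is why the initiation attempt stays outstanding indefinitely.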

Description and How to Reproduce

I have found a small set of steps that can reliably reproduce the bug:

1. Set up the FAL to contain 3 addresses:
   - two devices running the peripheral_hr Zephyr sample on nRF5340 hardware
   - one address corresponding to a physically unavailable device
2. Continually call `bt_conn_le_create_auto()` whenever a device disconnects or reconnects.
3. Wait 50 minutes, then power-cycle one of the peripherals.
4. The central will notice a disconnection event for the power-cycled device, but the second device will be silently dropped and the outstanding `bt_conn_le_create_auto()` initiation attempt will not be able to connect to any more devices.
   
In the resulting state, our experience is that the BT controller no longer functions as expected and many BT operations result in assertion errors. This part is not demonstrated by the included sample.

Attached is a sample application for the nRF5340 DK using the latest NCS release (v3.2.4). See the attached patch file (apply to the Zephyr repo at tag `ncs-v3.2.4`).

Additional Information

Our production application differs from the sample in significant ways. It is built on an older version of NCS (2.5.1), uses MPSL to run a proprietary protocol in addition to BLE, and uses different connection and scan intervals. The common denominator is the long-outstanding `bt_conn_le_create_auto()`. Still, it takes about 50 minutes (the exact figure is unknown, but 40 minutes is too short) before the bug can be triggered.

It seems the problem can be worked around by avoiding long-running `bt_conn_le_create_auto()` requests. By stopping with `bt_conn_create_auto_stop()` and restarting, the bug disappears. Calling `bt_conn_create_auto_stop()` and resuming works to avoid the bug even if 50 minutes have already passed.
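The workaround can be sketched as a small policy function that the supervision routine evaluates periodically. This is only an illustration under stated assumptions: `initiator_next_action()` and the `initiator_action` enum are hypothetical names, and the maximum age passed in is an assumed value that would have to be chosen well below the roughly 40 to 50 minutes after which the bug has been seen. The comments name the Zephyr calls each action would map to in firmware.

```c
#include <stdbool.h>
#include <stdint.h>

/* What the supervision routine should do with the initiator right now. */
enum initiator_action {
    INITIATOR_KEEP,     /* nothing to do */
    INITIATOR_START,    /* call bt_conn_le_create_auto() */
    INITIATOR_RESTART,  /* bt_conn_create_auto_stop(), then bt_conn_le_create_auto() */
    INITIATOR_STOP,     /* call bt_conn_create_auto_stop() */
};

/*
 * running:       an auto-connect attempt is currently outstanding
 * wanted:        at least one assigned peripheral is not connected
 * started_at_ms: uptime when the outstanding attempt was started
 * now_ms:        current uptime, e.g. from k_uptime_get()
 * max_age_ms:    assumed upper bound on how long one attempt may run
 */
enum initiator_action initiator_next_action(bool running, bool wanted,
                                            int64_t started_at_ms,
                                            int64_t now_ms,
                                            int64_t max_age_ms)
{
    if (!running) {
        return wanted ? INITIATOR_START : INITIATOR_KEEP;
    }
    if (!wanted) {
        return INITIATOR_STOP;
    }
    if (now_ms - started_at_ms >= max_age_ms) {
        /* The attempt has been outstanding too long: stop and start again. */
        return INITIATOR_RESTART;
    }
    return INITIATOR_KEEP;
}
```

For example, with an assumed `max_age_ms` of 10 minutes, an attempt started at uptime 0 is left alone until 600000 ms and restarted from then on. Whether such a periodic restart is reliable is exactly what the request below asks Nordic to confirm.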

Request

I would like a Nordic engineer to investigate this issue and determine whether this is a bug in the SoftDevice Controller (SDC) that requires a fix, or if the observed behavior is expected. If it is expected or if a fix will not be provided, I request confirmation that the workaround (periodically stopping and restarting `bt_conn_le_create_auto()`) is expected to work reliably given your insights into the SDC internals.

Thank you for your support.

0001-Sample-to-reproduce-error-with-long-running-bt_conn_.patch

  • Hi Hung,

    `bt_conn_le_create_auto()` only connects to one device in the FAL after which it needs to be restarted.


    Indeed, `bt_conn_le_create_auto()` is called every 5 seconds in the sample I provided. That is only to keep the sample as simple as possible. One could easily modify the sample to run the status printing task in a delayable work item or similar and remove the timeout from the `k_sem_take` in the main loop; the bug would still persist. In our production application we only refresh the call to `bt_conn_le_create_auto()` when needed. `bt_conn_le_create_auto()` does not seem to be negatively affected by being called while a connection attempt is already ongoing; it just returns `-EALREADY`. The sample works as expected, connecting and reconnecting without issues, unless you wait 50 minutes after the last connect event, after which the bug can be triggered.

    `bt_le_set_auto_conn()` operates differently from `bt_conn_le_create_auto()`. For one, it does not use the FAL; in fact, it depends explicitly on FAL support being disabled. At least that was the case for the version of NCS we used when initially developing our application. In recent versions of Zephyr it appears to have been removed.

    > From what I can see this function is on the Zephyr host and not covered in our Softdevice controller. I did a quick search and it seems we have something similar reported: 

    Could you be mixing them up? I believe `bt_le_set_auto_conn()` was implemented fully in the host stack without specific controller involvement, whereas I think `bt_conn_le_create_auto()` requires some controller involvement because of the way the FAL is used.


    One of my reasons for suspecting the SDC is that after triggering this bug in our production firmware I have observed failed assertions in the IPC layer (the specifics elude me, but it was something about `net_buf` flags not being 0). Hitting that assert is not necessary for the controller to drop all connections, though. It is just an additional way in which the controller firmware seems to break.

