Long running bt_conn_le_create_auto() breaks Bluetooth controller (nRF5340)

Hi,

Background

For some time now I have been troubled by a bug in one of our products acting as a BLE central device. The bug manifests itself as the central reporting that it is connected to several peripherals, but it receives no data from the peripherals and no disconnect events are generated when the peripherals are power-cycled. The peripherals themselves advertise, indicating that they are in fact not connected to any central.

The central works by receiving a list of assigned periphal devices to which it maintains connections at all times. All peripherals are added to the filter accept list (FAL) upon assignment, and a connection supervision routine calls `bt_conn_le_create_auto()` when needed to connect/reconnect to the assigned peripherals.

Our naive use of the FAL and `bt_conn_le_create_auto()` means that if a device is assigned, but physically unavailable, we will forever have an outstanding `bt_conn_le_create_auto()` attempting to establish a connection to that unavailable node. This long running `bt_conn_le_create_auto()` attempt is what is causing the failure.

Description and How to Reproduce

I have found a small set of steps that can reliably reproduce the bug:

1. Set up FAL of to contain 3 addresses:
   - two devices running the peripheral_hr Zephyr sample on nRF5340 hardware
   - one address corresponding to a physically unavailable device
2. Continually call `bt_conn_le_create_auto()` whenever a device disconnects or reconnects.
3. Wait 50 minutes then power-cycle one of the peripherals.
4. The central will notice a disconnection event for the power-cycled device, but the second device will be silently dropped and outstanding `bt_conn_le_create_auto()` initiation attempt will not be able to connect to any more devices.
   
In the resulting state, our experience is that the BT controller will no longer function as expected and many BT operations results in assertion errors. This part is not demonstrated by the included sample.

See attached a sample application for nRF5340DK using latest NCS release (v3.2.4). See attached patch file (apply to the Zephyr repo at tag `ncs-v3.2.4`).

Additional Information

Our production application differs from the sample in significant ways. It's built on an older version of NCS (2.5.1), uses MPSL to run a proprietery protocol in addition to BLE, uses different connection and scan intervals. The common denominator is the long outstanding `bt_conn_le_create_auto()`. Still it takes 50 minutes (exact figure unknown but 40 minutes is too short) before the bug can be triggered.

It seems the problom can be worked around by avoiding long running `bt_conn_le_create_auto()` requests. By stopping with `bt_conn_create_auto_stop()` and restarting, the bug disappears. Calling `bt_conn_create_auto_stop()` and resuming works to avoid the bug even if 50 minutes has already passed.

Request

I would like a Nordic engineer to investigate this issue and determine whether this is a bug in the SoftDevice Controller (SDC) that requires a fix, or if the observed behavior is expected. If it is expected or if a fix will not be provided, I request confirmation that the workaround (periodically stopping and restarting `bt_conn_le_create_auto()`) is expected to work reliably given your insights into the SDC internals.

Thank you for your support.

0001-Sample-to-reproduce-error-with-long-running-bt_conn_.patch

Parents
  • Hi Alexander, 

    I tried a few times but haven't managed to reproduce the issue. Could you check the log and see if there is anything I should change ? I waited more than 50 minutes (also tried again after 3 hours). But it didn't seem to get to the state that it couldn't reconnect. 

    [00:58:50.890,869] <inf> fal_central: Status:
    
    [00:58:50.890,960] <inf> fal_central: ED:16:9F:CF:23:DB (random): connected
    
    [00:58:50.891,021] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:58:50.891,113] <inf> fal_central: FF:FF:FF:FF:FF:FF (public): connecting
    
    [00:58:55.891,204] <inf> fal_central: Status:
    
    [00:58:55.891,296] <inf> fal_central: ED:16:9F:CF:23:DB (random): connected
    
    [00:58:55.891,357] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:58:55.891,448] <inf> fal_central: FF:FF:FF:FF:FF:FF (public): connecting
    
    [00:58:59.008,636] <inf> fal_central: Disconnected: ED:16:9F:CF:23:DB (random) (reason 0x08)
    
    [00:58:59.008,758] <inf> fal_central: Status:
    [00:58:59.008,850] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:58:59.008,911] <inf> fal_central: FF:FF:FF:FF:FF:FF (public): connecting
    
    [00:59:04.009,002] <inf> fal_central: Status:
    
    [00:59:04.009,094] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:59:18.399,963] <inf> fal_central: Status:
    :FF:FF (public): connecti
    [00:59:18.400,054] <inf> fal_central: FF:FF:FF:FF:FF:FF (public): connecting
    9.009,246] <inf> fal_central: Status:
    
    [00:59:18.400,115] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:59:18.400,207] <inf> fal_central: ED:16:9F:CF:23:DB (random): connected
    
    
    [00:59:23.400,299] <inf> fal_central: Status:
    :16:9F:CF:23:DB (random)
    [00:59:23.400,390] <inf> fal_central: FF:FF:FF:FF:FF:FF (public): connecting
    3.398,712] <inf> fal_central: Status:
    
    [00:59:23.400,451] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:59:23.400,543] <inf> fal_central: ED:16:9F:CF:23:DB (random): connected
    
    [00:59:28.400,634] <inf> fal_central: Status:
    started
    
    [00:59:28.400,726] <inf> fal_central: FF:FF:FF:FF:FF:FF (public): connecting
    
    [00:59:28.400,787] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:59:28.400,878] .0m<inf> fal_central: ED:16:9F:CF:23:DB (random): connected

    I haven't looked deep in the code, but not sure why FF:FF:FF:FF:FF:FF is the ghost address that it's trying to connect to. 
    Have you reproduced the issue on all board/DK you used or only on some of them ? I was using 2 nRF54L DK as the peripheral. Not so sure if they have to be nRF5340. 

Reply
  • Hi Alexander, 

    I tried a few times but haven't managed to reproduce the issue. Could you check the log and see if there is anything I should change ? I waited more than 50 minutes (also tried again after 3 hours). But it didn't seem to get to the state that it couldn't reconnect. 

    [00:58:50.890,869] <inf> fal_central: Status:
    
    [00:58:50.890,960] <inf> fal_central: ED:16:9F:CF:23:DB (random): connected
    
    [00:58:50.891,021] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:58:50.891,113] <inf> fal_central: FF:FF:FF:FF:FF:FF (public): connecting
    
    [00:58:55.891,204] <inf> fal_central: Status:
    
    [00:58:55.891,296] <inf> fal_central: ED:16:9F:CF:23:DB (random): connected
    
    [00:58:55.891,357] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:58:55.891,448] <inf> fal_central: FF:FF:FF:FF:FF:FF (public): connecting
    
    [00:58:59.008,636] <inf> fal_central: Disconnected: ED:16:9F:CF:23:DB (random) (reason 0x08)
    
    [00:58:59.008,758] <inf> fal_central: Status:
    [00:58:59.008,850] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:58:59.008,911] <inf> fal_central: FF:FF:FF:FF:FF:FF (public): connecting
    
    [00:59:04.009,002] <inf> fal_central: Status:
    
    [00:59:04.009,094] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:59:18.399,963] <inf> fal_central: Status:
    :FF:FF (public): connecti
    [00:59:18.400,054] <inf> fal_central: FF:FF:FF:FF:FF:FF (public): connecting
    9.009,246] <inf> fal_central: Status:
    
    [00:59:18.400,115] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:59:18.400,207] <inf> fal_central: ED:16:9F:CF:23:DB (random): connected
    
    
    [00:59:23.400,299] <inf> fal_central: Status:
    :16:9F:CF:23:DB (random)
    [00:59:23.400,390] <inf> fal_central: FF:FF:FF:FF:FF:FF (public): connecting
    3.398,712] <inf> fal_central: Status:
    
    [00:59:23.400,451] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:59:23.400,543] <inf> fal_central: ED:16:9F:CF:23:DB (random): connected
    
    [00:59:28.400,634] <inf> fal_central: Status:
    started
    
    [00:59:28.400,726] <inf> fal_central: FF:FF:FF:FF:FF:FF (public): connecting
    
    [00:59:28.400,787] <inf> fal_central: C1:0C:F6:A3:A4:66 (random): connected
    
    [00:59:28.400,878] .0m<inf> fal_central: ED:16:9F:CF:23:DB (random): connected

    I haven't looked deep in the code, but not sure why FF:FF:FF:FF:FF:FF is the ghost address that it's trying to connect to. 
    Have you reproduced the issue on all board/DK you used or only on some of them ? I was using 2 nRF54L DK as the peripheral. Not so sure if they have to be nRF5340. 

Children
  • Hi again,

    Thanks for testing. The log you sent is a little bit broken up even after stripping out terminal control codes. Can you confirm that you disconnected power to one device and then returned power to it while the other device had its connection unbroken?

    This is incredibly frustrating. In my tests the 50 minute wait has had a 100% success rate in reproducing the error across both the nRF5340DK and our own hardware.

    > I haven't looked deep in the code, but not sure why FF:FF:FF:FF:FF:FF is the ghost address that it's trying to connect to.


    The FF:FF:FF:FF:FF:FF is a placeholder address used for the bt_conn object which is allocated as a placeholder by the Bluetooth host stack for the pending connection. Because we're using the FAL to connect to an unknown address it needs to use a placeholder address. We see it because bt_conn_foreach loops over all allocated bt_conn objects, not just the connected ones.

    These are the versions printed on boot by the board that I used last to test this. Can you compare with your output to see if there are differences?


    *** Booting nRF Connect SDK v3.2.4-4c3fc0d44534 ***
    *** Using Zephyr OS v4.2.99-cb8129641703 ***
    [00:00:00.287,170] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
    [00:00:00.287,200] <inf> bt_hci_core: HW Variant: nRF53x (0x0003)
    [00:00:00.287,231] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 197.47763 Build 2370639017
    [00:00:00.289,154] <inf> bt_hci_core: HCI transport: IPC
    [00:00:00.289,245] <inf> bt_hci_core: Identity: C0:64:42:19:94:69 (random)
    [00:00:00.289,276] <inf> bt_hci_core: HCI: version 5.4 (0x0d) revision 0x2102, manufacturer 0x0059
    [00:00:00.289,306] <inf> bt_hci_core: LMP: version 5.4 (0x0d) subver 0x2102
    [00:00:00.289,306] <inf> fal_central: Bluetooth initialised
    [00:00:00.290,191] <inf> fal_central: FAL[0] D9:BE:DD:51:E8:F5  (real)
    [00:00:00.290,649] <inf> fal_central: FAL[1] E4:21:B1:13:70:F3  (real)
    [00:00:00.291,107] <inf> fal_central: FAL[2] DE:AD:DE:AD:DE:AD  (ghost)
    [00:00:00.291,839] <inf> fal_central: FAL initiator started
    [00:00:05.291,900] <inf> fal_central: Status:
    [00:00:05.291,992] <inf> fal_central:   FF:FF:FF:FF:FF:FF (public): connecting
    [00:00:10.292,083] <inf> fal_central: Status:
    [00:00:10.292,175] <inf> fal_central:   FF:FF:FF:FF:FF:FF (public): connecting

  • Hi Alexander, 
    Please see the log with better formatting. You can find that 2 devices connected at the beginning and then I turned the power off on one of the board, resulting a disconnect. Only one is connected at this time. Then I turned it back on after a while and it get connected back on. 

    [04:21:33.639,678] <inf> fal_central: Status:                                   
    [04:21:33.639,770] <inf> fal_central:   C1:0C:F6:A3:A4:66 (random): connected   
    [04:21:33.639,831] <inf> fal_central:   ED:16:9F:CF:23:DB (random): connected   
    [04:21:33.639,923] <inf> fal_central:   FF:FF:FF:FF:FF:FF (public): connecting  
    [04:21:38.640,014] <inf> fal_central: Status:                                   
    [04:21:38.640,106] <inf> fal_central:   C1:0C:F6:A3:A4:66 (random): connected   
    [04:21:38.640,167] <inf> fal_central:   ED:16:9F:CF:23:DB (random): connected   
    [04:21:38.640,258] <inf> fal_central:   FF:FF:FF:FF:FF:FF (public): connecting  
    [04:21:42.461,151] <inf> fal_central: Disconnected: C1:0C:F6:A3:A4:66 (random) (
    reason 0x08)                                                                    
    [04:21:42.461,242] <inf> fal_central: Status:                                   
    [04:21:42.461,334] <inf> fal_central:   ED:16:9F:CF:23:DB (random): connected   
    [04:21:42.461,395] <inf> fal_central:   FF:FF:FF:FF:FF:FF (public): connecting  
    [04:21:47.461,517] <inf> fal_central: Status:                                   
    [04:21:47.461,608] <inf> fal_central:   ED:16:9F:CF:23:DB (random): connected   
    [04:21:47.461,669] <inf> fal_central:   FF:FF:FF:FF:FF:FF (public): connectng   
    [04:21:52.461,761] <inf> fal_central: Status:                                   
    [04:21:52.461,853] <inf> fal_central:   ED:16:9F:CF:23:DB (random): connected   
    [04:21:52.461,914] <inf> fal_central:   FF:FF:FF:FF:FF:FF (public): connecting  
    [04:21:52.786,865] <inf> fal_central: Connected: C1:0C:F6:A3:A4:66 (random)     
    [04:21:52.787,017] <inf> fal_central: Status:                                   
    [04:21:52.787,109] <inf> fal_central:   ED:16:9F:CF:23:DB (random): connected   
    [04:21:52.787,170] <inf> fal_central:   C1:0C:F6:A3:A4:66 (random): connected   
    [04:21:52.788,208] <inf> fal_central: FAL initiator started                     
    [04:21:57.788,299] <inf> fal_central: Status:                                   
    [04:21:57.788,391] <inf> fal_central:   FF:FF:FF:FF:FF:FF (public): connecting  
    [04:21:57.788,452] <inf> fal_central:   ED:16:9F:CF:23:DB (random): connected   
    [04:21:57.788,543] <inf> fal_central:   C1:0C:F6  3:A4:66 (random): connected   
    [04:22:02.788,635] <inf> fal_central: Status:                                   
    [04:22:02.788,726] <inf> fal_central:   FF:FF:FF:FF:FF:FF (public): connecting  
    [04:22:02.788,787] <inf> fal_central:   ED:16:9F:CF:23:DB (random): connected   
    [04:22:02.788,879] <inf> fal_central:   C1:0C:F6:A3:A4:66 (random): connected  
    


    Here is the log when booting it up: 
    *** Booting nRF Connect SDK v3.2.4-4c3fc0d44534 ***                             
    *** Using Zephyr OS v4.2.99-9673eec75908 ***                                    
    [00:00:00.287,811] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
    [00:00:00.287,841] <inf> bt_hci_core: HW Variant: nRF53x (0x0003)               
    [00:00:00.287,872] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0
    x00) Version 130.51898 Build 1015905744                                         
    [00:00:00.290,405] <inf> bt_hci_core: HCI transport: IPC                        
    [00:00:00.290,496] <inf> bt_hci_core: Identity: C5:B3:9B:EE:C7:61 (random)      
    [00:00:00.290,527] <inf> bt_hci_core: HCI: version 6.2 (0x10) revision 0x207e, m
    anufacturer 0x0059                                                              
    [00:00:00.290,557] <inf> bt_hci_core: LMP: version 6.2 (0x10) subver 0x207e     
    [00:00:00.290,557] <inf> fal_central: Bluetooth initialised                     
    [00:00:00.291,717] <inf> fal_central: FAL[0] ED:16:9F:CF:23:DB  (real)          
    [00:00:00.292,327] <inf> fal_central: FAL[1] C1:0C:F6:A3:A4:66  (real)          
    [00:00:00.292,938] <inf> fal_central: FAL[2] DE:AD:DE:AD:DE:AD  (ghost)         
    [00:00:00.293,701] <inf> fal_cennral: FAL initiator started                     
    [00:00:01.022,613] <inf> fal_central: Connected: ED:16:9F:CF:23:DB (random)     
    [00:00:01.022,796] <inf> fal_central: Status:                                   
    [00:00:01.022,888] <inf> fal_central:   ED:16:9F:CF:23:DB (random): connected   
    [00:00:01.024,261] <inf> fal_central: FAL initiator started                     
    [00:00:01.161,376] <inf> fal_central: Connected: C1:0C:F6:A3:A4:66 (random)     
    [00:00:01.161,529] <inf> fal_central: Status:                                   
    [00:00:01.161,621] <inf> fal_central:   ED:16:9F:CF:23:DB (random): connected   
    [00:00:01.161,712] <inf> fal_central:   C1:0C:F6:A3:A4:66 (random): connected   
    [00:00:01.162,719] <inf> fal_central: FAL initiator started   
    
    *** Booting nRF Connect SDK v3.2.4-4c3fc0d44534 ***                             
    *** Using Zephyr OS v4.2.99-9673eec75908 ***                                    
    [00:00:00.287,811] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
    [00:00:00.287,841] <inf> bt_hci_core: HW Variant: nRF53x (0x0003)               
    [00:00:00.287,872] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0
    x00) Version 130.51898 Build 1015905744                                         
    [00:00:00.290,405] <inf> bt_hci_core: HCI transport: IPC                        
    [00:00:00.290,496] <inf> bt_hci_core: Identity: C5:B3:9B:EE:C7:61 (random)      
    [00:00:00.290,527] <inf> bt_hci_core: HCI: version 6.2 (0x10) revision 0x207e, m
    anufacturer 0x0059                                                              
    [00:00:00.290,557] <inf> bt_hci_core: LMP: version 6.2 (0x10) subver 0x207e     
    [00:00:00.290,557] <inf> fal_central: Bluetooth initialised                     
    [00:00:00.291,717] <inf> fal_central: FAL[0] ED:16:9F:CF:23:DB  (real)          
    [00:00:00.292,327] <inf> fal_central: FAL[1] C1:0C:F6:A3:A4:66  (real)          
    [00:00:00.292,938] <inf> fal_central: FAL[2] DE:AD:DE:AD:DE:AD  (ghost)         
    [00:00:00.293,701] <inf> fal_cennral: FAL initiator started                     
    [00:00:01.022,613] <inf> fal_central: Connected: ED:16:9F:CF:23:DB (random)     
    [00:00:01.022,796] <inf> fal_central: Status:                                   
    [00:00:01.022,888] <inf> fal_central:   ED:16:9F:CF:23:DB (random): connected   
    [00:00:01.024,261] <inf> fal_central: FAL initiator started                     
    [00:00:01.161,376] <inf> fal_central: Connected: C1:0C:F6:A3:A4:66 (random)     
    [00:00:01.161,529] <inf> fal_central: Status:                                   
    [00:00:01.161,621] <inf> fal_central:   ED:16:9F:CF:23:DB (random): connected   
    [00:00:01.161,712] <inf> fal_central:   C1:0C:F6:A3:A4:66 (random): connected   
    [00:00:01.162,719] <inf> fal_central: FAL initiator started   
    


    There is some difference in the HCI version and LMP version. Could you give more information about the ipc_radio firmware you used for the netcore ? What configuration ? I don't see the netcore image included in your fal_central project. 

  • Hi,

    I made an error when building the sample in the NCS environment which I set up for this purpose. Previously I have been working in an old version of NCS which does not use sysbuild. So I am used to hci_rpmsg being built as a child image by default when Bluetooth is enabled. I didn't notice that my tests did not build and flash a netcore image, so I have been running with a hci_rpmsg app built from NCS 2.5.1 in the netcore.

    I copied `Kconfig.sysbuild` from one of the projects under nrf/samples/bluetooth and re-built. Now my HCI and LMP version matches yours.

    I will repeat the experiment and report back. I expect it will work without error, same as it did for you.

    My apologies.

Related