BLE scan stalls on nRF52840 after several days of successful scan (NCS 2.9.0) (no connections)

We have an nRF52840 device that is required to do continuous BLE scanning in central role to find nearby advertising beacons and is never required to establish a connection with a peripheral. 

We make use of NCS SDK 2.9.0 and the nRF52840dk for this purpose.

Here are the scan parameters:

struct bt_le_scan_param scan_param = {
		.type       = BT_LE_SCAN_TYPE_ACTIVE,
		.options    = BT_LE_SCAN_OPT_NONE,
		.interval   = BT_GAP_SCAN_FAST_INTERVAL,
		.window     = BT_GAP_SCAN_FAST_WINDOW,
        .interval_coded = BT_GAP_SCAN_FAST_INTERVAL,
        .window_coded = BT_GAP_SCAN_FAST_WINDOW,
        .timeout = 0
};

struct bt_scan_init_param scan_init = {
        .connect_if_match = 0,
		.scan_param = &scan_param,
		.conn_param = NULL
};

In the application we have added one filter for complete device name and one filter for manufacturer specific payload with a specific company ID. These settings, however, apparently have no impact on the issue at hand, since the BLE scanning seems to stall completely even at the driver level: HCI layer or bt_scan.

The issue happens after many days of autonomy (~3 days or more), where the BLE scan suddenly stalls, but the application is still running (no crash).  

To investigate this issue we activated the log messages with DEBUG level for HCI Driver and BT scan using the following configurations:

CONFIG_BT_HCI_DRIVER_LOG_LEVEL_DBG=y

CONFIG_BT_SCAN_LOG_LEVEL_DBG=y
This produces tons of log messages in the RTT console, since we have many Bluetooth advertising devices/beacons present in our office. The logs look like this
00> [04:17:03.689,910] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:17:03.689,910] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 59
00> [04:17:03.689,941] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:17:03.689,941] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:17:03.689,971] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:17:03.690,063] <dbg> bt_scan: le_adv_recv: 38:F5:F0:96:53:D3 (random) event 3, len 31, rssi -82 dBm
00> [04:17:03.694,213] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (51)
00> [04:17:03.694,244] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 53
00> [04:17:03.694,335] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:17:03.694,335] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 53
00> [04:17:03.694,366] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:17:03.694,366] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:17:03.694,396] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:17:03.694,488] <dbg> bt_scan: le_adv_recv: E7:88:CC:90:27:E7 (random) event 3, len 25, rssi -75 dBm
00> [04:17:03.698,822] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (43)
00> [04:17:03.698,852] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 45
00> [04:17:03.698,944] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:17:03.698,944] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 45
00> [04:17:03.698,974] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:17:03.698,974] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:17:03.699,005] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:17:03.699,127] <dbg> bt_scan: le_adv_recv: 74:11:20:42:9B:EF (random) event 0, len 17, rssi -58 dBm
00> [04:17:03.699,401] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (26)
00> [04:17:03.699,432] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 28
00> [04:17:03.699,523] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:17:03.699,523] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 28
00> [04:17:03.699,554] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:17:03.699,554] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:17:03.699,584] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:17:03.699,676] <dbg> bt_scan: le_adv_recv: 74:11:20:42:9B:EF (random) event 4, len 0, rssi -58 dBm
00> [04:17:03.702,117] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (43)
00> [04:17:03.702,148] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 45
After about 3 days, the logs stop completely for both HCI core and the BT scan without any warning or error messages in the logs. Here are the logs when the radio becomes suddenly silent (nothing seems out of ordinary):
[04:38:20.135,437] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:38:20.135,437] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:38:20.135,559] <dbg> bt_scan: le_adv_recv: 66:1E:27:CB:49:D7 (random) event 0, len 18, rssi -65 dBm
00> [04:38:20.135,864] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (26)
00> [04:38:20.135,894] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 28
00> [04:38:20.135,986] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:38:20.135,986] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 28
00> [04:38:20.136,016] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:38:20.136,016] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:38:20.136,047] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:38:20.136,138] <dbg> bt_scan: le_adv_recv: 66:1E:27:CB:49:D7 (random) event 4, len 0, rssi -65 dBm
00> [04:38:20.142,547] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (57)
00> [04:38:20.142,547] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 59
00> [04:38:20.142,639] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:38:20.142,639] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 59
00> [04:38:20.142,669] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:38:20.142,669] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:38:20.142,700] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:38:20.142,791] <dbg> bt_scan: le_adv_recv: 0F:26:49:3E:DC:16 (random) event 3, len 31, rssi -69 dBm
00> [04:38:20.144,073] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (57)
00> [04:38:20.144,104] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 59
00> [04:38:20.144,165] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:38:20.144,195] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 59
00> [04:38:20.144,195] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:38:20.144,195] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:38:20.144,226] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:38:20.144,348] <dbg> bt_scan: le_adv_recv: 19:98:96:FA:BF:96 (random) event 3, len 31, rssi -72 dBm
When the radio scan becomes silent, in the application we wait for 10 minutes, and then we try stopping the scan: bt_le_scan_stop() and then restarting it:
bt_scan_start(BT_LE_SCAN_TYPE_ACTIVE)
Both stop/start scan functions returned an OK "0" response. We can also see the "Scanning" debug message appearing in the nRF logs as the last message indicating that the scan is restarted successfully. The HCI layer also displays relevant messages as shown below (representing the scan stop and start function calls):
[04:48:14.201,171] <wrn> protocol_thread: No BLE packet for a while, restarting scan...
00> [04:48:14.201,202] <dbg> bt_hci_core: bt_hci_cmd_create: opcode 0x2042 param_len 6
00> [04:48:14.201,202] <dbg> bt_hci_core: bt_hci_cmd_create: buf 0x200166b8
00> [04:48:14.201,232] <dbg> bt_hci_core: bt_hci_cmd_send_sync: buf 0x200166b8 opcode 0x2042 len 9
00> [04:48:14.201,232] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.201,263] <dbg> bt_hci_core: tx_processor: TX process start
00> [04:48:14.201,293] <dbg> bt_hci_core: hci_core_send_cmd: fetch cmd
00> [04:48:14.201,324] <dbg> bt_hci_core: hci_core_send_cmd: Sending command 0x2042 (buf 0x200166b8) to driver
00> [04:48:14.201,324] <dbg> bt_hci_core: bt_send: buf 0x200166b8 len 9 type 0
00> [04:48:14.201,354] <dbg> bt_sdc_hci_driver: hci_driver_send: 
00> [04:48:14.201,354] <dbg> bt_sdc_hci_driver: cmd_handle: 
00> [04:48:14.202,087] <dbg> bt_sdc_hci_driver: hci_driver_send: Exit: 0
00> [04:48:14.202,117] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.202,178] <dbg> bt_sdc_hci_driver: event_packet_process: Command Complete (0x2042) status: 0x00, ncmd: 1, len 4
00> [04:48:14.202,209] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016494 len 6
00> [04:48:14.202,209] <dbg> bt_hci_core: hci_cmd_complete: opcode 0x2042
00> [04:48:14.202,239] <dbg> bt_hci_core: hci_cmd_done: opcode 0x2042 status 0x00  buf 0x20016494
00> [04:48:14.202,270] <dbg> bt_hci_core: hci_cmd_done: sync cmd released
00> [04:48:14.202,301] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.202,362] <dbg> bt_hci_core: tx_processor: TX process start
00> [04:48:14.202,392] <dbg> bt_hci_core: bt_hci_cmd_send_sync: rsp 0x200166b8 opcode 0x2042 len 1
00> [04:48:14.202,423] <dbg> ble_central: BLE_CENTRAL_Scan: Scan OFF
00> [04:48:14.202,453] <dbg> bt_hci_core: bt_hci_cmd_create: opcode 0x2041 param_len 8
00> [04:48:14.202,484] <dbg> bt_hci_core: bt_hci_cmd_create: buf 0x200166b8
00> [04:48:14.202,484] <dbg> bt_hci_core: bt_hci_cmd_send_sync: buf 0x200166b8 opcode 0x2041 len 11
00> [04:48:14.202,514] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.202,545] <dbg> bt_hci_core: tx_processor: TX process start
00> [04:48:14.202,545] <dbg> bt_hci_core: hci_core_send_cmd: fetch cmd
00> [04:48:14.202,575] <dbg> bt_hci_core: hci_core_send_cmd: Sending command 0x2041 (buf 0x200166b8) to driver
00> [04:48:14.202,575] <dbg> bt_hci_core: bt_send: buf 0x200166b8 len 11 type 0
00> [04:48:14.202,606] <dbg> bt_sdc_hci_driver: hci_driver_send: 
00> [04:48:14.202,606] <dbg> bt_sdc_hci_driver: cmd_handle: 
00> [04:48:14.202,636] <dbg> bt_sdc_hci_driver: hci_driver_send: Exit: 0
00> [04:48:14.202,667] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.202,697] <dbg> bt_sdc_hci_driver: event_packet_process: Command Complete (0x2041) status: 0x00, ncmd: 1, len 4
00> [04:48:14.202,728] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016494 len 6
00> [04:48:14.202,758] <dbg> bt_hci_core: hci_cmd_complete: opcode 0x2041
00> [04:48:14.202,789] <dbg> bt_hci_core: hci_cmd_done: opcode 0x2041 status 0x00  buf 0x20016494
00> [04:48:14.202,789] <dbg> bt_hci_core: hci_cmd_done: sync cmd released
00> [04:48:14.202,819] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.202,880] <dbg> bt_hci_core: tx_processor: TX process start
00> [04:48:14.202,941] <dbg> bt_hci_core: bt_hci_cmd_send_sync: rsp 0x200166b8 opcode 0x2041 len 1
00> [04:48:14.202,941] <dbg> bt_hci_core: bt_hci_cmd_create: opcode 0x2042 param_len 6
00> [04:48:14.202,972] <dbg> bt_hci_core: bt_hci_cmd_create: buf 0x200166b8
00> [04:48:14.203,002] <dbg> bt_hci_core: bt_hci_cmd_send_sync: buf 0x200166b8 opcode 0x2042 len 9
00> [04:48:14.203,002] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.203,033] <dbg> bt_hci_core: tx_processor: TX process start
00> [04:48:14.203,063] <dbg> bt_hci_core: hci_core_send_cmd: fetch cmd
00> [04:48:14.203,063] <dbg> bt_hci_core: hci_core_send_cmd: Sending command 0x2042 (buf 0x200166b8) to driver
00> [04:48:14.203,094] <dbg> bt_hci_core: bt_send: buf 0x200166b8 len 9 type 0
00> [04:48:14.203,094] <dbg> bt_sdc_hci_driver: hci_driver_send: 
00> [04:48:14.203,125] <dbg> bt_sdc_hci_driver: cmd_handle: 
00> [04:48:14.203,216] <dbg> bt_sdc_hci_driver: hci_driver_send: Exit: 0
00> [04:48:14.203,247] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.203,277] <dbg> bt_sdc_hci_driver: event_packet_process: Command Complete (0x2042) status: 0x00, ncmd: 1, len 4
00> [04:48:14.203,308] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016494 len 6
00> [04:48:14.203,308] <dbg> bt_hci_core: hci_cmd_complete: opcode 0x2042
00> [04:48:14.203,369] <dbg> bt_hci_core: hci_cmd_done: opcode 0x2042 status 0x00  buf 0x20016494
00> [04:48:14.203,369] <dbg> bt_hci_core: hci_cmd_done: sync cmd released
00> [04:48:14.203,399] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.203,460] <dbg> bt_hci_core: tx_processor: TX process start
00> [04:48:14.203,491] <dbg> bt_hci_core: bt_hci_cmd_send_sync: rsp 0x200166b8 opcode 0x2042 len 1
00> [04:48:14.203,521] <dbg> nrf_bt_scan: bt_scan_start: Scanning

While the scan is restarted with success, the radio remains silent and no further logs are captured neither by the HCI driver nor by the bt_scan. No more debug messages are displayed. We repeated this process 7 times (stop/start scan) every 10 minutes, and the behavior was the same with the radio remaining silent and not capturing any nearby beacons, however, the API calls to restart scan (stop/start) are both successful with relevant debug messages in HCI driver and bt_scan layers.

The recovery that works at that point would be a soft reboot of the MCU.

Some additional info: we had only one beacon with the matching scan filters in the office, but dozens of other Bluetooth devices, PCs and beacons are present in our office, which shall keep the low level HCI and bt_scan logs active at all times. This has been the case for the 3 days prior to the occurrence of the blockage where several gigabytes of HCI and bt_scan log messages were produced per day. When the radio became completely silent, we also confirmed using the nRF Connect mobile application that our beacon was still functional and advertising.

We would like to emphasize that the behavior is not easy to reproduce. Any help would be appreciated here. Thank you.

Related