BLE scan stalls on nRF52840 after several days of successful scan (NCS 2.9.0) (no connections)

We have an nRF52840 device that is required to do continuous BLE scanning in central role to find nearby advertising beacons and is never required to establish a connection with a peripheral. 

We make use of NCS SDK 2.9.0 and the nRF52840dk for this purpose.

Here are the scan parameters:

struct bt_le_scan_param scan_param = {
		.type       = BT_LE_SCAN_TYPE_ACTIVE,
		.options    = BT_LE_SCAN_OPT_NONE,
		.interval   = BT_GAP_SCAN_FAST_INTERVAL,
		.window     = BT_GAP_SCAN_FAST_WINDOW,
        .interval_coded = BT_GAP_SCAN_FAST_INTERVAL,
        .window_coded = BT_GAP_SCAN_FAST_WINDOW,
        .timeout = 0
};

struct bt_scan_init_param scan_init = {
        .connect_if_match = 0,
		.scan_param = &scan_param,
		.conn_param = NULL
};

In the application we have added one filter for complete device name and one filter for manufacturer specific payload with a specific company ID. These settings, however, apparently have no impact on the issue at hand, since the BLE scanning seems to stall completely even at the driver level: HCI layer or bt_scan.

The issue happens after many days of autonomy (~3 days or more), where the BLE scan suddenly stalls, but the application is still running (no crash).  

To investigate this issue we activated the log messages with DEBUG level for HCI Driver and BT scan using the following configurations:

CONFIG_BT_HCI_DRIVER_LOG_LEVEL_DBG=y

CONFIG_BT_SCAN_LOG_LEVEL_DBG=y
This produces tons of log messages in the RTT console, since we have many Bluetooth advertising devices/beacons present in our office. The logs look like this
00> [04:17:03.689,910] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:17:03.689,910] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 59
00> [04:17:03.689,941] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:17:03.689,941] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:17:03.689,971] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:17:03.690,063] <dbg> bt_scan: le_adv_recv: 38:F5:F0:96:53:D3 (random) event 3, len 31, rssi -82 dBm
00> [04:17:03.694,213] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (51)
00> [04:17:03.694,244] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 53
00> [04:17:03.694,335] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:17:03.694,335] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 53
00> [04:17:03.694,366] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:17:03.694,366] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:17:03.694,396] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:17:03.694,488] <dbg> bt_scan: le_adv_recv: E7:88:CC:90:27:E7 (random) event 3, len 25, rssi -75 dBm
00> [04:17:03.698,822] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (43)
00> [04:17:03.698,852] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 45
00> [04:17:03.698,944] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:17:03.698,944] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 45
00> [04:17:03.698,974] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:17:03.698,974] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:17:03.699,005] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:17:03.699,127] <dbg> bt_scan: le_adv_recv: 74:11:20:42:9B:EF (random) event 0, len 17, rssi -58 dBm
00> [04:17:03.699,401] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (26)
00> [04:17:03.699,432] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 28
00> [04:17:03.699,523] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:17:03.699,523] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 28
00> [04:17:03.699,554] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:17:03.699,554] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:17:03.699,584] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:17:03.699,676] <dbg> bt_scan: le_adv_recv: 74:11:20:42:9B:EF (random) event 4, len 0, rssi -58 dBm
00> [04:17:03.702,117] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (43)
00> [04:17:03.702,148] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 45
After about 3 days, the logs stop completely for both HCI core and the BT scan without any warning or error messages in the logs. Here are the logs when the radio becomes suddenly silent (nothing seems out of ordinary):
[04:38:20.135,437] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:38:20.135,437] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:38:20.135,559] <dbg> bt_scan: le_adv_recv: 66:1E:27:CB:49:D7 (random) event 0, len 18, rssi -65 dBm
00> [04:38:20.135,864] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (26)
00> [04:38:20.135,894] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 28
00> [04:38:20.135,986] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:38:20.135,986] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 28
00> [04:38:20.136,016] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:38:20.136,016] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:38:20.136,047] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:38:20.136,138] <dbg> bt_scan: le_adv_recv: 66:1E:27:CB:49:D7 (random) event 4, len 0, rssi -65 dBm
00> [04:38:20.142,547] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (57)
00> [04:38:20.142,547] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 59
00> [04:38:20.142,639] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:38:20.142,639] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 59
00> [04:38:20.142,669] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:38:20.142,669] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:38:20.142,700] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:38:20.142,791] <dbg> bt_scan: le_adv_recv: 0F:26:49:3E:DC:16 (random) event 3, len 31, rssi -69 dBm
00> [04:38:20.144,073] <dbg> bt_sdc_hci_driver: event_packet_process: LE Meta Event (0x0d), len (57)
00> [04:38:20.144,104] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016358 len 59
00> [04:38:20.144,165] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
00> [04:38:20.144,195] <dbg> bt_hci_core: rx_work_handler: buf 0x20016358 type 1 len 59
00> [04:38:20.144,195] <dbg> bt_hci_core: hci_event: event 0x3e
00> [04:38:20.144,195] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x0d
00> [04:38:20.144,226] <dbg> bt_scan: bt_hci_le_adv_ext_report: Adv number of reports 1
00> [04:38:20.144,348] <dbg> bt_scan: le_adv_recv: 19:98:96:FA:BF:96 (random) event 3, len 31, rssi -72 dBm
When the radio scan becomes silent, in the application we wait for 10 minutes, and then we try stopping the scan: bt_le_scan_stop() and then restarting it:
bt_scan_start(BT_LE_SCAN_TYPE_ACTIVE)
Both stop/start scan functions returned an OK "0" response. We can also see the "Scanning" debug message appearing in the nRF logs as the last message indicating that the scan is restarted successfully. The HCI layer also displays relevant messages as shown below (representing the scan stop and start function calls):
[04:48:14.201,171] <wrn> protocol_thread: No BLE packet for a while, restarting scan...
00> [04:48:14.201,202] <dbg> bt_hci_core: bt_hci_cmd_create: opcode 0x2042 param_len 6
00> [04:48:14.201,202] <dbg> bt_hci_core: bt_hci_cmd_create: buf 0x200166b8
00> [04:48:14.201,232] <dbg> bt_hci_core: bt_hci_cmd_send_sync: buf 0x200166b8 opcode 0x2042 len 9
00> [04:48:14.201,232] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.201,263] <dbg> bt_hci_core: tx_processor: TX process start
00> [04:48:14.201,293] <dbg> bt_hci_core: hci_core_send_cmd: fetch cmd
00> [04:48:14.201,324] <dbg> bt_hci_core: hci_core_send_cmd: Sending command 0x2042 (buf 0x200166b8) to driver
00> [04:48:14.201,324] <dbg> bt_hci_core: bt_send: buf 0x200166b8 len 9 type 0
00> [04:48:14.201,354] <dbg> bt_sdc_hci_driver: hci_driver_send: 
00> [04:48:14.201,354] <dbg> bt_sdc_hci_driver: cmd_handle: 
00> [04:48:14.202,087] <dbg> bt_sdc_hci_driver: hci_driver_send: Exit: 0
00> [04:48:14.202,117] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.202,178] <dbg> bt_sdc_hci_driver: event_packet_process: Command Complete (0x2042) status: 0x00, ncmd: 1, len 4
00> [04:48:14.202,209] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016494 len 6
00> [04:48:14.202,209] <dbg> bt_hci_core: hci_cmd_complete: opcode 0x2042
00> [04:48:14.202,239] <dbg> bt_hci_core: hci_cmd_done: opcode 0x2042 status 0x00  buf 0x20016494
00> [04:48:14.202,270] <dbg> bt_hci_core: hci_cmd_done: sync cmd released
00> [04:48:14.202,301] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.202,362] <dbg> bt_hci_core: tx_processor: TX process start
00> [04:48:14.202,392] <dbg> bt_hci_core: bt_hci_cmd_send_sync: rsp 0x200166b8 opcode 0x2042 len 1
00> [04:48:14.202,423] <dbg> ble_central: BLE_CENTRAL_Scan: Scan OFF
00> [04:48:14.202,453] <dbg> bt_hci_core: bt_hci_cmd_create: opcode 0x2041 param_len 8
00> [04:48:14.202,484] <dbg> bt_hci_core: bt_hci_cmd_create: buf 0x200166b8
00> [04:48:14.202,484] <dbg> bt_hci_core: bt_hci_cmd_send_sync: buf 0x200166b8 opcode 0x2041 len 11
00> [04:48:14.202,514] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.202,545] <dbg> bt_hci_core: tx_processor: TX process start
00> [04:48:14.202,545] <dbg> bt_hci_core: hci_core_send_cmd: fetch cmd
00> [04:48:14.202,575] <dbg> bt_hci_core: hci_core_send_cmd: Sending command 0x2041 (buf 0x200166b8) to driver
00> [04:48:14.202,575] <dbg> bt_hci_core: bt_send: buf 0x200166b8 len 11 type 0
00> [04:48:14.202,606] <dbg> bt_sdc_hci_driver: hci_driver_send: 
00> [04:48:14.202,606] <dbg> bt_sdc_hci_driver: cmd_handle: 
00> [04:48:14.202,636] <dbg> bt_sdc_hci_driver: hci_driver_send: Exit: 0
00> [04:48:14.202,667] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.202,697] <dbg> bt_sdc_hci_driver: event_packet_process: Command Complete (0x2041) status: 0x00, ncmd: 1, len 4
00> [04:48:14.202,728] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016494 len 6
00> [04:48:14.202,758] <dbg> bt_hci_core: hci_cmd_complete: opcode 0x2041
00> [04:48:14.202,789] <dbg> bt_hci_core: hci_cmd_done: opcode 0x2041 status 0x00  buf 0x20016494
00> [04:48:14.202,789] <dbg> bt_hci_core: hci_cmd_done: sync cmd released
00> [04:48:14.202,819] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.202,880] <dbg> bt_hci_core: tx_processor: TX process start
00> [04:48:14.202,941] <dbg> bt_hci_core: bt_hci_cmd_send_sync: rsp 0x200166b8 opcode 0x2041 len 1
00> [04:48:14.202,941] <dbg> bt_hci_core: bt_hci_cmd_create: opcode 0x2042 param_len 6
00> [04:48:14.202,972] <dbg> bt_hci_core: bt_hci_cmd_create: buf 0x200166b8
00> [04:48:14.203,002] <dbg> bt_hci_core: bt_hci_cmd_send_sync: buf 0x200166b8 opcode 0x2042 len 9
00> [04:48:14.203,002] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.203,033] <dbg> bt_hci_core: tx_processor: TX process start
00> [04:48:14.203,063] <dbg> bt_hci_core: hci_core_send_cmd: fetch cmd
00> [04:48:14.203,063] <dbg> bt_hci_core: hci_core_send_cmd: Sending command 0x2042 (buf 0x200166b8) to driver
00> [04:48:14.203,094] <dbg> bt_hci_core: bt_send: buf 0x200166b8 len 9 type 0
00> [04:48:14.203,094] <dbg> bt_sdc_hci_driver: hci_driver_send: 
00> [04:48:14.203,125] <dbg> bt_sdc_hci_driver: cmd_handle: 
00> [04:48:14.203,216] <dbg> bt_sdc_hci_driver: hci_driver_send: Exit: 0
00> [04:48:14.203,247] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.203,277] <dbg> bt_sdc_hci_driver: event_packet_process: Command Complete (0x2042) status: 0x00, ncmd: 1, len 4
00> [04:48:14.203,308] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x20016494 len 6
00> [04:48:14.203,308] <dbg> bt_hci_core: hci_cmd_complete: opcode 0x2042
00> [04:48:14.203,369] <dbg> bt_hci_core: hci_cmd_done: opcode 0x2042 status 0x00  buf 0x20016494
00> [04:48:14.203,369] <dbg> bt_hci_core: hci_cmd_done: sync cmd released
00> [04:48:14.203,399] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
00> [04:48:14.203,460] <dbg> bt_hci_core: tx_processor: TX process start
00> [04:48:14.203,491] <dbg> bt_hci_core: bt_hci_cmd_send_sync: rsp 0x200166b8 opcode 0x2042 len 1
00> [04:48:14.203,521] <dbg> nrf_bt_scan: bt_scan_start: Scanning

While the scan is restarted with success, the radio remains silent and no further logs are captured neither by the HCI driver nor by the bt_scan. No more debug messages are displayed. We repeated this process 7 times (stop/start scan) every 10 minutes, and the behavior was the same with the radio remaining silent and not capturing any nearby beacons, however, the API calls to restart scan (stop/start) are both successful with relevant debug messages in HCI driver and bt_scan layers.

The recovery that works at that point would be a soft reboot of the MCU.

Some additional info: we had only one beacon with the matching scan filters in the office, but dozens of other Bluetooth devices, PCs and beacons are present in our office, which shall keep the low level HCI and bt_scan logs active at all times. This has been the case for the 3 days prior to the occurrence of the blockage where several gigabytes of HCI and bt_scan log messages were produced per day. When the radio became completely silent, we also confirmed using the nRF Connect mobile application that our beacon was still functional and advertising.

We would like to emphasize that the behavior is not easy to reproduce. Any help would be appreciated here. Thank you.

Parents
  • Hi Omid, 
    Besides what Jakob suggested, I would suggest to try connect the Radio event (RX_END for example) to a GPIO using a PPI. 
    This way you can check if the radio is still doing scanning or not. You can also use the PPK2 to check the power consumption to see if the radio is active or not. 


    Do you see the same issue if you test with a stock sample in the SDK ?  or when you test your application with earlier SDK ? 

  • Thank you for the responses Jakob and Hung. I will use an example in the SDK like scan_adv, while removing the advertisement part (scan only), and attach a PPI to the radio event "NRF_RADIO_EVENT_END". I will also use PASSIVE_SCAN as done by the example provided by the SDK NCS 2.9.0. We used ACTIVE_SCAN since our target beacon is a legacy BLE device with limited payload size, and the scan response would have been helpful to add some more info to the advertisement packet. For now, we won't have to add that option. I will track the GPIO pin (corresponding to the radio events) using a logic analyzer. We should be seeing the pin toggle constantly. I would like to keep the JLINK/RTT console active via USB to track logs, so the power measurement will not be precise. I think the PPI and GPIO state already would help a lot to track this issue. I will launch the test and I would probably have to wait for a week or so to see if the blockage takes place or not. I will keep you posted. Thanks again.

  • Hi Bro,

     How's it going now? I encounter the same issue.

  • Hello Gavin,

    I have had no luck in reproducing the same behavior using the NCS scan_adv sample with PPI trace active for radio scan while deactivating the ble adv mode from the example. My nrf52840dk board after a week or so was simply going frozen with the leds are being OFF as if it lost power completely. Every time this happened I had to power cycle the board manually. This is not the same issue as the device was going completely shutdown, but it has blocked me for running such long duration tests. 

    For our project, I temporarily added a software workaround for the moment that if for an extended period of time (say 1hr) no beacon is captured by the ble scan, we perform a soft reboot on nrf52. This is not a solution, but we had to add the patch as we were on a deadline to release a version for our client.

    In my spare time I will keep investing this issue, as it is quite important for us and our clients. I will try to put more updates on this thread in the coming days.

Reply
  • Hello Gavin,

    I have had no luck in reproducing the same behavior using the NCS scan_adv sample with PPI trace active for radio scan while deactivating the ble adv mode from the example. My nrf52840dk board after a week or so was simply going frozen with the leds are being OFF as if it lost power completely. Every time this happened I had to power cycle the board manually. This is not the same issue as the device was going completely shutdown, but it has blocked me for running such long duration tests. 

    For our project, I temporarily added a software workaround for the moment that if for an extended period of time (say 1hr) no beacon is captured by the ble scan, we perform a soft reboot on nrf52. This is not a solution, but we had to add the patch as we were on a deadline to release a version for our client.

    In my spare time I will keep investing this issue, as it is quite important for us and our clients. I will try to put more updates on this thread in the coming days.

Children
No Data
Related