Beware that this post is related to an SDK in maintenance mode
More Info: Consider nRF Connect SDK for new designs

ble_advertising doesn't always start or restart after disconnect?

nRFSDK v17.0.1, S140 v7.2.0

We're using the ble_advertising nRF SDK module to control advertising. We allow one connection at a time, and as soon as the peer disconnects, we want to immediately restart advertising. We try to advertise forever; we never want to time out or stop. We're not using Mesh or anything else, just plain and boring BLE.

We're seeing units in the field that simply "go dead" on BLE while otherwise still running (they connect to the internet via WiFi and/or LTE on different coprocessors). I added a bunch of diagnostics, including the radio_notification module. I also added a safeguard that counts radio notification events: if, over a 10-minute period, it sees fewer than 30% of the radio notification "active" events it should expect (based on our advertising and connection interval settings), it logs how many it saw and then reboots.
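For reference, the arithmetic behind a safeguard like this can be sketched as below. This is a minimal illustration with hypothetical helper names (expected_adv_events, radio_is_starved), not the code running on the units. It assumes the budget comes from the advertising interval alone, expressed in 0.625 ms units, and it ignores the 0 to 10 ms random delay that BLE adds to each advertising event.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of the nRF5 SDK. */

/* Number of advertising events expected in a window, for an advertising
 * interval given in 0.625 ms units (the unit ble_adv_fast_interval uses). */
static uint32_t expected_adv_events(uint32_t window_ms, uint32_t adv_interval_units)
{
    uint64_t window_us   = (uint64_t)window_ms * 1000u;
    uint64_t interval_us = (uint64_t)adv_interval_units * 625u;
    return (uint32_t)(window_us / interval_us);
}

/* True when fewer than 30% of the expected events were observed.
 * Integer math avoids floating point on the target. */
static bool radio_is_starved(uint32_t seen, uint32_t expected)
{
    return (uint64_t)seen * 100u < (uint64_t)expected * 30u;
}
```

With a 10-minute window (600000 ms) and an interval of 874 units (546.25 ms), expected_adv_events returns 1098, which matches the denominator in the logged counts.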

The data came in! Each line below shows the number of radio notification callbacks we counted over the 10 minutes leading up to a "rescue" reboot.

radio: 250/1098 events
radio: 274/1098 events
radio: 233/1098 events
radio: 322/1098 events
radio: 236/1098 events
radio: 289/1098 events
radio: 238/1098 events
radio: 0/1098 events
radio: 295/1098 events
radio: 327/1098 events
radio: 0/1098 events

So, we're seeing two different phenomena here: the assertions that fired with close to 30% (e.g. 322/1098) are likely advertising simply not restarting after losing a connection. The assertions with 0 events mean that somehow SoftDevice never started advertising at all?

Here's how we're initializing advertising:

ble_advertising_init_t init;
memset(&init, 0, sizeof(init));

ble_uuid_t adv_uuids[] = { { OUR_SERVICE_UUID, our_uuid_type } };
init.advdata.flags = BLE_GAP_ADV_FLAGS_LE_ONLY_GENERAL_DISC_MODE;
init.advdata.uuids_complete.uuid_cnt = ARRAY_COUNTOF(adv_uuids);
init.advdata.uuids_complete.p_uuids = adv_uuids;
init.advdata.include_appearance = true;
init.srdata.name_type = BLE_ADVDATA_FULL_NAME;
init.config.ble_adv_on_disconnect_disabled = false; // restart after disconnect
init.config.ble_adv_fast_enabled = true;
init.config.ble_adv_fast_interval = 874; // 546.25ms
init.config.ble_adv_fast_timeout = BLE_GAP_ADV_TIMEOUT_GENERAL_UNLIMITED;
init.evt_handler = on_adv_evt;
init.error_handler = on_adv_error;

memset(&s_adv.adv_params, 0, sizeof(s_adv.adv_params));
APP_ERROR_CHECK(ble_advertising_init(&s_adv, &init));

And then we start advertising:

APP_ERROR_CHECK(ble_advertising_start(&s_adv, BLE_ADV_MODE_FAST));

We have no calls to ever stop advertising, and we never turn off BLE via SoftDevice or shut down SoftDevice itself. Our error handler logs the error and reboots the nRF52840. Our event handler logs, and if the event is BLE_ADV_EVT_IDLE, reboots the nRF52840. We haven't seen any of these reboots in the logs yet, so it looks like when advertising starts, it works, and nobody stops it intentionally.

Does anyone have any idea where I could look to start diagnosing this? Are there any errata in our version of SoftDevice or the nRFSDK that might be in play here?

Thanks in advance,

Charles

0 charles_fi over 1 year ago

More information that might be relevant, now that I've thought about it more:

We're using NRF_SDH_DISPATCH_MODEL_POLLING for our SoftDevice dispatch model. That means that it's possible that our application isn't servicing SoftDevice frequently enough, and SoftDevice's internal BLE event queue is filling up. If our application never gets the BLE_GAP_EVT_DISCONNECTED event, then exactly what I'm seeing might happen.

Can anyone provide any information on how many BLE events SoftDevice will queue up, and what it does when its internal BLE event queue is exhausted?

0 charles_fi over 1 year ago

Digging around even more, I found reference to DRGN-15619 in the SoftDevice release notes:

Fixed an issue where the application might not receive a BLE_GAP_EVT_DISCONNECTED event if the application does not continuously pull events from the SoftDevice (DRGN-15619).

I suspect that this might be what's happening here; does anyone have workaround guidance or more details? Specifically, is it only BLE_GAP_EVT_DISCONNECTED events that are missed? Can any other events (connection interval (re-)negotiation, ATT MTU exchange, etc) be missed?

0 Edvin over 1 year ago
Hello,

It may be that the DRGN-15619 is what you are actually seeing, but it should be possible to work around this.

First of all, you should try to minimize the amount of time you spend in your interrupts, especially the high-priority interrupts, such as the SoftDevice events. If you are processing the data inside these event handlers, try instead to copy the data to a buffer and then process it from the main loop. This way, the CPU will always be ready to handle new SoftDevice interrupts.

Pseudo code:

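volatile bool processing_required = false;
uint8_t my_data[MY_DATA_SIZE];   // MY_DATA_SIZE is a placeholder: size it for your largest payload

// Event handler: copy the payload out and return as quickly as possible.
void softdevice_interrupt(p_ble_evt_t * p_ble_evt)
{
    for (uint8_t i = 0; i < p_ble_evt->evt.data.length; i++)
    {
        my_data[i] = p_ble_evt->evt.data.buf[i];
    }
    processing_required = true;   // set the flag only after the copy is complete
}

// Runs in main context, so it may take as long as it needs.
void process_data(void)
{
    bool done = do_slow_processing_of_data(my_data);
    if (done)
    {
        processing_required = false;
    }
}

int main(void)
{
    ...
    for (;;)
    {
        if (processing_required)
        {
            process_data();
        }
        sleep();
    }
}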
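If more than one event can arrive before the main loop gets to run, a single buffer plus a flag will overwrite unprocessed data. One possible variation, sketched here under assumptions, is a small single-producer/single-consumer ring buffer. All names (evt_queue_t, evt_queue_push, evt_queue_pop) and both size constants are hypothetical placeholders, not SDK APIs and not values derived from SoftDevice behaviour. The sketch also assumes a single-core target where the event handler is the only producer and the main loop is the only consumer.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define EVT_QUEUE_DEPTH 8u   /* hypothetical depth; must be a power of two */
#define EVT_MAX_LEN     32u  /* hypothetical cap on one event's payload */

typedef struct {
    uint8_t len;
    uint8_t data[EVT_MAX_LEN];
} evt_slot_t;

typedef struct {
    evt_slot_t        slots[EVT_QUEUE_DEPTH];
    volatile uint32_t head;     /* advanced only by the producer (event handler) */
    volatile uint32_t tail;     /* advanced only by the consumer (main loop) */
    volatile uint32_t dropped;  /* events rejected because the queue was full or too long */
} evt_queue_t;

/* Called from the event handler: copy the payload and return immediately. */
static bool evt_queue_push(evt_queue_t *q, const uint8_t *data, uint8_t len)
{
    if ((q->head - q->tail) >= EVT_QUEUE_DEPTH || len > EVT_MAX_LEN) {
        q->dropped++;
        return false;
    }
    evt_slot_t *slot = &q->slots[q->head & (EVT_QUEUE_DEPTH - 1u)];
    slot->len = len;
    memcpy(slot->data, data, len);
    q->head++;  /* publish the slot only after it is fully written */
    return true;
}

/* Called from the main loop: returns false when nothing is pending. */
static bool evt_queue_pop(evt_queue_t *q, evt_slot_t *out)
{
    if (q->head == q->tail) {
        return false;
    }
    *out = q->slots[q->tail & (EVT_QUEUE_DEPTH - 1u)];
    q->tail++;
    return true;
}
```

The main loop would call evt_queue_pop repeatedly until it returns false before going to sleep. A non-zero dropped counter is a sign that the depth is too small for the event rate.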

Try this, and let me know if it doesn't work.

Best regards,

Edvin

0 charles_fi over 1 year ago in reply to Edvin

Thanks for the quick response, Edvin, I appreciate it.

We used to do exactly what you describe, but then we also had an issue where nrf_pwr_mgmt_run would return immediately, and the system would refuse to sleep. Moving to polling (NRF_SDH_DISPATCH_MODEL_POLLING) seemed to alleviate that, though we were never sure why. That adventure was detailed in the thread "nrf_pwr_mgmt_run sometimes not sleeping?", with a bunch of red herrings.

I'm going to revisit this issue and move back to a model where we're interrupt-driven, but that will take some time and units in the field are suffering from this defect today, so I need a very quick band-aid that's better than a reset.

My suspicion is that internally, SoftDevice is simply overwhelmed with events and an internal queue it has is getting exhausted. Every other BLE event can be retransmitted by the peer, so it shows up naturally ... eventually. But, disconnects are different, because the peer doesn't retransmit them; the peer is long gone. Is this a correct assumption of what's happening under the hood?

I ask because the "band-aid" I'm considering is to use my radio-notification system to detect the "SD dropped a disconnect event" case and then create a ble disconnect event myself and manually dispatch it to the observers by doing a section-iterator over the observers, the same way that the app-side SoftDevice code does.

Does this sound like a reasonable approach while we take the time to move back to the EGU2 dispatch model?

0 charles_fi over 1 year ago in reply to Edvin

Also, another question- how big should "my_data" be in the example code you provided (which, thanks, btw)? What's the maximum rate that SoftDevice can generate events, as a function of the connection interval, connection event duration, ATT MTU, etc?