
ble_advertising doesn't always start or restart after disconnect?

nRF5 SDK v17.0.1, S140 v7.2.0

We're using the ble_advertising nRF SDK module to control advertising. We allow one connection at a time, and as soon as the peer disconnects, we want to immediately restart advertising. We try to advertise forever; we never want to time out or stop. We're not using Mesh or anything else, just plain and boring BLE.

We're seeing units in the field that simply "go dead" on BLE, but are otherwise still running (they connect to the internet via WiFi and/or LTE on separate coprocessors). I added a bunch of diagnostics, including the radio_notification module. I added a safeguard that counts radio notification events: if, over a 10-minute period, it sees fewer than 30% of the expected number of radio notification "active" events (based on our advertising + connection interval settings), it logs how many it saw and then reboots.
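
The safeguard is roughly the following sketch (simplified; the helper names are ours and illustrative, while ble_radio_notification_init and app_timer are the stock SDK APIs):

#include "app_error.h"
#include "app_timer.h"
#include "ble_radio_notification.h"
#include "nrf_log.h"

#define EXPECTED_EVENTS_PER_10MIN 1098
#define RESCUE_THRESHOLD          ((EXPECTED_EVENTS_PER_10MIN * 30) / 100)

APP_TIMER_DEF(m_rescue_timer);
static volatile uint32_t m_active_count = 0;

// Called shortly before (radio_active == true) and after each radio event.
static void on_radio_notification(bool radio_active)
{
    if (radio_active)
    {
        m_active_count++;
    }
}

// Ticks once a minute (app_timer's max timeout is ~512 s at the default RTC
// config); every tenth tick we evaluate the past 10 minutes of activity.
static void rescue_timeout_handler(void * p_context)
{
    static uint32_t minutes = 0;
    if (++minutes < 10)
    {
        return;
    }
    minutes = 0;

    uint32_t seen = m_active_count;
    m_active_count = 0;
    if (seen < RESCUE_THRESHOLD)
    {
        NRF_LOG_ERROR("radio: %u/%u events", seen, EXPECTED_EVENTS_PER_10MIN);
        NRF_LOG_FINAL_FLUSH();
        NVIC_SystemReset();
    }
}

static void rescue_watchdog_init(void)
{
    APP_ERROR_CHECK(ble_radio_notification_init(APP_IRQ_PRIORITY_LOW,
                                                NRF_RADIO_NOTIFICATION_DISTANCE_800US,
                                                on_radio_notification));
    APP_ERROR_CHECK(app_timer_create(&m_rescue_timer, APP_TIMER_MODE_REPEATED,
                                     rescue_timeout_handler));
    APP_ERROR_CHECK(app_timer_start(m_rescue_timer, APP_TIMER_TICKS(60 * 1000), NULL));
}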

The data came in! Each of these counts shows the number of radio notification callbacks we had over the 10-minute window that led to a "rescue" reboot.

radio: 250/1098 events
radio: 274/1098 events
radio: 233/1098 events
radio: 322/1098 events
radio: 236/1098 events
radio: 289/1098 events
radio: 238/1098 events
radio: 0/1098 events
radio: 295/1098 events
radio: 327/1098 events
radio: 0/1098 events

So, we're seeing two different phenomena here: the assertions that fired with close to 30% (e.g. 322/1098) are likely advertising simply not restarting after losing a connection. The assertions with 0 events suggest that SoftDevice somehow never started advertising at all?

Here's how we're initializing advertising:

ble_advertising_init_t init;
memset(&init, 0, sizeof(init));

ble_uuid_t adv_uuids[] = { { OUR_SERVICE_UUID, our_uuid_type } };
init.advdata.flags = BLE_GAP_ADV_FLAGS_LE_ONLY_GENERAL_DISC_MODE;
init.advdata.uuids_complete.uuid_cnt = ARRAY_COUNTOF(adv_uuids);
init.advdata.uuids_complete.p_uuids = adv_uuids;
init.advdata.include_appearance = true;
init.srdata.name_type = BLE_ADVDATA_FULL_NAME;
init.config.ble_adv_on_disconnect_disabled = false; // restart after disconnect
init.config.ble_adv_fast_enabled = true;
init.config.ble_adv_fast_interval = 874; // 546.25 ms, in 0.625 ms units
init.config.ble_adv_fast_timeout = BLE_GAP_ADV_TIMEOUT_GENERAL_UNLIMITED;
init.evt_handler = on_adv_evt;
init.error_handler = on_adv_error;

memset(&s_adv.adv_params, 0, sizeof(s_adv.adv_params));
APP_ERROR_CHECK(ble_advertising_init(&s_adv, &init));

And then we start advertising:

APP_ERROR_CHECK(ble_advertising_start(&s_adv, BLE_ADV_MODE_FAST));

We never call anything that stops advertising, and we never disable BLE via the SoftDevice or shut down the SoftDevice itself. Our error handler logs the error and reboots the nRF52840. Our event handler logs, and if the event is BLE_ADV_EVT_IDLE, it reboots the nRF52840. We haven't seen any of these reboots in the logs yet, so it looks like once advertising starts, it works, and nobody stops it intentionally.

Does anyone have any idea where I could look to start diagnosing this? Are there any errata in our version of SoftDevice or the nRF5 SDK that might be in play here?

Thanks in advance,

Charles

  • Hello,

    It may be that DRGN-15619 is what you are actually seeing, but it should be possible to work around it.

    First of all, you should try to minimize the amount of time you spend in your interrupt handlers, especially the high-priority ones such as the softdevice event interrupt. If you are using these events to process data, try instead to copy the data to a buffer and let the main loop do the processing. This way, the CPU will always be ready to handle new softdevice interrupts.

    Pseudo code:

    volatile bool processing_required = false;
    uint8_t my_data[MY_DATA_MAX_LEN]; // sized for the largest payload you expect
    uint16_t my_data_len = 0;
    
    // Runs in the softdevice event interrupt: just copy the data out and return.
    void softdevice_interrupt(ble_evt_t const * p_ble_evt)
    {
        // (Illustrative field names; use the members of the event you actually handle.)
        for (uint16_t i = 0; i < p_ble_evt->evt.data.length; i++)
        {
            my_data[i] = p_ble_evt->evt.data.buf[i];
        }
        my_data_len = p_ble_evt->evt.data.length;
        processing_required = true; // set the flag last, after the copy is done
    }
    
    // Runs in the main loop, at thread priority.
    void process_data(void)
    {
        bool done = do_slow_processing_of_data(my_data, my_data_len);
        if (done)
        {
            processing_required = false;
        }
    }
    
    int main(void)
    {
        ...
        for (;;)
        {
            if (processing_required)
            {
                process_data();
            }
            sleep();
        }
    }

    Try this, and let me know if it doesn't work.

    Best regards,

    Edvin

  • Also, another question: how big should "my_data" be in the example code you provided (which, thanks, btw)? What's the maximum rate at which SoftDevice can generate events, as a function of the connection interval, connection event duration, ATT MTU, etc.?

  • charles_fi said:
    My suspicion is that internally, SoftDevice is simply overwhelmed with events and an internal queue it has is getting exhausted.

    That is correct. There is a scheduler in the softdevice trying to manage all the events, discarding the "less important" ones if the queue fills up. The bug, then, is that the disconnected event is sometimes mistakenly discarded and not passed to the application.

    There is no hard limit on what you should set your "my_data" buffer to; it depends on your application.

    I guess there is a limit to how large the buffer arriving in a notification or write request/command can be, but that should equal your MTU (probably a maximum of 247 bytes). I don't know whether you are able to process one buffer before the next one arrives, though, so maybe you need a double buffer? Triple?

    In the end, if you are not able to process your data faster than it comes in, you will fill up eventually, whether you have a 100-byte buffer or a 100 KB buffer.

    charles_fi said:
    I'm considering is to use my radio-notification system to detect the "SD dropped a disconnect event" case and then create a ble disconnect event myself

    That sounds like a good idea. You should be able to detect a missed disconnect using the radio notifications in something like a watchdog approach. I am not sure about generating the disconnected event yourself, though; you need to test it. Make sure that the softdevice's internal state is updated, so you don't get weird behavior. I don't have anything specific to check, but see whether you are able to start scanning/advertising, or whether the softdevice still thinks you are in a connection. My prediction is that the softdevice knows it is already disconnected, but make sure this is the case.
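
    If you do try it, inject the synthetic event only into your own dispatch path, never back into the softdevice. Something like this untested sketch, where app_ble_evt_handler stands in for your application's BLE event handler:

    static void inject_synthetic_disconnect(uint16_t conn_handle)
    {
        ble_evt_t evt;
        memset(&evt, 0, sizeof(evt));
        evt.header.evt_id                          = BLE_GAP_EVT_DISCONNECTED;
        evt.header.evt_len                         = sizeof(ble_evt_hdr_t) + sizeof(ble_gap_evt_t);
        evt.evt.gap_evt.conn_handle                = conn_handle;
        evt.evt.gap_evt.params.disconnected.reason = BLE_HCI_CONNECTION_TIMEOUT;
    
        // Deliver to the application's own handlers only; the softdevice must
        // never see this event.
        app_ble_evt_handler(&evt, NULL);
    }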

    Best regards,

    Edvin

  • Thanks for the response and all of the information, Edvin; it's very helpful and I think we have a good plan here.

    My hypothesis is that SD knows that it's disconnected, because it has stopped scheduling its connection events and the BLE radio shows no activity.

    But yes, please: I'd love to hear confirmation that SD is in an internally consistent and "sane" state in the face of event-queue exhaustion / application-side backpressure. That would tell us whether it's OK to continue, or whether we have to reset the nRF52.

    Best,

    Charles

  • I asked our SoftDevice team, and they said that the softdevice may actually be stuck in a different state. If you try to start advertising with a connectable advertisement, it may fail and cause an assert. 

    To be safe, you can start a non-connectable advertisement to "kick" the softdevice's state machine, and then stop it again before starting the connectable advertising. That should work, if you want to avoid rebooting the device.
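
    In pseudo code, the kick could look something like this (untested; the interval and the flags payload are placeholders, and we reuse the ble_advertising module's handle from your s_adv instance because S140 supports a single advertising set):

    static void adv_state_machine_kick(void)
    {
        static uint8_t enc_data[] = { 2, BLE_GAP_AD_TYPE_FLAGS,
                                      BLE_GAP_ADV_FLAG_BR_EDR_NOT_SUPPORTED };
    
        ble_gap_adv_data_t adv_data;
        memset(&adv_data, 0, sizeof(adv_data));
        adv_data.adv_data.p_data = enc_data;
        adv_data.adv_data.len    = sizeof(enc_data);
    
        ble_gap_adv_params_t adv_params;
        memset(&adv_params, 0, sizeof(adv_params));
        adv_params.properties.type = BLE_GAP_ADV_TYPE_NONCONNECTABLE_NONSCANNABLE_UNDIRECTED;
        adv_params.interval        = 160; // 100 ms, placeholder
        adv_params.duration        = BLE_GAP_ADV_TIMEOUT_GENERAL_UNLIMITED;
    
        // Start and immediately stop a non-connectable advertisement to "kick"
        // the state machine. APP_BLE_CONN_CFG_TAG as in the SDK examples.
        APP_ERROR_CHECK(sd_ble_gap_adv_set_configure(&s_adv.adv_handle, &adv_data, &adv_params));
        APP_ERROR_CHECK(sd_ble_gap_adv_start(s_adv.adv_handle, APP_BLE_CONN_CFG_TAG));
        APP_ERROR_CHECK(sd_ble_gap_adv_stop(s_adv.adv_handle));
    
        // Then resume normal connectable advertising via the module.
        APP_ERROR_CHECK(ble_advertising_start(&s_adv, BLE_ADV_MODE_FAST));
    }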

    Best regards,

    Edvin

  • Heya Edvin, thanks a lot for this information; it's exactly what we were hoping to discover.

    If I could, I'd like to ask one more question about this scenario. I'm fishing a little bit here, but I'm curious: is it possible that one of the ways the SoftDevice could manifest being "stuck" is a refusal to sleep when nrf_pwr_mgmt_run is called?

    I wonder if it's at all possible that the following chain of events is leading to our mysterious "CPU never sleeps and the battery dies" bug:

    1. SoftDevice generates BLE events.

    2. Our application sometimes doesn't service these events quickly enough.

    3. SoftDevice's internal BLE event queue is overwhelmed; SoftDevice does its best to recover, but fundamentally there's not much it can do in the face of exhaustion. It gets "stuck" in one of the states you're alluding to. Perhaps it forgot that it failed to enqueue an event, but still has a flag set that prevents it from sleeping because it thinks it has work to do?

    4. Our application is blissfully ignorant of the trauma we've subjected SoftDevice to, and we continue through our main firmware loop, calling nrf_pwr_mgmt_run as soon as we're done with our work.

    5. nrf_pwr_mgmt_run calls into sd_app_evt_wait(), but SoftDevice has a codepath that returns without executing the WFE/SEV/WFE sequence, or does something else that inhibits the sleep. (Schematically, steps 4 and 5 look like the sketch below.)
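
    A minimal sketch of the loop in steps 4 and 5 (do_application_work is a stand-in for our real work):

    for (;;)
    {
        do_application_work();
    
        // With the softdevice enabled, nrf_pwr_mgmt_run() ends up in
        // sd_app_evt_wait(); this is where we suspect the sleep is inhibited.
        nrf_pwr_mgmt_run();
    }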

    Could I ask you to liaise with the SoftDevice team about whether something like this is theoretically possible? I'd love to understand whether we're hunting two unrelated bugs, or whether our failure to handle SoftDevice events quickly enough could cause this no-sleep behavior as a downstream side effect.

    Best,

    Charles

  • charles_fi said:
    I wonder if it's at all possible that the following chain of events is leading to our mysterious "CPU never sleeps and the battery dies" bug:

    Difficult to say without having seen more of your application's implementation. I have seen many customer attempts at handling sleep and scheduling themselves that fail because things don't work the way they expect.

    Does this phenomenon only occur when the missing-disconnected-event issue occurs?

    Can you show me what your main loop looks like? What else do you do, other than sd_app_evt_wait()?

    How do you determine that the device doesn't sleep properly?

    I have forwarded your question to our softdevice team, but I am still waiting for a reply. So in the meantime we can start looking into the parts that I have access to.

    Best regards,

    Edvin


  • Hello again,

    The feedback from our SoftDevice team was that the missing disconnected event should not affect sd_app_evt_wait(), so we need to look into other possible reasons.

  • Ah, I wonder if the behavior could be caused by missing any _other_ softdevice events. If we're missing disconnected events, then we're probably missing other events as well in other scenarios.

    I'm mainly pursuing this line of questioning because when we originally moved from interrupt dispatch to app-scheduler dispatch for SD BLE events, the "CPU never sleeps" bug almost vanished. It's still there sometimes, though, which makes me think the root cause might be related to spending too much time in the SD interrupt event handler, or too much time between servicing SD BLE events.

    Anyway, we're about to ship an update that drains all events into a 4 KB circular queue for the app to service when it can. If that fixes all of the issues, then it might not matter much what the SD-internal root cause was. If it doesn't fix our issues, I'll be happy to open a separate ticket with evidence and more relevant questions :)
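
    Roughly, the scheme is the sketch below (names and the overflow policy are illustrative, not our production code): each event is copied, length-prefixed, into a byte ring from the SD interrupt and replayed from the main loop; app_ble_evt_handler stands in for our real BLE handler.

    #define EVT_QUEUE_SIZE 4096u // divides 2^32, so free-running indices wrap safely
    
    static uint8_t           m_queue[EVT_QUEUE_SIZE];
    static volatile uint32_t m_head; // total bytes written (SD interrupt context)
    static volatile uint32_t m_tail; // total bytes read (main loop)
    
    static void queue_copy_in(uint8_t const * p_src, uint32_t len)
    {
        for (uint32_t i = 0; i < len; i++)
        {
            m_queue[(m_head + i) % EVT_QUEUE_SIZE] = p_src[i];
        }
        m_head += len;
    }
    
    static void queue_copy_out(uint8_t * p_dst, uint32_t len)
    {
        for (uint32_t i = 0; i < len; i++)
        {
            p_dst[i] = m_queue[(m_tail + i) % EVT_QUEUE_SIZE];
        }
        m_tail += len;
    }
    
    // SD interrupt context: enqueue the whole event and return immediately.
    static void ble_evt_enqueue(ble_evt_t const * p_ble_evt, void * p_context)
    {
        uint16_t len = p_ble_evt->header.evt_len; // includes the event header
        if ((m_head - m_tail) + sizeof(len) + len > EVT_QUEUE_SIZE)
        {
            NVIC_SystemReset(); // queue exhausted; we'd rather reboot than drop events
        }
        queue_copy_in((uint8_t const *)&len, sizeof(len));
        queue_copy_in((uint8_t const *)p_ble_evt, len);
    }
    
    // Main loop: replay queued events at thread priority.
    static void ble_evt_drain(void)
    {
        static uint8_t evt_buf[BLE_EVT_LEN_MAX(NRF_SDH_BLE_GATT_MAX_MTU_SIZE)] __ALIGN(4);
    
        while (m_head != m_tail)
        {
            uint16_t len;
            queue_copy_out((uint8_t *)&len, sizeof(len));
            queue_copy_out(evt_buf, len);
            app_ble_evt_handler((ble_evt_t *)evt_buf, NULL);
        }
    }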
