
Random BLE Disconnects / NRF52832 stops responding to Central

We are currently experiencing a strange BLE disconnect issue that I haven't managed to sort out for a long time now.

We have a device that acts as a peripheral and is connected to an app over BLE. The connection is established normally and keeps running without any problems. On some of the devices, however, after a period of 10-25 minutes on average, the BLE connection will die without any indication of a reason.

Some details:

  • We have a custom board based on NRF52832
  • We are using SDK V15.3
  • We are using SoftDevice S132 6.1

The general symptoms are as follows:

  • After a certain random period of time (usually from 10 to 25 minutes) the NRF52832 board will stop responding to the packets coming from central
  • The central will try to retransmit the packet 23-24 times (this is in line with our connection params latency = 25)
  • The device will enter advertising mode
  • The app will connect again, exchange some information, discover services etc., and then the pattern will repeat.
  • The connection will be there for a couple of seconds and some data exchange will happen, but afterwards the device will again fail to respond to 23-24 packets from central and go into advertising mode
  • Repeat ad infinitum, with the periods of sustained connection becoming shorter and shorter.
  • After the first several disconnects the connection can still be reestablished for ~20-30 seconds or so
  • Later on the connected phase will decrease to virtually nonexistent - the device will stop responding right after the CONNECT_IND

Some additional remarks:

  • There doesn't seem to be a particular request from the central that we're not handling as the sniffer logs don't seem to show anything relevant
  • Most often the device will fail to respond to an EMPTY_PDU, but it will also fail on: LL_FEATURE_REQ, LL_PHY_REQ, LL_LENGTH_REQ, Sent Read By Group Type Request, Sent Read By Type Request, Sent Find Information Request, Sent Exchange MTU Response, and the list goes on
  • There's a suspicion that this issue is more prevalent on iPhones; however, a bigger portion of our user base has an iPhone, so this is probably just statistics
  • The issue is present on our production app, on our internal test app, as well as nrf_connect, so it's extremely unlikely that it's coming from the app side of things
  • The firmware isn't going through a soft reset when this issue occurs: we have a custom app_error check that logs errors to RAM (unless the SD asserts silently somewhere). Even if there was a silent reset, there are certain variables which will get zeroed out on reset, and we have ways of reading them out independent of BLE (3G modem + MQTT server)
  • A pin reset will in general resolve this issue, so if there was indeed a reset it would be highly likely to clear this BLE Loop of Death (this is our cute nickname for this behavior)

Things we have tried:

  • We had a suspicion that it had something to do with PHY requests, where newer iPhones would try to negotiate 2M PHY, but forcing 1M PHY didn't solve the issue
  • We've experimented with several different connection params, and though this had some effect on the quality & range of the signal, it didn't affect this particular issue whatsoever. Our connection params are well within the spec for iOS, which seems to be way more stringent than Android's spec
  • We disabled peer manager and this didn't affect the behavior

It's been a struggle to capture this behavior while sniffing packets with the NRF dongle (since we're still not sure exactly how to reproduce this issue), but we've finally been able to do it. Sadly though, the Wireshark logs didn't make the reason for this behavior directly evident to me.

Here are some snippets of what the sniffer sees during one of these episodes, as well as the full Wireshark log for deeper investigation.

1#

2#

3#

The whole log. The "Loop of Death" starts around 14:54:30, and it will keep disconnecting until almost the end of the log, where a pin reset has been triggered:

test_324_1062.pcapng

In addition to this, here's a screenshot from our internal fleet management system which shows the RSSI of the connection and the dynamics of this behavior. BLE is shown in yellow and the value axis for it is on the right - the RSSI is not stellar, hovering around -65 dBm, but I see no reason why this should be the cause of this behavior:

Any tip or suggestion would be most appreciated as I'm completely stumped. The most recent discovery that some iPhones would negotiate for 2M PHY was really promising, but after implementing this change without success I really don't know how to continue my investigation, as the Wireshark files didn't reveal any new & interesting info to me...

I'd be more than happy to provide additional info if it helps clarify what the hell is going on

  • While writing this post the following has occurred to me:

    The firmware isn't going through a soft reset when this issue occurs as we have a custom app_error check that logs errors to RAM (unless SD asserts silently somewhere)

    I tried to go over the entire codebase (which I have inherited in this somewhat broken state) to find if there were any instances of SoftDevice calls where the return values were not handled properly. And indeed, I discovered that there were. Namely, all the returns from sd_ble_gatts_hvx() were being ignored.

    After adding the relevant APP_ERROR_CHECK calls the firmware promptly started dying with error code 19 (NRF_ERROR_RESOURCES). Having done a bit of investigating, the issue seems to be that the TX queue of the SD fills up (in particular under not so ideal RSSI conditions). My theory is that attempts to retransmit data that didn't get ACK'd prevent further updates on the characteristics.

    My investigations have pointed me in the direction of two variables: ble_gatts_conn_cfg_t::hvn_tx_queue_size, which is the effective TX queue size, & NRF_SDH_BLE_GAP_EVENT_LENGTH, which seems to be the value that the SD looks at when internally configuring the aforementioned queue size.

    I have the following follow-up questions:

    What is the correct way of querying the hvn_tx_queue_size parameter? I would expect there to be something along the lines of ble_gatts_conn_cfg_get, but I couldn't find anything like that. My next attempt was to use sd_ble_opt_get(), but I haven't managed to find the magical combination of arguments to get it to work. There is a method for checking queue utilization described in the documentation for sd_ble_gatts_hvx():

    • Store initial queue element count in a variable.
    • Decrement the variable, which stores the currently available queue element count, by one when a call to this function returns NRF_SUCCESS.
    • Increment the variable, which stores the current available queue element count, by the count variable in BLE_GATTS_EVT_HVN_TX_COMPLETE event.

    It's not very transparent how the queue size is set, so I'm not sure what value to use for the initial queue element count.
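The three bookkeeping steps quoted above can be sketched roughly as follows. This is a host-side sketch, not SDK code: `hvn_credits_t`, `hvn_try_send`, `hvx_send_fn` and `fake_hvx_ok` are hypothetical names, and the send callback is a stand-in for a wrapper around sd_ble_gatts_hvx(). It also assumes that the initial count equals whatever hvn_tx_queue_size was configured for the connection (or the SoftDevice default if nothing was configured), which is exactly the value the question above is trying to pin down.

```c
#include <stdint.h>

/* Error codes as reported in the thread (19 == NRF_ERROR_RESOURCES). */
#define NRF_SUCCESS          0u
#define NRF_ERROR_RESOURCES 19u

/* Hypothetical stand-in for a wrapper around sd_ble_gatts_hvx(). */
typedef uint32_t (*hvx_send_fn)(const uint8_t *data, uint16_t len);

typedef struct {
    uint8_t  capacity;   /* assumed to equal the configured hvn_tx_queue_size */
    uint8_t  available;  /* free queue elements right now */
    uint32_t dropped;    /* sends skipped because no element was free */
} hvn_credits_t;

/* Step 1: store the initial queue element count. */
static void hvn_credits_init(hvn_credits_t *c, uint8_t queue_size)
{
    c->capacity  = queue_size;
    c->available = queue_size;
    c->dropped   = 0;
}

/* Step 3: call from the BLE_GATTS_EVT_HVN_TX_COMPLETE handler
 * with the count carried by that event. */
static void hvn_credits_on_tx_complete(hvn_credits_t *c, uint8_t count)
{
    uint16_t sum = (uint16_t)(c->available + count);
    c->available = (sum > c->capacity) ? c->capacity : (uint8_t)sum;
}

/* Step 2: only decrement when the send call reports success.
 * If no element is free, the call is skipped instead of asserting. */
static uint32_t hvn_try_send(hvn_credits_t *c, hvx_send_fn send,
                             const uint8_t *data, uint16_t len)
{
    if (c->available == 0) {
        c->dropped++;
        return NRF_ERROR_RESOURCES; /* retry after the next TX_COMPLETE */
    }
    uint32_t err = send(data, len);
    if (err == NRF_SUCCESS) {
        c->available--;
    }
    return err;
}

/* Host-side fake that always accepts the notification (for trying this out). */
static uint32_t fake_hvx_ok(const uint8_t *data, uint16_t len)
{
    (void)data;
    (void)len;
    return NRF_SUCCESS;
}
```

With this kind of counter, a full queue becomes a normal "try again later" condition rather than something that trips APP_ERROR_CHECK, and the `dropped` count gives a rough measure of how often the link is falling behind.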

    The second question is whether there is a way to mitigate this problem using the NRF_SDH_BLE_GAP_EVENT_LENGTH define. Our system seems to be working just fine under 95% of conditions with our current value of 12 (corresponds to 15 msec of on-air time as far as I understand). How do I account for the other 5% of conditions where the queue is getting overloaded? Or better yet, how do I calculate the optimal value for this parameter?

    FYI, under the heaviest load we are transmitting about 1122 bytes of data (including the preamble and all the overhead and characteristics being split into 2 etc...). With our current connection params of 60 msec max_conn_interval and 15 msec NRF_SDH_BLE_GAP_EVENT_LENGTH this should be well within the bandwidth. We're using 1M PHY, and if I understand the connection parameters correctly this should correspond to 15/60 = 1/4 of the max theoretical data rate = 250 kBit/sec = 31.25 kBytes/sec. If my calculations are correct we're using 3-4% of the theoretically available bandwidth but we still encounter problems...

    Each of our characteristics would have to be retransmitted on average maybe >25 times for the SD to start becoming overwhelmed
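The arithmetic in the paragraph above can be written out as a small sketch. The helper names are hypothetical. It assumes the 1122 bytes are sent per second (the post does not state the period), and it treats the result as an upper bound only: inter-frame spacing, the central's own packets and retransmissions all reduce what is really available.

```c
#include <stdint.h>

/* NRF_SDH_BLE_GAP_EVENT_LENGTH is given in units of 1.25 ms. */
static double event_length_ms(uint16_t units)
{
    return units * 1.25;
}

/* Raw on-air byte budget per second: the fraction of each connection
 * interval covered by the event length, times the PHY bit rate. */
static double raw_budget_bytes_per_s(double event_ms, double interval_ms,
                                     double phy_bits_per_s)
{
    double share = event_ms / interval_ms;
    if (share > 1.0) {
        share = 1.0; /* the event cannot be longer than the interval */
    }
    return share * phy_bits_per_s / 8.0;
}

/* Load as a percentage of the budget. */
static double utilisation_percent(double load_bytes_per_s,
                                  double budget_bytes_per_s)
{
    return 100.0 * load_bytes_per_s / budget_bytes_per_s;
}
```

For the numbers in the post, `event_length_ms(12)` gives 15 ms, `raw_budget_bytes_per_s(15.0, 60.0, 1e6)` gives 31250 bytes/sec, and 1122 bytes/sec against that budget is about 3.6%, which agrees with the 3-4% estimate.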

    I'd appreciate any suggestions as well as any corrections to my calculations...

  • Hi Nikolozka, 

    I don't think over-queueing notifications can cause such an issue. The SoftDevice will just ignore the call (return NRF_ERROR_RESOURCES) when the queue is full.

    The sniffer trace didn't suggest that the SoftDevice was under much stress.
    I can see that the Slave sometimes stopped responding for a few packets. I assume that's the Slave Latency setting you chose?

    Have you tried to debug the peripheral and check whether it hits any assertion? Or check what the disconnection reason is when it disconnects (BLE_GAP_EVT_DISCONNECTED)? I suspect that it hit an assertion and reset for some reason.

    Have you tried to test the same application on an nRF52 DK (so that we can rule out any issue with the hardware)?

    Modifying BLE_GATTS_HVN_TX_QUEUE_SIZE_DEFAULT won't make it transfer more data. You can try to increase NRF_SDH_BLE_GAP_EVENT_LENGTH, but by default it's 320 and should cover the entire connection interval that you have.

  • Hi Niko, 

    Disconnection reason 0x08 would suggest that it could be an issue with the oscillator or the radio hardware. What I can think of is that the peripheral gradually goes out of sync or the radio hardware doesn't work as it should.
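For reference when reading the reason reported with BLE_GAP_EVT_DISCONNECTED, here is a small lookup of some common HCI disconnect reason codes, including the 0x08 mentioned above. The helper names are hypothetical and the table is deliberately incomplete; the codes themselves come from the Bluetooth Core Specification error code list. In the nRF5 SDK the value would typically be read from `p_ble_evt->evt.gap_evt.params.disconnected.reason`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Human-readable names for a few HCI disconnect reasons. */
static const char *hci_reason_name(uint8_t reason)
{
    switch (reason) {
    case 0x08: return "connection timeout";
    case 0x13: return "remote user terminated connection";
    case 0x16: return "connection terminated by local host";
    case 0x22: return "LL response timeout";
    case 0x28: return "instant passed";
    case 0x3B: return "unacceptable connection parameters";
    case 0x3D: return "connection terminated due to MIC failure";
    case 0x3E: return "connection failed to be established";
    default:   return "other (see the HCI error code list)";
    }
}

/* 0x08 means the supervision timeout expired: no valid packet was
 * received from the peer for that long. */
static bool is_supervision_timeout(uint8_t reason)
{
    return reason == 0x08;
}
```

A 0x08 on the peripheral side says that it stopped hearing the central, which fits a radio or timing problem better than a firmware reset.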

    What you can do to test is: 

    1. Try testing a very simple application on the same board, ble_app_hrs for example. Let it run and check whether the disconnection happens or not.

    2. Try testing your firmware on an nRF52 DK, to verify whether it's a hardware issue or not.

    Which central device are you using? Have you tried to test the board with a different central for comparison?

  • We're using various phones as centrals. As I mentioned earlier, the issue has been quite hard to replicate under test conditions, so we're relying on production devices in the field for data collection. A couple of our customers have our internal test app that logs all the data that we make available over BLE.

    The central devices on which the issue is confirmed to have happened are a mid-range Samsung Galaxy (not 100% sure on the model), iPhone 12, iPhone X and iPhone 11. Most iPhones are running some version of iOS 14, generally up to date. It seems to be independent of the central though. There is no particular pattern to which central is better at triggering this.

    1. Thanks for the ble_app_hrs example suggestion, this would be an interesting test that would rule out other bits of our FW malfunctioning. I might also try stripping out our code to leave only the Bluetooth-related bits. But if the issue persists with the ble_app_hrs example, I'm assuming this would indicate a hardware / sync issue, right?

    2. Testing on the DK is a bit problematic as we rely on a lot of peripherals in our FW, so it would be an extremely stripped down version of our FW, which I suppose would crash only if there is something misconfigured about our connection parameters, for instance.

    If it is indeed a sync issue, would it be possible to improve this?

    We are using the internal RC with CTIV=16, TEMP_CTIV=2 & Accuracy=500ppm. Would tweaking these parameters yield some improvements?
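To get a feel for why the clock accuracy setting matters here, the sketch below applies the window widening formula from the Bluetooth Core Specification, ((central SCA + peripheral SCA) / 1,000,000) x time since the last anchor point, to the numbers mentioned in this thread (500 ppm, 60 msec interval, latency 25). The function name is hypothetical, the 50 ppm central accuracy is purely an assumption, and the negotiated interval on a given phone may differ from 60 msec.

```c
#include <stdint.h>

/* Extra listening time (in microseconds, each side of the expected
 * anchor point) a peripheral needs to cover the combined clock drift
 * since it last heard the central. With slave latency the peripheral
 * may skip up to `slave_latency` events, so the gap can be
 * (slave_latency + 1) connection intervals. */
static double window_widening_us(uint32_t central_ppm,
                                 uint32_t peripheral_ppm,
                                 double conn_interval_ms,
                                 uint16_t slave_latency)
{
    double since_anchor_us =
        conn_interval_ms * 1000.0 * (double)(slave_latency + 1);
    return ((double)(central_ppm + peripheral_ppm) / 1e6) * since_anchor_us;
}
```

With 50 + 500 ppm, a 60 msec interval and latency 25, this comes to roughly 858 microseconds, against about 33 microseconds with latency 0. If the RC oscillator drifts more than the accuracy declared to the SoftDevice, the real error can exceed the widened window and the peripheral misses the central, which would be consistent with a 0x08 timeout.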

  • Please do the test and let us know the result.
    From what I can tell from the log and the disconnection reason, the issue could be caused by RF noise, bad radio performance, or an out-of-sync oscillator.

  • I am facing the same problem: after connecting, it disconnects automatically from the central device. I tried a simple example and noticed that it disconnects even with the simple example, but after 1-2 min.

  • Hi Madhuru, 
    Please create a new ticket for your issue. Please try using the sniffer to get more insight into the connection, and send us the trace so we can analyze it.
