Connection failure when sending and receiving data simultaneously with SoftDevice 6.0 and SDK 15

I’m experiencing random connection failures when transferring data (both ways at the same time) between my peripheral (Slave) and central (Master) devices.
The problem appeared right after upgrading to SoftDevice S132 (version 6.0) and SDK 15.
There were no such issues with the previous versions of the SoftDevice (5.0) and SDK (14).

The problem occurs when the two devices (Master and Slave) start to stream data bi-directionally at about 8 kB/second each.
It takes from a few seconds up to a few minutes before the connection fails and both devices start to report an error (NRF_ERROR_RESOURCES) at the same time.
Furthermore, once this happens, the connection on the Master device seems to be completely dead.
The device can no longer send or receive notifications (even on a different characteristic) and is not "aware" of any connection events.
For example, the Slave device can be powered off and the Master device does not receive a “disconnection” event.

There are no issues with connection, pairing or bonding.
Sending notifications, indications or small amounts of data from one device to the other seems to be OK too.
The problem starts when the devices go into fast “streaming mode” and larger amounts of data are exchanged.

Both devices are based on the nRF52832 and use the latest SoftDevice 6.0/SDK 15.
The Slave device uses the “interrupt dispatch model” (NRF_SDH_DISPATCH_MODEL_INTERRUPT).
The Master device uses an RTOS and the “polling dispatch model” (NRF_SDH_DISPATCH_MODEL_POLLING).
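
For context, the dispatch model is a compile-time option in sdk_config.h. The fragment below is only a sketch of what the Master's setting presumably looks like. The numeric values are the ones listed in the comments of the SDK's sdk_config.h template, so treat them as an assumption and check them against your own copy.

```c
/* sdk_config.h (fragment, values assumed from the SDK template comments)
 *
 * 0 = NRF_SDH_DISPATCH_MODEL_INTERRUPT  events are dispatched from the SoftDevice event interrupt
 * 1 = NRF_SDH_DISPATCH_MODEL_APPSH      events are dispatched through the app scheduler
 * 2 = NRF_SDH_DISPATCH_MODEL_POLLING    the application must call nrf_sdh_evts_poll() itself
 */
#ifndef NRF_SDH_DISPATCH_MODEL
#define NRF_SDH_DISPATCH_MODEL 2   /* Master (RTOS build); the Slave build would use 0 */
#endif
```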

Both devices use custom Services which are very similar to “ble_nus” and “ble_nus_c” from the SDK.
The functions used for sending data are sd_ble_gatts_hvx and sd_ble_gattc_write.
The connection parameters (including negotiated ones) are as follows:
Data length 251 bytes
ATT MTU 247 bytes
PHY set to 2 Mbps
MIN_CONNECTION_INTERVAL 10 ms
MAX_CONNECTION_INTERVAL 20 ms
SLAVE_LATENCY 0
SUPERVISION_TIMEOUT 4000 ms
NRF_SDH_BLE_GATT_MAX_MTU_SIZE 247
NRF_SDH_BLE_GATTS_ATTR_TAB_SIZE 1408
NRF_SDH_BLE_GAP_EVENT_LENGTH 400

An observation has been made (but not 100% confirmed):
When sending data in packets of 244 bytes (247-3), the connection seems to be stable.
Occasionally, “NRF_ERROR_RESOURCES” errors appear, and this is normal (I know I need to wait for the BLE_GATTS_EVT_HVN_TX_COMPLETE / BLE_GATTC_EVT_WRITE_CMD_TX_COMPLETE events), but the connection stays alive for a long time.
When data is sent in smaller “packets” (of 160 bytes), the connection usually fails after a few seconds.

I’ve tried to use nRF Sniffer to catch the moment when the connection fails.
It wasn’t easy, as the tool has not been updated and has many limitations. However, a few screenshots have been made.
The first picture shows the moment when the Master device stops responding (pos. no. 33071).


The other pictures show the very last packets that were sent.

I’ve spent a few days investigating the problem: BLE parameters, memory leaks, RTOS tasks and priorities, stack sizes and many other places.
Please give me a hint towards a solution.

PS: Downgrading to SoftDevice S132 (version 5.0) and SDK 14 is not a solution, as those versions have other pairing/bonding issues with the latest Android devices.

  • Hi,

    As far as I can see in your sniffer trace screenshots, most of the packets were not decrypted properly (or were corrupted). I don't think there is any information in them that can be used.

    Have you made sure you are using sniffer v2.0?

    Could you check what exactly is throwing NRF_ERROR_RESOURCES? Was it sd_ble_gatts_hvx()?

    Could you check whether there is a disconnected event when the issue occurs (on the side reporting NRF_ERROR_RESOURCES)? Was there any assertion?

  • Hi,

    Answering your questions in short:

    1. Yes, I use Sniffer 2.0.
    2. NRF_ERROR_RESOURCES is occasionally thrown by both functions, sd_ble_gattc_write and sd_ble_gatts_hvx, on my Master and Slave device respectively.
    3. When the issue occurs, there are no “disconnection” events on either side (Master or Slave).
      No assertions either.
      Both programs (Master and Slave) seem to work normally, except for the BT communication, which seems to be dead from that point on.

    My devices are designed to stream audio data, bi-directionally.

    It seems like the issue occurs if “too much” data is sent or the BT channel (characteristic) is “flooded”.
    My work environment is very “noisy” (I mean, it is full of radio signals from other devices).

    When my devices go into “streaming” mode, there are occasional NRF_ERROR_RESOURCES errors on both sides. This is OK, as I have radio noise in the area or push a bit too much data per second. Depending on the connection quality (and parameters), this error occurs up to a few times a second or not at all.

    But when the issue happens (most likely when the connection quality is poor and there are lots of NRF_ERROR_RESOURCES errors per second on both sides), no more data can be sent from either side, and NRF_ERROR_RESOURCES is thrown each time I try to send anything from one side or the other. From that point on, none of my service characteristics work, on either side. I can’t send notifications, indications or data. The connection seems dead but no “disconnection” event is triggered, even if I power down my Slave device. However, if the Master device is powered down at that point, the Slave device does receive a “disconnection” event.

    I had many problems with Sniffer 2.0. It is very unstable. The screenshots provided are the best I’ve achieved so far. Interestingly, all packets related to pairing, bonding and even single notifications are fine (decrypted correctly). But packets from “streaming mode” seem corrupted, even though they were captured during the same “sniffing” session.

    The observation described in my previous post regarding packet length is not true. The issue occurs regardless of the connection parameters. However, it is far less likely if I “mimic” (on my Master device) the connection parameters used by an iPhone:

    Data length 27 bytes (not 251)
    ATT MTU 185 bytes (not 247)
    PHY 1 Mbps (not 2Mbps)

    Please note that my Slave device may work in 2 configurations: with a Phone or with my “Master device unit” (based on nRF52832). I have not replicated the issue (so far) with the Phone working as the Master device (and with the above connection parameters). It seems to occur only with my “Master device unit”.

    I would like to underline the fact that there were no such issues on the older versions (SDK 14 and SoftDevice S132 5.0).
    It all started when I upgraded the SDK and SD to the latest versions. There have been no other changes in my code since (except changes required by the new API regarding scanning, advertising and other new or changed functions in the SoftDevice / SDK).

  • Hi JRR,

    It could be that a bug fix meant to avoid a deadlock is actually causing more deadlocks. We will need to reproduce the issue here.

    Could you provide a simplified version that runs on the nRF52 DK and sends dummy data to reproduce the issue, so we can test it here?

    How long does it usually take for you to see the issue? Would testing in a less noisy environment make a significant difference?

  • Hi,

    I’ve managed to reproduce the problem using modified example code from the SDK.

    The source code is available here:

    https://drive.google.com/file/d/11sXPxutC-bAeOTJMpi_ByRIw5-pwclDq/view?usp=sharing

    I’ve modified the ble_app_att_mtu_throughput example to use FreeRTOS.

    A dummy RTOS task emulates a workload. In this case it is a loop counting to 100000 every 10 ms (roughly). The issue is more likely to be triggered by a series of “short” workloads. With “heavy” tasks that are triggered less often, the issue is less likely to appear.

    To replicate the problem,

    1. Put the JRR_DeadlockTest folder next to the ble_app_att_mtu_throughput (inside SDK_15\examples\ble_central_and_peripheral\experimental).
    2. Follow the steps described in the instructions for the ble_app_att_mtu_throughput example.

    Additional log lines show how many times sending data ended successfully or threw an NRF_ERROR_RESOURCES error. After a few seconds you should see that no data is being sent successfully (all attempts end with an error). The screenshot below shows this situation:

    The red arrow shows the point where the BT connection failed. Each time it happens at a slightly different point, sometimes sooner, sometimes later. No assertion and no disconnection event is triggered when this happens (on the TESTER device).

    The key facts to reproduce the problem are:

    1. Use FreeRTOS.
    2. Emulate a workload in a task that is triggered fairly often.
    3. Send data using a timer (or a task). Do not send data in response to the BLE_GATTS_EVT_HVN_TX_COMPLETE event.
    4. Try to “flood” the BT characteristic by sending too much data.
    5. Sending data bi-directionally is not necessary (as I originally thought). It is possible to reproduce the problem by sending data only one way.

    I hope it helps. Do you have any suggestions so far?

  • Thanks for providing us with the example code. Unfortunately we are running with reduced staff, so we couldn't test this right away.

    But could you try changing #if 0 to #if 1 at line 239 in SDKv15\external\freertos\portable\CMSIS\nrf52\port_cmsis_systick.c? This is a bug fix added to reduce power consumption, but it may have a side effect.

    You mentioned that NRF_ERROR_RESOURCES shows up on both sides. What would throw that error on the central side? Could you find out which function is throwing it?

    You mentioned that the central doesn't get a disconnected event even when you turn off the client. It sounds like the central may actually have hit an assertion/hardfault. Have you checked whether it is still functioning normally, e.g. that the main loop is still executing?

    Also, please make sure that you are not doing something at APP_HIGH priority for too long or too often. Doing so may block the SoftDevice from delivering events, including the DISCONNECTED event.

  • Hi again,

    Answering your questions:

    1. Unfortunately, your suggestion regarding line 239 in port_cmsis_systick.c does not make any difference.
    2. There are 2 functions that may throw the NRF_ERROR_RESOURCES error in this case:
      sd_ble_gatts_hvx and sd_ble_gattc_write. However, as mentioned in my previous post, bi-directional data transfer is not necessary to reproduce the problem. It is enough to send data one way only, and this is what I do in the second version of my test case (link below).
    3. When the issue occurs on a device (no matter whether Master or Slave), it becomes “blind and deaf” to BT events, data and notifications. However, other functions seem to work normally (there is no assertion, no hard fault, no stack overflow etc.). You can reproduce this using my test case and watch log lines still being printed on both sides.
    4. Also, there are no high-priority tasks in the provided examples. Furthermore, changing the priorities of the dummy task (simulated workload) and the SDH task does not make any difference either.

    Here is the simplified code example (JRR_DeadlockTest_v2) to reproduce the problem:
    drive.google.com/.../view

    You can easily compare this code with the original ble_app_att_mtu_throughput and see that there are only a few changes. I use 2 dev boards (PCA10040) to reproduce the problem. Usually it takes a few seconds from the start of the data transfer, but sometimes the test needs to be run again.

    The simulated workload is a task with a loop counting to 16000 every 5ms.

    Any other suggestions?

  • Hi, 

    I just tried it here and can reproduce the issue. We will start the investigation and get back to you when we have an update. 

  • Hi,

    I did further investigation and the problem is somehow related to the RTOS. It occurs only when NRF_SDH_DISPATCH_MODEL is set to NRF_SDH_DISPATCH_MODEL_POLLING.

    Moreover, when SD_EVT_IRQHandler() is triggered and softdevice_task() is handled immediately, there is NO such issue.
    But if another task is doing something and softdevice_task() cannot be handled immediately, then sometimes (but not always) the above issue happens.

    I'm not sure if I can use NRF_SDH_DISPATCH_MODEL_INTERRUPT together with an RTOS. For now it seems to be the only solution. What is your advice?
