
Connection failure when sending and receiving data simultaneously with SoftDevice 6.0 and SDK 15

I’m experiencing random connection failures when transferring data in both directions at the same time between my peripheral (Slave) and central (Master) devices.
The problem appeared right after upgrading to SoftDevice S132 (version 6.0) and SDK version 15.
There were no such issues with the previous versions, SoftDevice 5.0 and SDK 14.

The problem occurs when the two devices (Master and Slave) start to stream data bidirectionally at about 8 kB/s each.
It takes from a few seconds up to a few minutes before the connection fails and both devices start to report an error (NRF_ERROR_RESOURCES) at the same time.
Furthermore, once this happens, the connection on the Master device seems to be completely dead.
The device can no longer send or receive notifications (even on a different characteristic) and is not "aware" of any connection events.
For example, the Slave device can be powered off and the Master device does not receive a “disconnection” event.

There are no issues with connecting, pairing or bonding.
Sending notifications, indications or small amounts of data from one device to the other seems to be OK too.
The problem starts when the devices go into fast “streaming mode” and larger amounts of data are exchanged.

Both devices are based on the nRF52832 and use the latest SoftDevice 6.0 / SDK 15.
The Slave device uses the “interrupt dispatch model” (NRF_SDH_DISPATCH_MODEL_INTERRUPT).
The Master device uses an RTOS and the “polling dispatch model” (NRF_SDH_DISPATCH_MODEL_POLLING).

Both devices use custom services that are very similar to “ble_nus” and “ble_nus_c” from the SDK.
The functions used for sending data are sd_ble_gatts_hvx and sd_ble_gattc_write.
Connection parameters (including negotiated ones) are as follows:
Data length 251 bytes
ATT MTU 247 bytes
PHY set to 2 Mbps
MIN_CONNECTION_INTERVAL 10 ms
MAX_CONNECTION_INTERVAL 20 ms
SLAVE_LATENCY 0
SUPERVISION_TIMEOUT 4000ms
NRF_SDH_BLE_GATT_MAX_MTU_SIZE 247
NRF_SDH_BLE_GATTS_ATTR_TAB_SIZE 1408
NRF_SDH_BLE_GAP_EVENT_LENGTH 400

One observation (not 100% confirmed):
When sending data in packets of 244 bytes (247 - 3), the connection seems to be stable.
Occasionally NRF_ERROR_RESOURCES errors appear, which is normal (I know I need to wait for the BLE_GATTS_EVT_HVN_TX_COMPLETE / BLE_GATTC_EVT_WRITE_CMD_TX_COMPLETE events), but the connection stays alive for a long time.
When data is sent in smaller packets (160 bytes), the connection usually fails after a few seconds.
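
The "wait for the TX-complete event" rule mentioned above can be sketched roughly as follows. This is a small self-contained model, not SoftDevice code: tx_ctx_t, tx_try_send, tx_on_complete, fake_send, ERR_OK and ERR_RESOURCES are all hypothetical names made up for this illustration. In a real application the send hook would wrap sd_ble_gatts_hvx or sd_ble_gattc_write, and tx_on_complete would be called from the handler for BLE_GATTS_EVT_HVN_TX_COMPLETE / BLE_GATTC_EVT_WRITE_CMD_TX_COMPLETE.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch (not the SDK's values). */
enum { ERR_OK = 0, ERR_RESOURCES = 1 };

/* Hook that attempts one transmission; stands in for the SoftDevice call. */
typedef int (*tx_send_fn)(const uint8_t *data, size_t len);

typedef struct {
    tx_send_fn send;      /* transport hook */
    const uint8_t *data;  /* packet parked until a TX buffer is free */
    size_t len;
    bool pending;         /* true while a packet is parked */
} tx_ctx_t;

/* Try to send; on ERR_RESOURCES park the packet instead of retrying in a loop. */
static int tx_try_send(tx_ctx_t *ctx, const uint8_t *data, size_t len)
{
    if (ctx->pending) {
        return ERR_RESOURCES;  /* caller must wait for a TX-complete event */
    }
    int err = ctx->send(data, len);
    if (err == ERR_RESOURCES) {
        ctx->data = data;
        ctx->len = len;
        ctx->pending = true;
    }
    return err;
}

/* Call when a TX-complete event arrives: retry the parked packet once. */
static int tx_on_complete(tx_ctx_t *ctx)
{
    if (!ctx->pending) {
        return ERR_OK;
    }
    int err = ctx->send(ctx->data, ctx->len);
    if (err != ERR_RESOURCES) {
        ctx->pending = false;  /* sent, or failed for some other reason */
    }
    return err;
}

/* Fake transport for illustration: succeeds while free buffers remain. */
static int fake_free_buffers = 1;

static int fake_send(const uint8_t *data, size_t len)
{
    (void)data;
    (void)len;
    if (fake_free_buffers == 0) {
        return ERR_RESOURCES;
    }
    fake_free_buffers--;
    return ERR_OK;
}
```

The point of the pattern is that a full TX queue is treated as flow control rather than as a failure: the sender stops and resumes only when the stack reports that a buffer has been freed.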

I’ve tried to use nRF Sniffer to catch the moment when the connection fails.
It wasn’t easy, as the tool has not been updated and has many limitations. However, a few screenshots were made.
The first picture shows the moment when the Master device stops responding (position no. 33071).


The other pictures show the very last packets that were sent.

I’ve spent a few days looking for the problem in BLE parameters, memory leaks, RTOS tasks and priorities, stack sizes and many other places.
Please give me a hint towards a solution.

PS: Downgrading to SoftDevice S132 version 5.0 and SDK 14 is not a solution, as those versions have other pairing/bonding issues with the latest Android devices.

  • Hi JRRSoftware,

    I was able to run some tests and found it very clear that your timer daemon task at priority 2 was starving your dummy task and SoftDevice task, which run at the same priority.

    #define configTIMER_TASK_PRIORITY ( 2 )

    Remember that in FreeRTOS it is very important to configure the kernel to suit your needs. Since you have many tasks in the Ready state at the same time with the same priority, the FreeRTOS scheduler will pick one task to run and starve the rest until that task blocks or suspends itself. The reason is that you have set time slicing between equal-priority tasks to 0. Your configuration for this is as below:

    #define configUSE_TIME_SLICING 0

    Quoting the text from FreeRTOS documentation

    configUSE_TIME_SLICING

    By default (if configUSE_TIME_SLICING is not defined, or if configUSE_TIME_SLICING is defined as 1) FreeRTOS uses prioritised preemptive scheduling with time slicing. That means the RTOS scheduler will always run the highest priority task that is in the Ready state, and will switch between tasks of equal priority on every RTOS tick interrupt. If configUSE_TIME_SLICING is set to 0 then the RTOS scheduler will still run the highest priority task that is in the Ready state, but will not switch between tasks of equal priority just because a tick interrupt has occurred.

    So if you set time slicing to 1 and leave preemption at 1, you should not see this problem.

    #define configUSE_PREEMPTION 1
    #define configUSE_TIME_SLICING 1

    I guess some timing in the SoftDevice related to notifications has changed by a few microseconds, which made it possible to trigger this corner case. Nevertheless, please choose your task priorities wisely; they are a crucial part of your application design.

     

  • Hello Aryan,

    We are experiencing a very similar problem.

    In our case our task priorities are correct (the softdevice task is the highest priority).  Our failure is caused by an ISR that is starving the softdevice task.  That ISR is consuming more than 50% of available CPU cycles.

    We are in the process of refactoring the design to correct this.  In doing so, questions have come up that we hope you can answer.

    I'll explain...

    We have one prototype ISR implementation that consumes very close to 0% of available CPU cycles.  That version seems to resolve the problem described in this forum.  Alas, this version is not optimal for our application.  We need to do more work in the ISR.

    We have made another prototype ISR implementation that is better for our application.  That version consumes about 5% of the CPU. Alas, it appears that the connection problem is back with this version.

    So here are the questions...

    Is the problem caused by the percentage of CPU stolen from the softdevice task?  Or is it the duration?  E.g. our ISR implementation that consumes 50% of the CPU only keeps the CPU for about 5 µs, but does so every 10 µs. The implementation that consumes 5% of the CPU keeps the CPU for about 50 µs and does so once every 1 ms or so.
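
To make the two quantities in this question concrete, the sketch below computes both the CPU share and the longest uninterrupted burst for each ISR variant, using only the figures quoted above (about 5 µs every 10 µs, and about 50 µs every 1 ms or so). The type and function names are made up for illustration, and the periods are the approximate values from the post, not measurements.

```c
#include <stdint.h>

/* Hypothetical description of a periodic ISR. */
typedef struct {
    uint32_t run_us;     /* time the ISR holds the CPU per invocation */
    uint32_t period_us;  /* time between invocations (must be non-zero) */
} isr_profile_t;

/* Share of CPU time taken by the ISR, in percent (integer, rounded down). */
static uint32_t isr_cpu_percent(isr_profile_t p)
{
    return (p.run_us * 100u) / p.period_us;
}

/* Longest stretch during which lower-priority code cannot run at all. */
static uint32_t isr_max_block_us(isr_profile_t p)
{
    return p.run_us;
}

/* The two variants described in the post (approximate figures). */
static const isr_profile_t ISR_SHORT_FREQUENT = { 5u, 10u };     /* 5 us every 10 us */
static const isr_profile_t ISR_LONG_RARE      = { 50u, 1000u };  /* 50 us every ~1 ms */
```

With these numbers the first variant takes 50% of the CPU in 5 µs bursts, while the second takes 5% in 50 µs bursts: the second has the lower load but blocks lower-priority code for ten times longer at a stretch.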

    My guess is that both are bad.  We will refactor once again and put most of the 50 µs processing in a task that is lower priority than the softdevice task.  My guess is this will fix things.

    The question though, is how long is "too long" for a user ISR to preempt the softdevice task?  It appears that 50 µs is too long.

    Comments?

    Thanks!

    Bruce

  • Dear Aryan and Bruce,

    After many hours of investigation I think I have managed to solve this problem.

    I say “I think” because I need to carry out further tests to be 100% sure. But so far my program works.

    The most important finding and observation is:

    In some very specific circumstances, the sd_ble_gattc_write() function does NOT raise an assertion when it should. There are situations where the assertion seems to be hidden or is somehow not propagated properly, which leads to the “connection failure” described in this topic.

    I’ve prepared a sample test program, where you can simulate different scenarios for the above.

    It is actually the 3rd version and the link is available here:

    https://drive.google.com/file/d/1cXsduRdlnS_2xqVA3PXpB3KtY_KcU9nZ/view?usp=sharing

    I found several cases where the “connection issue” might pop up:

    1. Task priorities
      Yes, Aryan was right in one of his previous replies. It is important to set task priorities right in the RTOS environment or use the “configUSE_TIME_SLICING” option.
    2. High priority ISR
      The BLE stack seems to be very sensitive to a heavy workload inside a high-priority ISR. It is very easy to reproduce the problem with a hardware timer whose IRQ priority is set to 0 (highest) and whose workload is “just” a loop counting to 400!
      Like this:
      for (int i=0;i<400;i++) {};
    3. Low priority ISR
      It is still possible to reproduce the problem with a low-priority ISR. In my case, the workload loop must count to 10000 (for (int i=0;i<10000;i++) {};)
      and you will probably need to wait a bit (while running my test program) until the problem occurs.
    4. Volatile variables
      Probably my most important finding, and the one that helped me solve the issue.
      I found an object (a buffer with control properties) that was shared between 2 RTOS tasks and was NOT declared as “volatile”. I still do not fully understand why it has such a big impact on the BLE stack (and the sd_ble_gattc_write function), causing the “connection failure” to pop up only occasionally, which makes it so hard to catch.
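
As an illustration of the last point, here is a minimal sketch of a buffer shared between one producer task and one consumer task, with the data and control fields declared volatile so that the compiler re-reads them on every access instead of caching them in registers. The names (shared_buf_t, shared_buf_put, shared_buf_get) are hypothetical and this is not the buffer from the test program. It assumes a single-core target with exactly one writer and one reader; volatile alone does not make access from several writers safe, and in that case a critical section or an RTOS queue would be needed.

```c
#include <stdbool.h>
#include <stdint.h>

#define SHARED_BUF_SIZE 8u  /* must be a power of two */

typedef struct {
    volatile uint8_t  data[SHARED_BUF_SIZE];
    volatile uint32_t head;  /* written only by the producer */
    volatile uint32_t tail;  /* written only by the consumer */
} shared_buf_t;

/* Producer side: returns false when the buffer is full. */
static bool shared_buf_put(shared_buf_t *b, uint8_t byte)
{
    uint32_t head = b->head;
    if (head - b->tail == SHARED_BUF_SIZE) {
        return false;
    }
    b->data[head & (SHARED_BUF_SIZE - 1u)] = byte;
    b->head = head + 1u;  /* publish the index only after the data is stored */
    return true;
}

/* Consumer side: returns false when the buffer is empty. */
static bool shared_buf_get(shared_buf_t *b, uint8_t *out)
{
    uint32_t tail = b->tail;
    if (tail == b->head) {
        return false;
    }
    *out = b->data[tail & (SHARED_BUF_SIZE - 1u)];
    b->tail = tail + 1u;  /* release the slot only after the data is read */
    return true;
}
```

Without volatile, an optimizing compiler is allowed to read head or tail once and reuse the stale value inside a polling loop, which would produce exactly the kind of rare, timing-dependent failure that is hard to catch.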

    I think there are a few questions that still need to be answered. For example, why is no assertion raised by sd_ble_gattc_write() when it should be?

    Hopefully, my test program can help to answer them.

    In the “main.c” file you will find the following defines that control the program:

    #define ISR_WORKLOAD_MODE

    0 - no workload called from ISR (timer) function

    1 - workload called on TESTER side

    2 - workload called on DUMMY (other) side

    3 - workload called on both sides

    #define ISR_IRQ_PRIORITY

    0-7 are the ISR (hardware timer) priorities. Use 0 for high and 7 for low priority. If set to 0, the WORKLOAD_TIMER_LOOP_COUNT can be set to 400, which is enough to reproduce the issue.

    #define ISR_WORKLOAD_EVERY

    This is the interval in ms at which the ISR workload is run.

    #define WORKLOAD_BUTTON_LOOP_COUNT

    Workload “heaviness” when called from button event.

    #define WORKLOAD_TIMER_LOOP_COUNT

    Workload “heaviness” when called from hardware timer.

    By the way, you can also replicate the problem by pressing the first button fast enough (on the PCA10040 dev boards).

    As mentioned before, the test program is based on the “ble_app_att_mtu_throughput” sample code from the SDK.

  • Hello Bruce,

    The SoftDevice always runs at the highest priority for its most time-critical tasks, and the application can never use this priority.

    The app can use the next-highest priorities (2 and 3), as you can see in the pic below.

    If your app ISR uses these higher priorities rather than priorities 5, 6 and 7, then you are blocking non-critical SoftDevice activities. This is legal as long as your connection does not depend on your app replying to a query from the peer. The SoftDevice can only notify the app of such a query through the SWI2 interrupt, which runs at priority 4. If your app stays at priority 2 or 3 long enough to block this notification, the peer may in some circumstances conclude that the connection is lost, or may deliberately terminate the connection because it received no response.

    It is hard to define "too long" here. I think it depends on whether any procedures can occur that require your app to respond within a certain time. Encryption could be one example.

  • Hi JRRSoftware,

    Good experiment, but you cannot use priorities 0, 1 and 4 in your test, as those will lead to undefined behavior.

    In any case, I think it is always wise to keep ISRs as short as possible.  Were you able to reproduce the issue without using priorities 0, 1 and 4?
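
The priority rules discussed in the last two replies can be captured in a small helper. This is a sketch based only on what is stated in this thread (priorities 0, 1 and 4 are off limits to the application, 2 and 3 are the higher application levels, 5, 6 and 7 the lower ones). The enum and function names are hypothetical, and the authoritative list is the SoftDevice specification for the version in use.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    IRQ_PRIO_RESERVED,  /* 0, 1, 4: not available to the application */
    IRQ_PRIO_APP_HIGH,  /* 2, 3: can delay SoftDevice event delivery to the app */
    IRQ_PRIO_APP_LOW,   /* 5, 6, 7: lower than the priority-4 event interrupt */
    IRQ_PRIO_INVALID    /* outside the 0..7 range */
} irq_prio_class_t;

/* Classify an interrupt priority according to the rules quoted in the thread. */
static irq_prio_class_t irq_prio_classify(uint8_t prio)
{
    switch (prio) {
    case 0: case 1: case 4:
        return IRQ_PRIO_RESERVED;
    case 2: case 3:
        return IRQ_PRIO_APP_HIGH;
    case 5: case 6: case 7:
        return IRQ_PRIO_APP_LOW;
    default:
        return IRQ_PRIO_INVALID;
    }
}

/* True if the application may use this priority for its own ISRs. */
static bool irq_prio_allowed_for_app(uint8_t prio)
{
    irq_prio_class_t c = irq_prio_classify(prio);
    return c == IRQ_PRIO_APP_HIGH || c == IRQ_PRIO_APP_LOW;
}
```

A check like this could be placed in a debug assertion next to the code that sets a peripheral's IRQ priority, so that a reserved value is caught at start-up rather than showing up as a sporadic connection failure.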

  • Yes, you can reproduce the issue using my test program. The IRQ priority for the hardware timer is set to 6.
    Try to run my test program without any modifications. All the defines are set to values that cause the issue on my dev boards. The settings are as follows:

    #define ISR_WORKLOAD_MODE 2 // simulate workload on DUMMY side
    #define ISR_IRQ_PRIORITY 6 // IRQ priority for hardware timer used to trigger the workload ISR
    #define ISR_WORKLOAD_EVERY 10 //trigger workload ISR every 10 ms
    #define WORKLOAD_TIMER_LOOP_COUNT 10000 // workload is a loop counting to 10 000

    Alternatively, you can keep pressing the first button (index 0) very fast to simulate the workload on both sides (tester/dummy), and if you are lucky enough, you will reproduce the issue that way too.

    PS: I'm aware that IRQ priorities 0, 1 and 4 can NOT be used. It was rather my hint for Bruce, and confirmation that they really cannot be used :)

    Jack
