This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Connection failure - device gets stuck sending a notification in an infinite loop

I am working on an nRF52832 BLE peripheral prototype which sends bulk data to a central as bursts of large notifications. I am encountering an error where the connection gets stuck - the peripheral keeps trying to send the same notification forever in an infinite loop, and calls to sd_ble_gatts_hvx return NRF_ERROR_RESOURCES. The devices will remain in this state forever - neither side crashes, and if the connection is lost the peripheral sees the disconnect event and resumes normal operation.

The problem is similar to the one in these questions:

https://devzone.nordicsemi.com/f/nordic-q-a/37204/getting-error-nrf_error_resources-when-using-more-then-one-notification-and-the-connection-crashes/145134#145134

https://devzone.nordicsemi.com/f/nordic-q-a/36238/connection-failure-when-sending-and-receiving-data-simultaneously-with-softdevice-6-0-and-sdk-15/140267

https://devzone.nordicsemi.com/f/nordic-q-a/39538/sd_ble_gatts_hvx-nrf_error_resources-in-do-while-loop

My prototype application is running on:

  • PCA10040 v1.1.1 board with nRF52832 rev 1
  • SDK is nRF5_SDK_15.2.0_9412b96
  • SoftDevice is S132 v6.1.0
  • Toolchain is GCC ARM Embedded 6.3 2017-q2-update running on Windows 10

The application uses FreeRTOS. One task handles BLE stack events, and another task sends out notifications. It calls sd_ble_gatts_hvx in a loop until it returns NRF_ERROR_RESOURCES, and then blocks on an RTOS event flag. When the BLE_GATTS_EVT_HVN_TX_COMPLETE event arrives from the BLE stack, the event flag is set and it continues sending.

I have been able to reproduce the problem reliably, but only under very specific conditions: the central device is a nRF52 Dongle (PCA10059 with nRF52840), with pc-ble-driver-js 2.4.3 and a nodejs test script that I wrote, and the J-Link debugger is NOT connected to the PCA10040 peripheral device. The problem does NOT occur if any one of these changes are made to the test setup:

  • Use a PCA10040 board to run the connectivity firmware instead of the PCA10059 dongle
  • Put a long enough delay in between each call to sd_ble_gatts_hvx (40ms worked)
  • Use the nRF Connect application instead of my nodejs test script as the central, with the PCA10059
  • Use nRF Connect on an Android phone as the central
  • Connect J-Link debugger to the peripheral

The effect of the J-Link is strange but repeatable: if the PCA10040 is plugged into the PC USB with no application connected to it, then the problem will occur. But if any J-Link application is running and connected to the debugger then the problem no longer occurs and everything works perfectly. This works with JLink.exe, JLinkRTTViewer.exe, and JlinkGDBServer.exe. I tried changing lots of random things in the project to try to figure out what effect connecting the debugger might have, but nothing worked: disable RTT logging, enable UART logging, request LFCLK and HFCLK, set CONSTLAT power mode, use SYSTICK instead of RTC for FreeRTOS, disable tickless idle, do not enable SCB_SCR_SLEEPDEEP. I assume there must be some kind of race condition somewhere and connecting the debugger makes some slight change to the timing?

I have attached a minimal test program that demonstrates the problem:

error_demo.zip

The peripheral is in the error_demo directory. To run, copy that directory to nRF5_SDK_15.2.0_9412b96\examples\ble_peripheral in the SDK and run "make flash" from nRF5_SDK_15.2.0_9412b96\examples\ble_peripheral\error_demo\pca10040\s132\armgcc (after flashing S132 SoftDevice). The test app advertises as a device named "ErrorDemo". When a central connects, after a 2 second delay it will start sending notifications from its custom service's characteristic if notifications are enabled. It sends a couple hundred notifications that contain about 50k of data total.

The central is in the error_demo_nodejs_client directory. I am running this under nodejs v8.12.0 (64-bit). First run "npm install". Then, edit errdemo.js. At the bottom fill in the COM port of your nRF52 dongle, and MAC address of your peripheral, then run "node errdemo.js". It will connect to the peripheral and print a debug message to the console for each notification it receives. It prints "Got the last notify" if it got all of them.

Using the version of the sniffer from this post I was able to capture logs of the failure:

https://devzone.nordicsemi.com/f/nordic-q-a/39439/nrf-sniffer-unable-to-track-packet-data-length-changes

This log shows the failure occurring:

pca10059_node_client_fail.pcapng

It starts out sending a few data notifications that are received successfully, the first is packet 1855 and the last good one is 1897. Then the problematic one is sent at 1899. The first odd thing happens at 1900, the master sends an empty PDU with SN 0 and NESN 1, maybe this should have been SN 0 and NESN 0? After this point the peripheral just keeps sending that notification from 1899 in an infinite loop.

This log shows everything working successfully, a couple hundred notifications are sent and then the connection becomes idle.

pca10059_node_client_success_with_jlink.pcapng

The only difference in the test setup between this log and the previous one is that JLinkGDBServer.exe was running and connected to the peripheral board.

  • just found one race condition that could cause deadlock in the application.

    Looking at your description it seems possible that this case is hit by it if we manage to loose one softdevice notifying an event. 

    Please apply the solution given in this post and see if the race condition is removed.

  • It still behaves exactly the same with that fix. When I was experimenting yesterday I had also tried replacing vTaskSuspend(NULL) with vTaskDelay(1) in softdevice_task and that also didn't make any difference.

    The application isn't deadlocked, because even after it gets into the bad state where the connection is stuck, if the connection is lost the application does get the BLE_GAP_DISCONNECTED_EVT event from the stack.

  • OK, if  you get a disconnected event, then this is definitely different. I still recommend you not to use suspend/resume as there is definitely a corner case there and you should not take chances. 

    I will give it a try with your project to see if i can reproduce it, if I did, then it is easier for me to debug.

  • In the post i mentioned to replace xTaskResumeFromISR/vTaskSuspend in softdevice_task with vTaskNotifyGiveFromISR/ulTaskNotifyTake. Did you do that? Because I cannot reproduce any disconnects with this. 

  • Hello,

    I did apply the nrf_sdh_freertos.c bug fix to the SDK and that did not fix the problem.

    I have also encountered the same problem again, this is what has changed since my previous report:

    • Running my application on a custom nRF52832 board instead of PCA10040
    • Application is sending actual data read out of external flash memory instead of dummy data like my simplified error_demo example
    • SDK is updated to nRF5_SDK_15.3.0_59ac345 (now includes the nrf_sdh_freertos.c bug fix)
    • SoftDevice is S132 v6.1.1
    • Toolchain is GCC ARM Embedded 7 2018-q2-update running on Windows 10
    • pc-ble-driver-js is v2.5.1 running under nodejs 64-bit v10.15.1

    The problem still only occurs when the debugger is disconnected. Connecting the J-Link and launching the RTT Viewer fixes the problem.

    I have attached two more nRF Sniffer logs that show the link layer getting into a bad state:

    fail1.pcapng

    In fail1.pcapng, everything is working properly until packet 11003. Before that point there are a lot of lost packets being re-sent but they always get through eventually.

    Packet 11004 is the first time the slave sends the packet (notification whose payload starts with 0x3a) which will eventually be re-sent in an infinite loop. It has SN=1 and NESN=0.

    In packet 11005 the master NACKs it by responding with SN=1 and NESN=1.

    In packet 11006 the slave resends it with SN=1 and NESN=0.

    In packet 11007 the master ACKs it by responding with SN=0 and NESN=1.

    In packet 11008 the slave resends it again even though the master just ACKed, and from this point onwards the slave just keeps sending the same notification again in an infinite loop.

    The slave keeps sending the same packet, with SN=1 and NESN toggling between 0 and 1. The master will not ACK the packet by responding with NESN=0. It always responds with NESN=1 indicating that it wants the slave to resend the packet.

    fail2.pcapng

    In fail2.pcapng everything works until packet 3106.

    Packet 3106 is the first time the slave sends the notification (payload begins with 0x0e) which will be re-sent in an infinite loop.

    In packet 3107 the master NACKs it.

    The sniffer shows packet 3108 as invalid CRC. It looks like a bit got flipped in the ATT opcode changing it from 0x1b to the invalid value 0x1a.

    In packet 3109 the master NACKs packet 3108 (?)

    In packet 3110 the slave retransmits packet 3106 again and from this point onwards the devices are stuck in an infinite loop.

    From packet 3110 onwards the slave keeps sending the same packet over and over again, always with SN=0 and toggling NESN between 0 and 1. The master keeps responding, each time with its SN matching the previous slave packet's NESN. But it never acknowledges the packet by changing its NESN to 1, it always sends NESN=0 telling the slave to re-send it.

    In both these cases it looks like the problem exists within the master, not the slave. The slave keeps sending the same packet over and over again, which looks valid according to the sniffer. And in the packets it sends, the master is correctly setting its SN bit according to the last NESN bit from the slave, so it seems like the master must be seeing the packets from the slave. But the master will not acknowledge the packet by setting its NESN bit to the opposite of the SN bit in the packet the slave keeps sending repeatedly.

    Maybe the problem is similar to this one? https://devzone.nordicsemi.com/f/nordic-q-a/17797/sniffer-detects-ll-infinite-loop

Related