nRF5340 LE disconnection Issue.

AKV 11 months ago

Hi,

We are using nRF5340 DK and the nRF Connect SDK Version 2.6.1.

We are experiencing an issue with disconnections. We use bt_conn_ref() when receiving the connected event and bt_conn_unref() when receiving the disconnection event.

However, sometimes after a disconnection, when we retry, we encounter the error: "bt_conn: bt_conn_exists_le: Found valid connection (0x20001a18) with address FF:F5:55:5A:D6:6A (random) in disconnected state".

Why does this issue occur even after the disconnection event has been processed?

If we call bt_conn_unref() twice during the disconnection event as shown in the below code (essentially incrementing once upon connection and decrementing twice upon disconnection), the issue does not occur.

void connected(struct bt_conn *conn, uint8_t conn_err)
{
	bt_conn_ref(conn);
}

void disconnected(struct bt_conn *conn, uint8_t reason)
{
	bt_conn_unref(conn);
	bt_conn_unref(conn);
}

Could this behavior be due to an internal counter issue?
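
For reference, the counting rule behind this question can be modelled on a host PC. The sketch below is not Zephyr code: `mock_conn`, `mock_ref()`, `mock_unref()` and the other `mock_*` helpers are hypothetical stand-ins for the connection object, bt_conn_ref() and bt_conn_unref(). It assumes that, in the central role, creating a connection with bt_conn_le_create() hands one reference to the caller, which is how that API is documented. Under that assumption, an application that also calls bt_conn_ref() in connected() holds two references, so a single unref would leave the object allocated. Whether this is what happens in the project above is not confirmed anywhere in this thread.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a connection object; not the Zephyr struct bt_conn. */
struct mock_conn {
	int host_refs; /* references held by the stack itself */
	int app_refs;  /* references held by the application */
};

/* Models bt_conn_ref(): the application takes one more reference. */
static void mock_ref(struct mock_conn *c) { c->app_refs++; }

/* Models bt_conn_unref(): the application gives one reference back. */
static void mock_unref(struct mock_conn *c) { c->app_refs--; }

/* Assumption: central-side creation gives the caller its own reference. */
static void mock_central_create(struct mock_conn *c)
{
	c->host_refs = 1;
	c->app_refs = 1;
}

/* Models disconnect processing: the stack drops its own reference. */
static void mock_host_release(struct mock_conn *c) { c->host_refs = 0; }

/* The object can only be reused once nobody holds a reference. */
static bool mock_is_free(const struct mock_conn *c)
{
	return (c->host_refs + c->app_refs) == 0;
}

/* One connect/disconnect cycle with N unrefs in disconnected(). */
static bool central_cycle_frees_slot(int unrefs_in_disconnected)
{
	struct mock_conn c = {0};

	mock_central_create(&c); /* reference from connection creation */
	mock_ref(&c);            /* extra reference taken in connected() */
	for (int i = 0; i < unrefs_in_disconnected; i++) {
		mock_unref(&c);  /* unref calls made in disconnected() */
	}
	mock_host_release(&c);
	return mock_is_free(&c);
}
```

In this model, one unref leaves the object allocated and two unrefs free it. If the assumption holds for the real application, keeping exactly one application reference per connection would probably be cleaner than calling bt_conn_unref() twice.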

0 Vidar Berg 11 months ago

Hi,

I'm not aware of any known issues where the Bluetooth host fails to release its own reference. Are you developing a peripheral or central application?

Thanks,

Vidar

0 AKV 11 months ago in reply to Vidar Berg

Hi,

Thanks for the response.

We are developing a mesh network. Each device must be able to connect to three other devices: one connection as a peripheral and up to two connections as a central.

0 Vidar Berg 9 months ago in reply to AKV

You should be able to use the WDT (Task Watchdog or WD driver) as a temporary workaround for the problem. At least one of the WD reload registers needs to be reserved for the main loop, to catch the scenario where bt_gatt_write_without_response_cb() does not return.

0 AKV 9 months ago in reply to Vidar Berg
Hi,

The bt_gatt_write_without_response_cb() function becomes blocking when we continue sending data after the stack buffer is full. To prevent this blocking behavior, we will stop sending data before the buffer reaches its limit.

The maximum number of supported connections is 3, and CONFIG_BT_BUF_ACL_TX_COUNT is set to 30, which allows 10 buffers for each connection. I'm using a write-complete callback to monitor buffer usage. If the buffer usage for any connection exceeds 10 because the stack buffers are full, I stop writing to that connection until it is cleared.

So our application will avoid entering a blocking state. You can review the logic that implements this in the broadcastMeshData() function.
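
The accounting described above can be sketched as a small host-side model. All names here (`tx_credits`, `tx_try_reserve()`, `tx_complete()`, `tx_reset()`, `MAX_CONN`, `PER_CONN_LIMIT`) are hypothetical and are not taken from the project or from the SDK; the limit of 10 simply mirrors CONFIG_BT_BUF_ACL_TX_COUNT=30 split across 3 connections. In a real application the reserve step would sit in front of bt_gatt_write_without_response_cb() and the release step in its completion callback.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CONN        3   /* connections tracked (assumed from the post) */
#define PER_CONN_LIMIT  10  /* 30 ACL TX buffers shared by 3 connections */

/* Number of writes queued but not yet reported complete, per connection. */
struct tx_credits {
	unsigned int in_flight[MAX_CONN];
};

/* Call before a write. Returns false if this connection is at its limit. */
static bool tx_try_reserve(struct tx_credits *t, size_t idx)
{
	if (idx >= MAX_CONN || t->in_flight[idx] >= PER_CONN_LIMIT) {
		return false;
	}
	t->in_flight[idx]++;
	return true;
}

/* Call from the write-complete callback to hand one buffer back. */
static void tx_complete(struct tx_credits *t, size_t idx)
{
	if (idx < MAX_CONN && t->in_flight[idx] > 0) {
		t->in_flight[idx]--;
	}
}

/* Call when the connection goes away so stale counts do not block a reconnect. */
static void tx_reset(struct tx_credits *t, size_t idx)
{
	if (idx < MAX_CONN) {
		t->in_flight[idx] = 0;
	}
}
```

The model is single-threaded. If the reserve and release steps run in different thread contexts on the target, the counters would need atomic operations or a lock.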

If you'd like to test without blocking the bt_gatt_write_without_response_cb() function, you can modify the code as shown below. I've added comments indicating which lines need to be commented out and which ones should be uncommented.

if ((!StackBufferManage.BufferFull[i]) && (ConnectionData.node_conn[i].HandshakeDone)) { // un comment this line
// if ((ConnectionData.node_conn[i].HandshakeDone)) { // comment this line
#if (BLE_TXMTION_TYPE == WRITE_WITHOUT_RES)
	err = ble_gatt_write_txmtion(&ConnectionData.node_conn[i], data, len);
#elif (BLE_TXMTION_TYPE == NOTIFICATION)
	if (bt_gatt_is_subscribed(ConnectionData.node_conn[i].conn, notify_attr, BT_GATT_CCC_NOTIFY))
		err = bt_gatt_notify_cb(ConnectionData.node_conn[i].conn, &tx_notify_params);
#endif
	if (!err) {
		StackBufferManage.time[i] = k_uptime_get_32();
		StackBufferManage.InIndex[i]++;
		// printk("TX[%d] \n", i);
	} else {
		printk("bt_gatt_write error : %d\n", err);
	}
}

The disconnection event is still missing, even when the bt_gatt_write_without_response_cb() function does not block.

0 Vidar Berg 9 months ago in reply to AKV

Hi,

I understand, but bt_gatt_write_without_response_cb() should not remain blocking after the connection has been lost, so I suspect it may be a symptom of the same root cause that prevents the disconnect event from coming through. I'm currently working with R&D to find out if this may be caused by a bug in the BT stack.

0 JeffW 9 months ago in reply to Vidar Berg

Hi, I'm also seeing this issue on NCS v2.7.0.
The nRF53 misses the disconnect event specifically when I have a pending GATT write and a device disconnects ungracefully.
Even the supervision timeout does not cause a disconnect event.

When the ATT timeout hits (30 s later), it unsubscribes from the characteristic. If I call disconnect from this context and bt_conn_unref() my indexes, the internal conn state stays at Disconnecting, and new scans show the device as already having a conn, so a new connection cannot be made.

Is there any progress here, or should I work towards the newer Bluetooth host controller, which does not use the system workqueue? :)

0 Vidar Berg 9 months ago in reply to JeffW
Hi, yes, we have found that sending ATT packets from the 'BT RX' thread context can result in a deadlock if the peer device stops responding, which contradicts the expected behavior described in the API documentation. The deadlock prevents the processing of the disconnect event and the freeing of stack buffers. Unfortunately, we were also able to reproduce this with SDK v2.7.0.

The expected behaviour according to the documentation:

I'm not sure how we plan to address this bug yet, but to eliminate the issue for now, I suggest removing any ATT requests being sent from the 'BT RX' thread (BT callbacks). Alternatively, ensure that ATT requests from this thread are not sent in parallel with other threads and that the number of requests is limited to mitigate the risk of running out of buffers. For instance, if you are performing service discovery, ensure the app waits for the discovery to fully complete before initiating any ATT requests from other threads such as 'main'. The same also applies to the MTU exchange.

To help locate potentially problematic API calls (e.g. GATT writes made from BT callbacks), you can instrument the code by inserting the following snippet before the bt_l2cap_create_pdu_timeout() call:

if (IS_ENABLED(CONFIG_THREAD_NAME)) {
	k_tid_t current_thread = k_current_get();
	const char *threadname = k_thread_name_get(current_thread);

	if ((strcmp(threadname, "BT RX") == 0) && (timeout.ticks == -1)) {
		LOG_ERR("API violation - likely calling Bluetooth API from callback");
	}
}

Then build the project with CONFIG_LOG=y and CONFIG_THREAD_NAME=y. You can place a breakpoint at the LOG_ERR() line and use the call stack to determine where the call originated.

AKV, I believe the deadlock in your project may be caused by the meshDataReceivedHandler() callback. Please try offloading the data sending from this callback to another thread or the workqueue.
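
The offloading suggested above can be modelled on a host PC as a small FIFO: the receive callback only copies the data into a queue, and a separate context drains the queue and does the actual send. This is a sketch under assumptions, not the project's code: `defer_queue`, `defer_push()` and `defer_pop()` are hypothetical names, and the model is single-threaded with no locking. In a Zephyr application the same role would typically be filled by a kernel primitive such as a message queue (k_msgq) or a work item (k_work) rather than a hand-written buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEFER_SLOTS    8   /* queue depth, chosen arbitrarily for the sketch */
#define DEFER_MAX_LEN  64  /* largest payload the sketch will queue */

struct defer_item {
	uint8_t data[DEFER_MAX_LEN];
	size_t len;
};

struct defer_queue {
	struct defer_item items[DEFER_SLOTS];
	size_t head;  /* index of the oldest queued item */
	size_t count; /* number of queued items */
};

/* Receive-callback side: copy the payload and return at once, never send here. */
static bool defer_push(struct defer_queue *q, const uint8_t *data, size_t len)
{
	if (q->count == DEFER_SLOTS || len > DEFER_MAX_LEN) {
		return false; /* queue full or payload too large: caller decides */
	}
	size_t tail = (q->head + q->count) % DEFER_SLOTS;

	memcpy(q->items[tail].data, data, len);
	q->items[tail].len = len;
	q->count++;
	return true;
}

/* Sender side (another thread or work handler): take the oldest item, if any. */
static bool defer_pop(struct defer_queue *q, struct defer_item *out)
{
	if (q->count == 0) {
		return false;
	}
	*out = q->items[q->head];
	q->head = (q->head + 1) % DEFER_SLOTS;
	q->count--;
	return true;
}
```

The point of the split is that the callback never waits for a TX buffer, so the thread that delivers Bluetooth events stays free to process the disconnect.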

0 AKV 9 months ago in reply to Vidar Berg

Hi,

Vidar Berg said:
Please try offloading the data sending from this callback to another thread or the workqueue.

I removed all the data-sending functions from the RX callback meshDataReceivedHandler() and tested again, but the issue persists.

0 Vidar Berg 9 months ago in reply to AKV

Hi,

Yes, but what about the other things I mentioned? MTU exchange, etc.

0 AKV 9 months ago in reply to Vidar Berg

Hi,

I am not using service discovery, and I am not sending any packets before the MTU exchange has completed.

0 Vidar Berg 9 months ago in reply to AKV

Hi,

Thanks for confirming. Could you please upload your revised code in the private ticket?

0 JeffW 9 months ago in reply to Vidar Berg

Hi Vidar,

I appreciate you looking into this for me as well.

I also removed all GATT sends from the BT RX WQ thread, and I am still missing the disconnect event.

If there are any additional suggestions from the R&D team, please pass them along. :)
It looks like upstream Zephyr has a complete BLE host rewrite, so that might be in the cards.

--Jeff