Central stalling with two peripherals when one is disconnecting

Question

Hi everyone, I have a central firmware application connected to 2 peripheral devices: A and B. 
 I use the method bt_gatt_write_without_response_cb (from zephyr>subsys>bluetooth>host>gatt.c ) to send my data to the different peripherals. To avoid overloading any internal stack buffers, I use flow control: for each peripheral, an individual semaphore is taken when sending and given in the supplied callback. This is inspired by https://devzone.nordicsemi.com/f/nordic-q-a/110250/multi-nus-sending-several-messages-in-a-row . As a consequence, I never see the error -ENOMEM as a return value from bt_gatt_write_without_response_cb .
 The problem: if one of the peripherals, say A, loses its connection, messages to peripheral B also get stuck. The observable reason is that the callback passed to bt_gatt_write_without_response_cb is not executed. Messages to B resume as soon as we get a disconnection event and, subsequently, the expected callback for B. Typically, the disconnection event is triggered after the supervision timeout of the BLE connection, which is too long for peripheral B. In some cases it is triggered directly upon losing the connection (i.e. long before the supervision timeout). These cases are not problematic, even though I do not understand at which level the disconnect event is triggered.
 Is this expected behavior due to some limitation in the stack? That would really surprise me, because I do not see why other devices could not continue communicating while a specific device is stalled or disconnected. In any case, I would greatly appreciate it if somebody could explain to me what is happening, or even offer a solution.
 Some info concerning my setup:
 
 I work with the nRF SDK v3.1.0 
 Settings of relevant Kconfigs: 
 
 """
 CONFIG_BT=y
 CONFIG_BT_CENTRAL=y
 CONFIG_BT_HCI=y
 CONFIG_BT_HCI_ACL_FLOW_CONTROL=y
 CONFIG_BT_CONN_TX_NOTIFY_WQ=y
 CONFIG_BT_GATT_CLIENT=y
 CONFIG_BT_GATT_DM=y
 CONFIG_BT_MAX_CONN=3
 CONFIG_BT_BUF_ACL_TX_COUNT=6
 CONFIG_BT_BUF_CMD_TX_COUNT=6
 """
 Kind regards, Alberto

Alberto Calatroni · Answer

Hi again Edvin and Nordic team, We have some updates, and we think we now understand what is happening. Our hypothesis is that a "write without response" triggers the callback once the message has been successfully passed to the host, which does not mean that the host or controller has actually sent it out yet. As a consequence, the internal buffers fill up if one of the connections is bad. We have now reverted to the API that triggers a "write with response", and we wait for the corresponding callback before releasing the semaphore again. This way, we only proceed when we are sure that the message was actually sent (and acknowledged). This somewhat reduces the throughput, since we need to skip at least one connection interval, but that is acceptable for our application. 
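 The hypothesis above can be illustrated with a toy model in plain C. This is only a sketch of the reasoning, not Zephyr code and not a description of the actual stack internals: the names `simulate`, `try_send`, `connection_event`, `struct link` and `enum release_policy` are hypothetical, and the only value taken from the post is the pool size of 6 (mirroring CONFIG_BT_BUF_ACL_TX_COUNT=6). It assumes a single buffer pool shared by all links, a link A that never drains, and a healthy link B.

```c
#include <assert.h>
#include <stdbool.h>

#define TX_BUF_COUNT 6 /* mirrors CONFIG_BT_BUF_ACL_TX_COUNT=6 from the post */

/* When the per-link flow-control permit is handed back to the application. */
enum release_policy {
    RELEASE_ON_QUEUE, /* as soon as the buffer is queued (hypothesised behavior) */
    RELEASE_ON_ACK,   /* only after the peer acknowledged (write with response) */
};

struct link {
    bool stalled;  /* true: peer is gone, nothing drains until disconnect */
    bool permit;   /* stands in for the per-peripheral semaphore (count 1) */
    int queued;    /* buffers of the shared pool currently held by this link */
    int delivered; /* messages that reached the peer */
};

struct sim_result {
    int delivered_b; /* messages delivered to the healthy peer B */
    int held_by_a;   /* shared buffers stuck on the stalled link A */
};

/* Try to queue one message; fails without a permit or a free buffer. */
static bool try_send(struct link *l, int *free_bufs, enum release_policy policy)
{
    if (!l->permit || *free_bufs == 0) {
        return false;
    }
    l->permit = false;
    (*free_bufs)--;
    l->queued++;
    if (policy == RELEASE_ON_QUEUE) {
        l->permit = true; /* "callback" fires although nothing was sent yet */
    }
    return true;
}

/* One connection event: a healthy link flushes its queue, a stalled one does not. */
static void connection_event(struct link *l, int *free_bufs)
{
    if (l->stalled) {
        return;
    }
    *free_bufs += l->queued;
    l->delivered += l->queued;
    l->queued = 0;
    l->permit = true; /* data was acknowledged, so the permit is available again */
}

/* Each round the application tries one message to A and one to B. */
struct sim_result simulate(enum release_policy policy, int rounds)
{
    struct link a = { .stalled = true, .permit = true };
    struct link b = { .stalled = false, .permit = true };
    int free_bufs = TX_BUF_COUNT;

    for (int i = 0; i < rounds; i++) {
        (void)try_send(&a, &free_bufs, policy);
        (void)try_send(&b, &free_bufs, policy);
        connection_event(&a, &free_bufs);
        connection_event(&b, &free_bufs);
    }
    return (struct sim_result){ .delivered_b = b.delivered, .held_by_a = a.queued };
}
```

 In this model, `simulate(RELEASE_ON_QUEUE, 20)` delivers only 5 messages to B before link A holds all 6 buffers, while `simulate(RELEASE_ON_ACK, 20)` delivers all 20 because A can never hold more than one buffer. Whether the real host and controller behave this way is exactly the point that needs confirmation.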
 Could you maybe confirm that our understanding is correct? Maybe it would be helpful to improve the include files / documentation regarding these behaviors. 
 Kind regards, Alberto