Central stalling with two peripherals when one is disconnecting

Hi everyone,

I have a central firmware application connected to 2 peripheral devices: A and B.

I use the method bt_gatt_write_without_response_cb (from zephyr>subsys>bluetooth>host>gatt.c) to send my data to the different peripherals. In order not to overload any internal stack buffers, I use flow control: for each peripheral an individual semaphore is taken when sending and given in the supplied callback. This is inspired by https://devzone.nordicsemi.com/f/nordic-q-a/110250/multi-nus-sending-several-messages-in-a-row.

As a consequence, I never see the error -ENOMEM as return value from bt_gatt_write_without_response_cb.

The problem: if one of the peripherals, say A, loses its connection, messages to peripheral B also get stuck. The observable reason is that the callback submitted to bt_gatt_write_without_response_cb is not executed. Messages to B resume as soon as we get a disconnection event and, subsequently, the expected callback for B. Typically, the disconnection event is triggered after the supervision timeout of the BLE connection, which is too long for peripheral B. In some cases it is triggered directly upon losing the connection (i.e. long before the supervision timeout). These cases are not problematic, even though I do not understand at which level the disconnect event is triggered.

Is this expected behavior due to some limitation in the stack? That would really surprise me, because I do not see why other devices could not continue communicating if a specific device is stalled or disconnected. In any case, I would greatly appreciate it if somebody could explain to me what is happening, or even offer a solution.

Some info concerning my setup:

  • I work with the nRF SDK v3.1.0
  • Settings of relevant Kconfigs:

"""
CONFIG_BT=y
CONFIG_BT_CENTRAL=y
CONFIG_BT_HCI=y
CONFIG_BT_HCI_ACL_FLOW_CONTROL=y
CONFIG_BT_CONN_TX_NOTIFY_WQ=y
CONFIG_BT_GATT_CLIENT=y
CONFIG_BT_GATT_DM=y
CONFIG_BT_MAX_CONN=3
CONFIG_BT_BUF_ACL_TX_COUNT=6
CONFIG_BT_BUF_CMD_TX_COUNT=6
"""

Kind regards,
Alberto

  • Hello,

    I am not sure how fast you are trying to push these messages. Could it be that you have filled the CONFIG_BT_ATT_TX_COUNT? The bt_gatt_write_without_response_cb() is a blocking call, meaning if it can't place the message in the buffer, it will block until it can fit the message. 

    Try increasing the CONFIG_BT_ATT_TX_COUNT and see if that helps. However, if you fill it up, it will still be blocking.

    You could try to keep the transmission of the two different peripherals to different threads, and monitor how many messages are in each queue (to make sure you don't fill up the CONFIG_BT_ATT_TX_COUNT). 

    Best regards,

    Edvin

  • Hi Edvin,

    Thanks for your quick answer. From the documentation I understand that the function blocks if called from another thread, but from the same thread it should just return -ENOMEM. This is the assumption on which we based our architecture. Do we misunderstand something?

    Best regards,
    Alberto

  • Hi again Edvin and Nordic team,

    We have some updates and we think we now understand what is happening.

    Our hypothesis is that the "write without response" triggers the callback once the message has been successfully passed to the host, which does not mean that the host / controller has actually sent it out yet. As a consequence, the internal buffers fill up if one of the connections is bad.
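    To make the hypothesis concrete, here is a toy model in plain C. It is only a sketch under stated assumptions: a single TX buffer pool shared by all connections, sized like our CONFIG_BT_BUF_ACL_TX_COUNT=6. All names (tx_pool, peer, queue_packet, ...) are hypothetical and this is not how the Zephyr host or the controller is actually implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only, not the Zephyr implementation.
 * POOL_SIZE mirrors CONFIG_BT_BUF_ACL_TX_COUNT=6 from the first post. */
#define POOL_SIZE 6

struct tx_pool {
    int free_bufs;      /* buffers currently available to any peer */
};

struct peer {
    bool link_alive;    /* false: packets are queued but never leave the radio */
    int held_bufs;      /* buffers occupied by this peer's unsent packets */
};

void pool_init(struct tx_pool *pool)
{
    pool->free_bufs = POOL_SIZE;
}

/* Try to queue one packet for a peer.
 * Returns false when the shared pool is exhausted. */
bool queue_packet(struct tx_pool *pool, struct peer *p)
{
    if (pool->free_bufs == 0) {
        return false;
    }
    pool->free_bufs--;
    p->held_bufs++;
    return true;
}

/* One connection event: a live link transmits its packets and frees the
 * buffers, a dead link keeps holding them. */
void connection_event(struct tx_pool *pool, struct peer *p)
{
    if (p->link_alive) {
        pool->free_bufs += p->held_bufs;
        p->held_bufs = 0;
    }
    assert(pool->free_bufs <= POOL_SIZE);
}

/* Disconnect (for example after the supervision timeout): pending packets
 * are dropped and their buffers return to the pool. */
void peer_disconnected(struct tx_pool *pool, struct peer *p)
{
    pool->free_bufs += p->held_bufs;
    p->held_bufs = 0;
    assert(pool->free_bufs <= POOL_SIZE);
}
```

    In this model, six queued packets for a stalled peer A are enough to make queue_packet() fail for a healthy peer B, and B only recovers once peer_disconnected() runs for A. That matches what we observe, although the real stack has several buffer layers.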

    What we have done now is revert to the API that triggers a "write with response", and wait for the corresponding callback before releasing the semaphore again. This way, we only proceed when we are sure that the message was actually sent (and acknowledged). This reduces the throughput somewhat, since we need to skip at least one connection interval, but that is acceptable for our application.

    Could you maybe confirm that our understanding is correct? Maybe it would be helpful to improve the include files / documentation regarding these behaviors.

    Kind regards,
    Alberto

  • Hello Alberto,

    Yes. That sounds about right!

    I have not seen your application, but every time you try to send a message you can check whether you are still connected (set some flag in the disconnected event).

    As you say, waiting for the TX complete callback will cause a drop in throughput. One option is to keep an internal counter of how many packets you have queued and how many have been acked, making sure you only have N messages queued at a time. Even better, if you keep track of which device each of these messages is meant for, you can clear that device's counter when it disconnects, as its packets are discarded. This way, you can always have enough messages in the queue to maximize your throughput, without overloading the Bluetooth stack when you receive a disconnect.
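    The counter idea could be sketched roughly like this in plain C. This is a minimal sketch, not tested against the stack: the names (peer_tx, try_reserve_tx, on_tx_complete, ...) and the budget MAX_IN_FLIGHT are made up for illustration. In a real application you would call try_reserve_tx() before bt_gatt_write_without_response_cb(), on_tx_complete() from the completion callback, and on_connected() / on_disconnected() from your connection callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PEERS     2   /* peripherals A and B */
#define MAX_IN_FLIGHT 3   /* hypothetical per-peer budget "N" */

struct peer_tx {
    bool connected;   /* flag maintained from the connection callbacks */
    int in_flight;    /* packets queued but not yet reported complete */
};

static struct peer_tx peers[MAX_PEERS];

void on_connected(size_t id)
{
    assert(id < MAX_PEERS);
    peers[id].connected = true;
    peers[id].in_flight = 0;
}

/* Call before queuing a packet. Returns true and reserves one slot if the
 * peer is connected and still below its budget. */
bool try_reserve_tx(size_t id)
{
    assert(id < MAX_PEERS);
    if (!peers[id].connected || peers[id].in_flight >= MAX_IN_FLIGHT) {
        return false;
    }
    peers[id].in_flight++;
    return true;
}

/* Call from the completion callback of the write. */
void on_tx_complete(size_t id)
{
    assert(id < MAX_PEERS);
    if (peers[id].in_flight > 0) {
        peers[id].in_flight--;
    }
}

/* Call from the disconnected event: pending packets for this peer are
 * discarded, so its counter is cleared. */
void on_disconnected(size_t id)
{
    assert(id < MAX_PEERS);
    peers[id].connected = false;
    peers[id].in_flight = 0;
}

/* Read-only accessor, useful for logging or monitoring. */
int tx_in_flight(size_t id)
{
    assert(id < MAX_PEERS);
    return peers[id].in_flight;
}
```

    The per-peer budget is a design choice: if the budgets of all peers together stay below the number of buffers the stack has available, one stalled peer should not be able to take every buffer. The right value for N depends on your buffer configuration, so treat 3 as a placeholder.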

    Best regards,

    Edvin
