[SDC/Zephyr] GATT notify TX path wedges (no HCI 0x13 credits) → bt_l2cap_create_pdu_timeout() blocks. Any workaround to recover from deadlock without disconnect? (NCS v3.0.2)

On a BLE peripheral (nRF52840) built with nRF Connect SDK v3.0.2, high-rate GATT notifications sometimes push the stack into a state where no HCI Number of Completed Packets (0x13) events arrive for the active connection. After that, the bt_gatt_notify_cb() completion callbacks stop, the app’s in-flight window never drains, the queue backs up, and eventually bt_l2cap_create_pdu_timeout() blocks in net_buf_alloc(). Stopping scanning, disabling vendor events, or waiting does not recover it. Only a reset (or disconnect) clears it.

I am aware of the known issues listed here: https://docs.nordicsemi.com/bundle/ncs-latest/page/nrf/releases_and_maturity/known_issues.html. What I am looking for is a safe workaround to reset the BLE stack without tearing down the user connection, ideally something host-side (e.g., controller/host re-init) that doesn’t require a full MCU reset. I am getting ready to release to production, but I cannot get around this deadlock. The most annoying part is that it can take several hours to reproduce. So far the only ways I have to recover are to let the WDT reset the device or to force a reset in software.

Symptom summary:

Kernel is blocked inside bt_l2cap_create_pdu_timeout() → net_buf_alloc() waiting on a TX buffer. The controller still sends other HCI events so I know the SDC is still alive.
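
For anyone instrumenting the same failure, here is a minimal sketch of the in-flight window accounting described above, extended with a stall check. It is plain C with no Zephyr dependencies; `struct tx_window` and every `tx_window_*` function are hypothetical names for illustration, not part of Zephyr or the SDC. The idea is that a window which still holds entries but has seen no completion for `stall_ms` is treated as wedged, so the application can choose its recovery action instead of blocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-flight window tracker (illustration only). */
struct tx_window {
    uint32_t max_in_flight;    /* notifications allowed before the app pauses */
    uint32_t in_flight;        /* queued to the stack, completion not yet seen */
    int64_t  last_progress_ms; /* time of last completion, or of leaving idle */
};

void tx_window_init(struct tx_window *w, uint32_t max_in_flight, int64_t now_ms)
{
    w->max_in_flight = max_in_flight;
    w->in_flight = 0;
    w->last_progress_ms = now_ms;
}

/* Call before queuing a notification; returns false when the window is full. */
bool tx_window_try_acquire(struct tx_window *w, int64_t now_ms)
{
    if (w->in_flight >= w->max_in_flight) {
        return false;
    }
    if (w->in_flight == 0) {
        /* Leaving idle: restart the stall timer so idle time is not counted. */
        w->last_progress_ms = now_ms;
    }
    w->in_flight++;
    return true;
}

/* Call from the notification completion callback. */
void tx_window_on_complete(struct tx_window *w, int64_t now_ms)
{
    assert(w->in_flight > 0); /* a completion without a matching acquire is a bug */
    w->in_flight--;
    w->last_progress_ms = now_ms;
}

/* True when something is outstanding and nothing has completed for stall_ms. */
bool tx_window_is_stalled(const struct tx_window *w, int64_t now_ms, int64_t stall_ms)
{
    return w->in_flight > 0 && (now_ms - w->last_progress_ms) >= stall_ms;
}
```

In a Zephyr application this would presumably be wired up by calling `tx_window_try_acquire()` before `bt_gatt_notify_cb()`, calling `tx_window_on_complete()` from the `func` callback in `struct bt_gatt_notify_params`, and passing `k_uptime_get()` as `now_ms`. The sketch only detects the stall; it does not recover from it.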

Top Replies

Einar Thorsrud 8 months ago in reply to dcooperch +1 verified

Hi, Solving these issues has priority and is currently being worked on. It could be that the main problem in this case is solved by this fix, which is part of SDK 3.1.0 or higher. These issues in…


0 Einar Thorsrud 9 months ago

Hi,

Can you try setting CONFIG_BT_BUF_CMD_TX_COUNT to a higher value in your prj.conf and see if that works as a workaround?

0 dcooperch 9 months ago in reply to Einar Thorsrud
I have increased CONFIG_BT_BUF_CMD_TX_COUNT.

Here is a snippet from my conf:

# v0.0.10: Trying to fix the deadlocking issue with tx_notify
# for tx_notify deadlock workaround, https://docs.nordicsemi.com/bundle/ncs-latest/page/nrf/releases_and_maturity/known_issues.html
CONFIG_BT_HCI_ACL_FLOW_CONTROL=y
CONFIG_BT_BUF_ACL_TX_COUNT=24
CONFIG_BT_BUF_EVT_RX_COUNT=25
CONFIG_BT_BUF_CMD_TX_COUNT=20
CONFIG_BT_ATT_TX_COUNT=10
CONFIG_BT_CONN_TX_MAX=24
CONFIG_BT_CONN_TX_NOTIFY_WQ=y

I am currently checking for potential reference leaks where net_buf objects may not be getting freed with net_buf_unref(). I have run into a case where the att_pool still had one net_buf allocated and stayed in that state for several hundred seconds. So I am running an experimental hack to catch orphaned net_buf objects in the att_pool and see whether unreferencing the leftovers by brute force allows `gatt_notify_cb()` to proceed. At the very least this would help determine whether the problem is a leak or whether the consumer itself has stopped consuming. When I dive deeper into the code, it seems that I cannot follow `bt_l2cap_chan_send` because I don't have the source code for it unless I enable CONFIG_BT_L2CAP_DYNAMIC_CHANNEL, so the code path is a little sketchy past that point. Right now I am trying to reproduce the issue again so I can see if my experiment works, but the time to failure can be anywhere from a few minutes to several hours.

0 dcooperch 9 months ago in reply to dcooperch

I have run this abomination of code to confirm my suspicions: if I clean up references after a few seconds of no change to the pool size, the application lives longer and continues to function. But this hack is dangerous and causes other issues. What I can tell you is that `gatt_notify_cb` continues to run without blocking.

/* ⚠️ Danger: forcibly drop refs on all live bufs in the pool. */
static void force_free_pool(struct net_buf_pool *pool)
{
    unsigned int key = irq_lock();   /* reduce races for this experiment */
    struct net_buf *b = pool->__bufs;

    for (int i = 0; i < pool->buf_count; i++, b++)
    {
        if (b->len)
        {
            while (b->ref > 0) {
                LOG_WRN("FORCE unref buf=%p ref=%d len=%u", b, b->ref, b->len);
                net_buf_unref(b);
            }
        }
    }
    irq_unlock(key);
}


0 dcooperch 9 months ago in reply to dcooperch

I might also concede that this is a balancing game between CONFIG_BT_ATT_TX_COUNT and my target performance, but getting stuck for hundreds of seconds at a time is not ideal.

0 Einar Thorsrud 9 months ago in reply to dcooperch

Hi,

I understand the problem. We are looking into fixing the known issues in the Bluetooth stack. As you can see from the descriptions of the relevant known issues, the workaround is typically to increase the relevant buffer sizes, but this is not guaranteed to resolve all issues in all cases, as you have seen.

Unfortunately, there is also no proper way to clear the buffers or reset parts of the stack while maintaining a connection.

0 dcooperch 9 months ago in reply to Einar Thorsrud

Thank you for your reply. I'm not convinced that the issue is directly tied to buffer exhaustion. I have added throttling in my application to monitor the buffers and wait for the consumer to catch up. With CONFIG_NET_BUF_POOL_USAGE enabled, I throttle based on the remaining avail_count value: before the att pool approaches 0, the application stops and waits. The problem I am seeing is that the consumer just stops for very long periods of time, either deadlocking outright or stalling for several hundred seconds.
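
To make the throttling idea concrete, here is a minimal sketch in plain C, not my actual application code. `struct pool_monitor`, `pool_monitor_init()` and `pool_monitor_poll()` are hypothetical names; the `avail` argument stands for whatever free-buffer count the application samples (for example the pool's avail_count when CONFIG_NET_BUF_POOL_USAGE is enabled). It separates "pool is low, wait for the consumer" from "pool is low and has not changed for `stall_ms`", which is the case I keep hitting.

```c
#include <assert.h>
#include <stdint.h>

/* What the producer should do on this poll (hypothetical names). */
enum throttle_action {
    THROTTLE_SEND,    /* enough free buffers, keep sending */
    THROTTLE_WAIT,    /* pool is low, give the consumer time to drain */
    THROTTLE_STALLED  /* pool is low and has not moved for stall_ms */
};

struct pool_monitor {
    uint32_t low_water;      /* pause sending at or below this free count */
    int64_t  stall_ms;       /* unchanged-for-this-long means stalled */
    uint32_t last_avail;     /* free count seen on the previous poll */
    int64_t  last_change_ms; /* when the free count last changed */
};

void pool_monitor_init(struct pool_monitor *m, uint32_t low_water,
                       int64_t stall_ms, uint32_t avail, int64_t now_ms)
{
    assert(stall_ms > 0);
    m->low_water = low_water;
    m->stall_ms = stall_ms;
    m->last_avail = avail;
    m->last_change_ms = now_ms;
}

/* Feed in the sampled free-buffer count and the current uptime in ms. */
enum throttle_action pool_monitor_poll(struct pool_monitor *m,
                                       uint32_t avail, int64_t now_ms)
{
    if (avail != m->last_avail) {
        /* Any change, up or down, counts as the pool still being alive. */
        m->last_avail = avail;
        m->last_change_ms = now_ms;
    }
    if (avail > m->low_water) {
        return THROTTLE_SEND;
    }
    if (now_ms - m->last_change_ms >= m->stall_ms) {
        return THROTTLE_STALLED;
    }
    return THROTTLE_WAIT;
}
```

On THROTTLE_STALLED the application would have to pick its own recovery (per the reply above, a disconnect or reset); the sketch only reports the condition.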