[SDC/Zephyr] GATT notify TX path wedges (no HCI 0x13 credits) → bt_l2cap_create_pdu_timeout() blocks. Any workaround to recover from deadlock without disconnect? (NCS v3.0.2)

On a BLE peripheral (nRF52840) built with nRF Connect SDK v3.0.2, high-rate GATT notifications sometimes push the stack into a state where no Number of Completed Packets events (HCI event 0x13) arrive for the active connection. After that, bt_gatt_notify_cb() completion callbacks stop, the app's in-flight window never drains, the queue backs up, and eventually bt_l2cap_create_pdu_timeout() blocks in net_buf_alloc(). Stopping scanning, disabling vendor events, and waiting do not recover it. Only a reset (or disconnect) clears it.

I am aware of the known issues listed here: https://docs.nordicsemi.com/bundle/ncs-latest/page/nrf/releases_and_maturity/known_issues.html. What I am looking for is a safe workaround that resets the BLE stack without tearing down the user connection, ideally something host-side (e.g., controller/host re-init) that does not require a full MCU reset. I am getting ready for a production release, but I cannot get past this deadlock. The most annoying part is that it can take several hours to reproduce. So far the only recovery I have is to let the WDT fire or to force a reset in software.

Symptom summary:

The kernel is blocked inside bt_l2cap_create_pdu_timeout() → net_buf_alloc(), waiting on a TX buffer. The controller still sends other HCI events, so I know the SoftDevice Controller (SDC) is still alive.
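Since the stack gives no direct signal that the credit path has wedged, one host-side option is an application-level stall detector: record a timestamp in every notification completion callback and check it periodically. Below is a minimal, platform-free sketch of just the detection logic; the `tx_stall_*` names and the 5-second window are my own, and in the real app `now_ms` would come from k_uptime_get_32() and the checks would run from a k_work_delayable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stall detector: TX is "stalled" when notifications are
 * in flight but no completion callback has run for window_ms. */
struct tx_stall {
    uint32_t last_complete_ms; /* updated from the notify complete callback */
    uint32_t inflight;         /* notifications sent but not yet completed */
};

static void tx_stall_on_send(struct tx_stall *s, uint32_t now_ms)
{
    if (s->inflight == 0) {
        s->last_complete_ms = now_ms; /* restart the window on first send */
    }
    s->inflight++;
}

static void tx_stall_on_complete(struct tx_stall *s, uint32_t now_ms)
{
    s->inflight--;
    s->last_complete_ms = now_ms;
}

/* True when in-flight data exists but nothing has completed for window_ms. */
static bool tx_stalled(const struct tx_stall *s, uint32_t now_ms,
                       uint32_t window_ms)
{
    return s->inflight > 0 && (now_ms - s->last_complete_ms) >= window_ms;
}
```

In Zephyr this would be driven from the `complete` callback in `struct bt_gatt_notify_params`; on a detected stall the only recoveries confirmed in this thread are a disconnect or a reset, but at least detection lets you act long before the WDT does.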

  • Hi,

    Can you try to set a CONFIG_BT_BUF_CMD_TX_COUNT with a higher value in your prj.conf and see if that is a workaround?

I have increased CONFIG_BT_BUF_CMD_TX_COUNT.

    Here is a snippet from my conf:

    # v0.0.10: Trying to fix the deadlocking issue with tx_notify
    # for tx_notify deadlock workaround, https://docs.nordicsemi.com/bundle/ncs-latest/page/nrf/releases_and_maturity/known_issues.html
    CONFIG_BT_HCI_ACL_FLOW_CONTROL=y
    CONFIG_BT_BUF_ACL_TX_COUNT=24
    CONFIG_BT_BUF_EVT_RX_COUNT=25
    CONFIG_BT_BUF_CMD_TX_COUNT=20
    CONFIG_BT_ATT_TX_COUNT=10
    CONFIG_BT_CONN_TX_MAX=24
    CONFIG_BT_CONN_TX_NOTIFY_WQ=y

I am currently checking for potential reference leaks where net_buf objects may not be getting freed with net_buf_unref(). I have hit a state where the att_pool still had one net_buf allocated and stayed that way for several hundred seconds. So I am running an experimental hack that catches orphaned net_buf objects in the att_pool and brute-force unreferences the leftovers, to see whether that lets bt_gatt_notify_cb() proceed. At the very least this should tell me whether the problem is a leak or the consumer itself has stopped consuming. When I dive deeper into the code, I cannot follow bt_l2cap_chan_send() much further because I don't have the source for that path unless I enable CONFIG_BT_L2CAP_DYNAMIC_CHANNEL, so the code path past that point is unclear to me. I am now trying to reproduce the issue again so I can see if my experiment works, but time-to-failure ranges from a few minutes to several hours.
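To separate "buffers leaked" from "consumer stopped consuming" without forcibly unreferencing anything, a lightweight outstanding-buffer ledger can help: record each buffer pointer when the app sends it, clear the entry when the completion callback (or unref) runs, and flag anything that stays outstanding too long. This is a platform-free sketch; the `buf_ledger` name and slot count are hypothetical, and in the real app the hooks would sit in the notify path around allocation and completion.

```c
#include <stddef.h>
#include <stdint.h>

#define LEDGER_SLOTS 32 /* should be >= pool size, e.g. CONFIG_BT_ATT_TX_COUNT */

struct buf_ledger {
    const void *buf[LEDGER_SLOTS];   /* outstanding buffer pointers */
    uint32_t since_ms[LEDGER_SLOTS]; /* when each went outstanding */
};

/* Record a buffer as outstanding; returns -1 if the ledger is full. */
static int ledger_add(struct buf_ledger *l, const void *buf, uint32_t now_ms)
{
    for (size_t i = 0; i < LEDGER_SLOTS; i++) {
        if (l->buf[i] == NULL) {
            l->buf[i] = buf;
            l->since_ms[i] = now_ms;
            return 0;
        }
    }
    return -1;
}

/* Clear a buffer's entry once it completes or is unreferenced. */
static void ledger_remove(struct buf_ledger *l, const void *buf)
{
    for (size_t i = 0; i < LEDGER_SLOTS; i++) {
        if (l->buf[i] == buf) {
            l->buf[i] = NULL;
            return;
        }
    }
}

/* Count buffers outstanding longer than age_ms: the leak candidates. */
static size_t ledger_stale(const struct buf_ledger *l, uint32_t now_ms,
                           uint32_t age_ms)
{
    size_t n = 0;

    for (size_t i = 0; i < LEDGER_SLOTS; i++) {
        if (l->buf[i] != NULL && (now_ms - l->since_ms[i]) >= age_ms) {
            n++;
        }
    }
    return n;
}
```

If `ledger_stale()` reports entries while the pool is exhausted, the app (or stack) is holding references; if the ledger is clean but the pool is still empty, the consumer side has stopped returning credits.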

I have run the abomination of code below to confirm my suspicion: if I clean up references after a few seconds of no change in the pool's allocation count, the application lives longer and keeps functioning. This code is dangerous and causes other issues, but with it bt_gatt_notify_cb() continues to run without blocking.

    /* ⚠️ Danger: forcibly drop refs on every live buf in the pool.
     * Experiment only: a consumer may still hold these pointers.
     * Note: pool buffers are not a plain array of struct net_buf;
     * each slot is padded for the pool's user data, so index by byte
     * stride (mirroring the layout used in zephyr's net_buf.c) rather
     * than b++.
     */
    static void force_free_pool(struct net_buf_pool *pool)
    {
        size_t stride = ROUND_UP(sizeof(struct net_buf) + pool->user_data_size,
                                 __alignof__(struct net_buf));
        unsigned int key = irq_lock();   /* reduce races for this experiment */

        for (uint16_t i = 0; i < pool->buf_count; i++) {
            struct net_buf *b =
                (struct net_buf *)((uint8_t *)pool->__bufs + i * stride);

            /* Freed bufs have ref == 0, so this only touches live ones. */
            while (b->ref > 0) {
                LOG_WRN("FORCE unref buf=%p ref=%u len=%u", b, b->ref, b->len);
                net_buf_unref(b);
            }
        }
        irq_unlock(key);
    }

I might also concede that this is a balancing game between CONFIG_BT_ATT_TX_COUNT and my target throughput, but getting stuck for hundreds of seconds at a time is not acceptable.
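One way to take the pressure off net_buf_alloc() entirely is to never let the app queue more notifications than the stack has TX contexts: take a credit before each bt_gatt_notify_cb() call and return it in the completion callback, so the app stops at its own layer instead of blocking deep inside bt_l2cap_create_pdu_timeout(). In Zephyr this is naturally a k_sem initialized to CONFIG_BT_ATT_TX_COUNT; below is a platform-free sketch of just the credit logic (the names and the non-blocking "drop or queue" policy are my own).

```c
#include <stdbool.h>
#include <stdint.h>

/* Credit-based flow control for the notify path. In the real app this
 * would be a k_sem given back in the notify complete callback; a plain
 * counter here keeps the policy easy to see and to test. */
struct notify_credits {
    uint32_t available; /* initialize to CONFIG_BT_ATT_TX_COUNT */
};

/* Call before sending a notification; false means "don't send now". */
static bool credits_try_take(struct notify_credits *c)
{
    if (c->available == 0) {
        return false; /* window full: drop or queue at the app layer */
    }
    c->available--;
    return true;
}

/* Call from the notification complete callback. */
static void credits_give(struct notify_credits *c)
{
    c->available++;
}
```

The useful property is that if the controller stops returning Number of Completed Packets credits, the app halts at its own bounded window instead of exhausting the ATT pool, and the stall becomes visible (credits_try_take() keeps failing) rather than manifesting as a blocked kernel thread.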

  • Hi,

I understand the problem. We are looking into fixing the known issues in the Bluetooth stack, but as you can see from the descriptions of the relevant known issues, the workaround is typically to increase the relevant buffer counts, and as you have experienced this is not guaranteed to resolve all issues in all cases.

There is also no supported way to clear the buffers or reset parts of the stack while maintaining a connection, unfortunately.
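Given that, the least-bad recovery ladder on the application side seems to be: detect the stall, first try bt_conn_disconnect() (which this thread confirms clears the wedge), then fall back to bt_disable() followed by bt_enable() (restarting the stack without an MCU reset), and only then a full reboot. The escalation policy below is a platform-free sketch with hypothetical thresholds; the action names map to the Zephyr calls just mentioned, and all of the non-trivial actions do drop the user connection.

```c
#include <stdint.h>

/* Recovery actions, mildest first. */
enum recovery_action {
    RECOVER_NONE,       /* keep waiting */
    RECOVER_DISCONNECT, /* bt_conn_disconnect(): clears the wedge per this thread */
    RECOVER_BT_RESTART, /* bt_disable() + bt_enable(): avoids an MCU reset */
    RECOVER_MCU_RESET,  /* sys_reboot(), or let the WDT fire */
};

/* Pick an action from how long TX has been stalled. The thresholds are
 * hypothetical and would need tuning against the real failure. */
static enum recovery_action pick_recovery(uint32_t stalled_ms)
{
    if (stalled_ms < 5000) {
        return RECOVER_NONE;
    }
    if (stalled_ms < 15000) {
        return RECOVER_DISCONNECT;
    }
    if (stalled_ms < 30000) {
        return RECOVER_BT_RESTART;
    }
    return RECOVER_MCU_RESET;
}
```

This does not satisfy the original "without tearing down the user connection" goal, which Nordic's reply rules out, but a deliberate disconnect plus fast reconnection is far less disruptive in production than waiting hundreds of seconds for a watchdog reset.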
