[SDC/Zephyr] GATT notify TX path wedges (no HCI 0x13 credits) → bt_l2cap_create_pdu_timeout() blocks. Any workaround to recover from deadlock without disconnect? (NCS v3.0.2)

On a BLE peripheral (nRF52840) built with nRF Connect SDK v3.0.2, high-rate GATT notifications sometimes push the stack into a state where no Number of Completed Packets (HCI 0x13) events arrive for the active connection. After that, bt_gatt_notify_cb() callbacks stop, the app’s in-flight window never drains, the queue backs up, and eventually bt_l2cap_create_pdu_timeout() blocks in net_buf_alloc(). Stopping scanning, disabling vendor events, or waiting does not recover it. Only a reset (or a disconnect) clears it.

I am aware of the known issues listed here: https://docs.nordicsemi.com/bundle/ncs-latest/page/nrf/releases_and_maturity/known_issues.html. What I am looking for is a safe workaround that resets the BLE stack without tearing down the user connection, ideally something host-side (e.g., a controller/host re-init) that doesn’t require a full MCU reset. I am getting ready to release to production, but I cannot get around this deadlock, and the most annoying part is that it can take several hours to reproduce. So far the only ways I have to recover are to let the WDT reset the device or to force a reset in software.

Symptom summary:

The kernel is blocked inside bt_l2cap_create_pdu_timeout() → net_buf_alloc(), waiting on a TX buffer. The controller still sends other HCI events, so I know the SDC is still alive.
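To make the "in-flight window never drains" condition detectable before the sender blocks, the application can track its own in-flight count and the time of the last completion. The following is only a minimal sketch in plain C: every name in it (`tx_stall_tracker`, `tx_tracker_*`) is hypothetical and not part of Zephyr or NCS, and the timestamps are assumed to come from whatever millisecond uptime source the application already uses. It detects the stall; it does not recover from it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical app-side tracker; none of these names come from Zephyr or NCS. */
struct tx_stall_tracker {
    uint32_t in_flight;        /* notifications queued but not yet completed */
    int64_t last_progress_ms;  /* time of the last completion, or of the first send after idle */
    int64_t stall_timeout_ms;  /* time without progress that counts as a stall */
};

static void tx_tracker_init(struct tx_stall_tracker *t, int64_t now_ms,
                            int64_t timeout_ms)
{
    t->in_flight = 0;
    t->last_progress_ms = now_ms;
    t->stall_timeout_ms = timeout_ms;
}

/* Call right after a notification has been queued successfully. */
static void tx_tracker_on_send(struct tx_stall_tracker *t, int64_t now_ms)
{
    if (t->in_flight == 0) {
        /* Restart the clock when leaving idle so idle time is not counted. */
        t->last_progress_ms = now_ms;
    }
    t->in_flight++;
}

/* Call from the notification-complete callback. */
static void tx_tracker_on_complete(struct tx_stall_tracker *t, int64_t now_ms)
{
    if (t->in_flight > 0) {
        t->in_flight--;
    }
    t->last_progress_ms = now_ms;
}

/* True when something is in flight and nothing completed within the timeout. */
static bool tx_tracker_is_stalled(const struct tx_stall_tracker *t,
                                  int64_t now_ms)
{
    return t->in_flight > 0 &&
           (now_ms - t->last_progress_ms) >= t->stall_timeout_ms;
}
```

A periodic check of `tx_tracker_is_stalled()` from a thread that never calls into the TX path itself would let the application stop queuing and log state while it is still responsive, instead of discovering the problem only when the WDT fires.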

Parents
  • Hi,

    Can you try setting CONFIG_BT_BUF_CMD_TX_COUNT to a higher value in your prj.conf and see if that works around the issue?

  • I have increased CONFIG_BT_BUF_CMD_TX_COUNT.

    Here is a snippet from my conf:

    # v0.0.10: Trying to fix the deadlocking issue with tx_notify
    # for tx_notify deadlock workaround, https://docs.nordicsemi.com/bundle/ncs-latest/page/nrf/releases_and_maturity/known_issues.html
    CONFIG_BT_HCI_ACL_FLOW_CONTROL=y
    CONFIG_BT_BUF_ACL_TX_COUNT=24
    CONFIG_BT_BUF_EVT_RX_COUNT=25
    CONFIG_BT_BUF_CMD_TX_COUNT=20
    CONFIG_BT_ATT_TX_COUNT=10
    CONFIG_BT_CONN_TX_MAX=24
    CONFIG_BT_CONN_TX_NOTIFY_WQ=y

    I am currently checking for potential reference leaks where net_buf objects are not being freed with net_buf_unref(). I ran into a case where the att_pool still had one net_buf allocated and stayed in that state for several hundred seconds. So I am running an experimental hack that catches orphaned net_buf objects in the att_pool, to see whether brute-force unreferencing the leftovers allows `gatt_notify_cb()` to proceed. At the very least this should help determine whether the problem is a leak or whether the consumer itself has stopped consuming. Digging deeper into the code, I cannot follow `bt_l2cap_chan_send` because I don't have the source code for it unless I enable CONFIG_BT_L2CAP_DYNAMIC_CHANNEL, so the code path is a little sketchy past that point. Right now I am trying to reproduce the issue again to see if my experiment works, but the time to failure ranges from a few minutes to several hours.

  • I have run the abomination of code below to confirm my suspicions: if I clean up references after a few seconds of no change in the pool size, the application lives longer and continues to function. But this hack is dangerous and causes other issues. What I can tell you is that `gatt_notify_cb` continues to run without blocking.

    /* ⚠️ Danger: forcibly drop refs on all live bufs in the pool. */
    static void force_free_pool(struct net_buf_pool *pool)
    {
        unsigned int key = irq_lock();   /* reduce races for this experiment */
        struct net_buf *b = pool->__bufs;
    
        for (int i = 0; i < pool->buf_count; i++, b++)
        {
            if (b->len)
            {
                while (b->ref > 0) {
                    LOG_WRN("FORCE unref buf=%p ref=%d len=%u", b, b->ref, b->len);
                    net_buf_unref(b);
                }
            }
        }
        irq_unlock(key);
    }
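A less destructive way to answer the same "leak or stalled consumer" question is to only observe how long each buffer stays referenced, without ever calling net_buf_unref() on it. The sketch below is plain C against a simplified stand-in for a pool slot, not Zephyr's real `struct net_buf`; the `slot_view`, `slot_audit`, and `audit_*` names are hypothetical, and copying the reference counts out of the real pool into `slot_view` is left to the application.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one pool slot; NOT Zephyr's struct net_buf. */
struct slot_view {
    uint8_t ref;  /* current reference count, 0 means the slot is free */
};

/* Per-slot audit state kept by the application (hypothetical). */
struct slot_audit {
    int64_t held_since_ms;  /* when the slot was first seen in use, -1 if free */
};

static void audit_init(struct slot_audit *audit, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        audit[i].held_since_ms = -1;
    }
}

/*
 * Sample the pool once and return how many slots have been seen in use
 * for at least max_hold_ms. The pool itself is only read, never modified.
 */
static size_t audit_sample(const struct slot_view *slots,
                           struct slot_audit *audit, size_t count,
                           int64_t now_ms, int64_t max_hold_ms)
{
    size_t suspects = 0;

    for (size_t i = 0; i < count; i++) {
        if (slots[i].ref == 0) {
            audit[i].held_since_ms = -1;  /* slot is free, reset its clock */
            continue;
        }
        if (audit[i].held_since_ms < 0) {
            audit[i].held_since_ms = now_ms;  /* first time seen in use */
        }
        if (now_ms - audit[i].held_since_ms >= max_hold_ms) {
            suspects++;
        }
    }
    return suspects;
}
```

One limitation of sampling: a slot that is freed and reallocated between two samples looks continuously held, so the sample period should be short relative to `max_hold_ms`, and a non-zero result is a hint to investigate rather than proof of a leak.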
