System workqueue lock-up when waiting for bt_gatt_read callback

Hi,

I'm working on an application where I want to read a couple of characteristics to get information about the connected device before continuing doing useful stuff. I had a working implementation with ncs v2.5.0, which looked something like this (pss is our custom service, based on implementations found in ncs):

static void discovery_complete(struct bt_gatt_dm *dm, void *context)
{
    LOG_INF("Discovery completed");
    pss_client_handles_assign(dm, &sensor->pss_client);
    bt_gatt_dm_data_release(dm);
    
    k_work_submit(&sensor->deferred_work);
}

static void deferred_work_handler(struct k_work *work)
{

    struct pss_client *pss_client = &sensor->pss_client;

    // Read the UUID of the sensor
    struct bt_gatt_read_params *params = &pss_client->uuid_params;
    params->func = on_uuid_received;
    params->handle_count = 1;
    params->single.offset = 0;
    params->single.handle = pss_client->handles.uuid_handle;

    int err = bt_gatt_read(pss_client->conn, params);
    if (err)
    {
        LOG_ERR("Failed to read UUID (err %d)", err);
    }

    err = k_sem_take(&sem_rsp, K_MSEC(3000));
    if (err == -EAGAIN)
    {
        LOG_ERR("Failed to read UUID of the sensor (err %d)", err);
        return;
    }

     
    // Start doing something useful
}

static void on_uuid_received(struct pss_client *pss_client, uint64_t uuid)
{
    // Simplified
    LOG_INF("Sensor UUID: %016llx", uuid)
    k_sem_give(&sem_rsp);
}

However, after updating our project to ncs v3.2.2, this seems to break. I still get the "Discovery completed" log, but then I see error messages indicating that the uuid read failed due to the semaphore timing out. Interestingly, I do still see the UUID being printed immediately after that. So it seems that there is some kind of lock-up happening due to my waiting on the semaphore in the workqueue. I was under the impression that the bt_gatt_read_func_t callback function would not be called in the context of the systemworkq, and would therefore not be blocked by waiting in the workq, but it seems that that is the case now as I only get the uuid printed after the semaphore expired. I noticed in my .config file that CONFIG_SYSTEM_WORKQUEUE_PRIORITY is now part of the "Bluetooth Services", whereas for my build with ncs v2.5.0 this was part of the "System Work Queue Options". So it seems something changed to the way how the bluetooth services use the system workq.

I did some further digging, and I found I get things to work again by setting up a dedicated workq with a priority below that of the systemworkq (I took a priority of 3 for testing), which seems to reinforce that the issue is due to a lock-up in the system workq. Weirdly, I also tried doing the deferred_work directly in the discovery_complete function (instead of submitting work) after I learned that the BT_GATT_DM has its own workq since ncs v2.8.0, but that didn't work.

While I did find a workaround, I'm also keen on understanding what changed exactly in the way characteristics are read, to break my original implementation. That way, I can anticipate if I need to change other parts of my code as well.

Best,

Wout

Parents
  • Hi,

    I have not (at least yet) found a change which could fully explain what you see, related to the way GATT reads are done, but it might boil down to priorities or change of workqueue used (by the stack) or the priorities of those workqueues. In particular since you use bt_gatt_read(), waiting for its callback which gets called in BT_RECV_CONTEXT context. From what I understand, what you see is basically that this callback gets blocked?

    In order to narrow it down, it would be interesting to know what SoC you use, as well as any kconfigs related to workqueues and priorities (such as e.g. CONFIG_BT_GATT_DM_WORKQ_SYS v.s. CONFIG_BT_GATT_DM_WORKQ_OWN, as well as CONFIG_BT_GATT_DM_WORKQ_PRIO, and any other *WORKQ* configs that might touch on the Bluetooth stack or the workqueues that you use from your application.)

    Regards,
    Terje

  • Hi tesc,

    Sorry for the late reply, priorities shifted to some other project for a while. But I just got thinking about this issue while browsing the known issues, and I wonder if this is related to NCSDK-31528 as it seems to match with my description I don't have CONFIG_BT_HCI_ACL_FLOW_CONTROL enabled. I'll try enabling this when I have some time to confirm if this was the case.

Reply
  • Hi tesc,

    Sorry for the late reply, priorities shifted to some other project for a while. But I just got thinking about this issue while browsing the known issues, and I wonder if this is related to NCSDK-31528 as it seems to match with my description I don't have CONFIG_BT_HCI_ACL_FLOW_CONTROL enabled. I'll try enabling this when I have some time to confirm if this was the case.

Children
  • Seems like this wasn't the issue, enabling CONFIG_BT_HCI_ACL_FLOW_CONTROL didn't fix it.

    This is on the nRF54L15. I don't have any WORKQ configs in my prj.conf, looking at the ouput .config, these are the relevant WORKQ settings:

    CONFIG_BT_GATT_DM_WORKQ_OWN=y
    # CONFIG_BT_GATT_DM_WORKQ_SYS is not set
    CONFIG_BT_GATT_DM_WORKQ_STACK_SIZE=1300
    CONFIG_BT_GATT_DM_WORKQ_PRIO=10
    CONFIG_BT_GATT_DM_WORKQ_INIT_PRIO=50
    CONFIG_SYSTEM_WORKQUEUE_PRIORITY=-1 (part of Bluetooth Services, as I mentioned before)
    # CONFIG_BT_RECV_WORKQ_SYS is not set
    CONFIG_BT_RECV_WORKQ_BT=y
    Do you see anything here that doesn't make sense?
Related