l2cap_data_pull hard fault on disconnect with queued tx data

denis 5 months ago

When using L2CAP to stream data at a high rate from the device to an app, we are seeing a hard fault in l2cap_data_pull when L2CAP send data is still queued.

When the fault occurs in l2cap_data_pull, conn is in the disconnected state. It appears that the pdu->data pointer is 0. net_buf_push then moves pdu->data back by sizeof(*hdr), which yields 0xfffffffc. Dereferencing that address causes the hard fault:

hdr = net_buf_push(pdu, sizeof(*hdr));

hdr->len = sys_cpu_to_le16(pdu_len);
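
The arithmetic behind the 0xfffffffc address can be reproduced off-target. The sketch below is a simplified model, not Zephyr's net_buf code: model_buf, model_l2cap_hdr, model_push and model_fault_address are hypothetical names, the 32-bit address is held in a uint32_t so the wrap-around is well defined on any host, and the header is assumed to be 4 bytes (a 16-bit length plus a 16-bit channel ID).

```c
#include <stdint.h>

/* Hypothetical stand-in for the L2CAP basic header: 2-byte len + 2-byte cid. */
struct model_l2cap_hdr {
	uint16_t len;
	uint16_t cid;
};

/* Hypothetical stand-in for a buffer; 'data' models a 32-bit address. */
struct model_buf {
	uint32_t data;
	uint16_t len;
};

/* Models a "push": move the data pointer back to make room for a header. */
static uint32_t model_push(struct model_buf *buf, uint16_t hdr_size)
{
	buf->data -= hdr_size; /* unsigned wrap-around when data == 0 */
	buf->len += hdr_size;
	return buf->data;
}

/* Address that would be written through if the data pointer were 0. */
static uint32_t model_fault_address(void)
{
	struct model_buf pdu = { .data = 0u, .len = 0u };

	return model_push(&pdu, (uint16_t)sizeof(struct model_l2cap_hdr));
}
```

With a zero data pointer the model returns 0xfffffffc, which matches the faulting address reported above.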

Using nRF Connect SDK 2.9.0 and nRF5340.

Should there be a check for conn status disconnected? Or for pdu->data == NULL? Or is this a race condition?

Amanda Hsieh 5 months ago

Hi,

The race condition could be in your application. Please check whether you already do something like the linked code snippet, which avoids calling APIs with the conn pointer while the conn pointer is being unreferenced in the disconnected callback:

https://github.com/nrfconnect/sdk-zephyr/blob/v3.7.99-ncs1/samples/bluetooth/central_gatt_write/src/central_gatt_write.c#L104-L116

Regards,
Amanda H.

denis 5 months ago in reply to Amanda Hsieh

The bt_l2cap_chan_send function that we use checks the channel connection state. Is that not adequate?

Amanda Hsieh 5 months ago in reply to denis

If your application calls bt_conn_unref() from multiple threads or callbacks on a global conn pointer that the application has declared, then every thread or callback that uses this global conn pointer needs to take its own reference (call bt_conn_ref()) to ensure that another thread cannot release the connection while it is still in use.
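
The pattern described here can be sketched with a small stand-alone model. This is not the Zephyr implementation: model_conn, model_conn_ref, model_conn_unref and model_send are hypothetical names, and the only behaviour assumed from the real API is that taking a reference fails (returns NULL) once the count has reached zero.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of a reference-counted connection object. */
struct model_conn {
	atomic_int ref;
};

/* Take a reference; returns NULL if the object has already been released. */
static struct model_conn *model_conn_ref(struct model_conn *conn)
{
	int old = atomic_load(&conn->ref);

	do {
		if (old == 0) {
			return NULL;
		}
	} while (!atomic_compare_exchange_weak(&conn->ref, &old, old + 1));

	return conn;
}

/* Drop a reference previously taken with model_conn_ref(). */
static void model_conn_unref(struct model_conn *conn)
{
	atomic_fetch_sub(&conn->ref, 1);
}

/* A sender takes its own reference before use and drops it afterwards. */
static int model_send(struct model_conn *shared)
{
	struct model_conn *conn = (shared != NULL) ? model_conn_ref(shared) : NULL;

	if (conn == NULL) {
		return -1; /* connection already gone, nothing sent */
	}

	/* ... use conn here; the count cannot reach zero until we unref ... */

	model_conn_unref(conn);
	return 0;
}
```

In this model a send after the last unref is refused instead of touching a released object, which is the property the extra reference is meant to provide.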

denis 5 months ago in reply to Amanda Hsieh

We only call bt_conn_ref and hold the conn in one (volatile) global when the connected callback of struct bt_conn_cb is called. That global is zeroed and then bt_conn_unref is called from struct bt_conn_cb disconnected callback. The conn is not referenced when sending. The struct bt_l2cap_le_chan is used when sending. There is only one place that calls bt_l2cap_chan_send. I added a check there that the conn is not zero and it is never zero when bt_l2cap_chan_send is called.

denis 5 months ago in reply to denis

I also added a check that struct bt_l2cap_le_chan chan_ops disconnected has not been called before we call bt_l2cap_chan_send.

So I think the send data is queued up in the BLE stack before the disconnect happens.
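
The ordering suspected here can be illustrated with a toy queue. Everything in this sketch is hypothetical (model_chan and its functions are not Zephyr APIs): it only shows that a state check at send time does not cover entries that are pulled later, and it includes the kind of defensive check asked about in the first post. It is not the actual l2cap_data_pull logic or a confirmed fix.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_QUEUE_LEN 4

/* Hypothetical model of a channel with a small TX queue. */
struct model_chan {
	bool connected;
	const void *queue[MODEL_QUEUE_LEN];
	size_t head;
	size_t count;
};

/* Application-side send: the state check happens here, at enqueue time. */
static int model_chan_send(struct model_chan *chan, const void *data)
{
	if (!chan->connected || chan->count == MODEL_QUEUE_LEN) {
		return -1;
	}
	chan->queue[(chan->head + chan->count) % MODEL_QUEUE_LEN] = data;
	chan->count++;
	return 0;
}

/* Disconnect only flips the state; queued entries stay behind. */
static void model_chan_disconnect(struct model_chan *chan)
{
	chan->connected = false;
}

/* Stack-side pull, run later: dequeues one entry and validates it. */
static const void *model_chan_pull(struct model_chan *chan)
{
	const void *data;

	if (chan->count == 0) {
		return NULL;
	}
	data = chan->queue[chan->head];
	chan->head = (chan->head + 1) % MODEL_QUEUE_LEN;
	chan->count--;

	/* Hypothetical guard: drop the entry if the link went away meanwhile. */
	if (!chan->connected || data == NULL) {
		return NULL;
	}
	return data;
}
```

In the model, a buffer accepted while connected is still in the queue after model_chan_disconnect(), so the pull side sees it even though every send was checked.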

denis 5 months ago in reply to denis

To check whether bt_conn_unref was the issue, I commented it out as a test. The same hard fault in l2cap_data_pull occurs, so it doesn't seem to be an issue with an early unref.

Amanda Hsieh 5 months ago in reply to denis

Could you try the suggested code snippet?

denis 5 months ago in reply to Amanda Hsieh

I can't do that exactly because I'm using the l2cap calls, but I did test adding a ref/unref around the send and the same hard fault still happens.

struct bt_conn *conn = bt_conn_ref(conn_connected);

int result = bt_l2cap_chan_send(&pt_ble.l2cap.le_chan.chan, buf);

bt_conn_unref(conn);

Amanda Hsieh 4 months ago in reply to denis

Hi,

It might be a race condition between l2cap_data_pull and bt_l2cap_chan_del or l2cap_chan_shutdown. It should not happen, since l2cap_data_pull runs on the system work queue, which the Bluetooth subsystem Kconfig forces to be non-preemptible. Have you overridden this restriction and made the system work queue preemptible?
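
One way to make that assumption explicit is a compile-time check on the work queue priority. The sketch below is stand-alone and hedged: it assumes a default of -1 for CONFIG_SYSTEM_WORKQUEUE_PRIORITY when the symbol is not supplied by the build, relies on Zephyr's convention that negative thread priorities are cooperative, and uses a hypothetical helper name, model_prio_is_cooperative.

```c
#include <stdbool.h>

/* Assumed default; in a real build this value comes from Kconfig. */
#ifndef CONFIG_SYSTEM_WORKQUEUE_PRIORITY
#define CONFIG_SYSTEM_WORKQUEUE_PRIORITY (-1)
#endif

/* In Zephyr, negative thread priorities are cooperative (non-preemptible). */
static bool model_prio_is_cooperative(int prio)
{
	return prio < 0;
}

/* Fails the build if the system work queue were made preemptible. */
_Static_assert(CONFIG_SYSTEM_WORKQUEUE_PRIORITY < 0,
	       "system work queue must be cooperative");
```

In a Zephyr application the same check would typically be written with BUILD_ASSERT instead of the plain C11 keyword.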

denis 4 months ago in reply to Amanda Hsieh

We have not modified the SYSTEM_WORKQUEUE_PRIORITY configuration.

Amanda Hsieh 4 months ago in reply to denis
Could you try to surround bt_l2cap_dyn_chan_send in a k_sched_lock as a step toward debugging? This makes the thread temporarily cooperative. It can be done by applying the attached diff to NCS v2.9.0.

bt_l2cap_chan_send-k_sched_lock.patch.txt

#!/usr/bin/env -S git apply
# Attachment to Nordic Devzone https://devzone.nordicsemi.com/f/nordic-q-a/118471/l2cap_data_pull-hard-fault-on-disconnect-with-queued-tx-data
diff --git a/subsys/bluetooth/host/l2cap.c b/subsys/bluetooth/host/l2cap.c
index ed185d12fd7..58bb4cf13aa 100644
--- a/subsys/bluetooth/host/l2cap.c
+++ b/subsys/bluetooth/host/l2cap.c
@@ -3283,7 +3283,18 @@ static int bt_l2cap_dyn_chan_send(struct bt_l2cap_le_chan *le_chan, struct net_b
 	return 0;
 }
 
-int bt_l2cap_chan_send(struct bt_l2cap_chan *chan, struct net_buf *buf)
+static int bt_l2cap_chan_send_(struct bt_l2cap_chan *chan, struct net_buf *buf);
+int bt_l2cap_chan_send(struct bt_l2cap_chan *chan, struct net_buf *buf)
+{
+	int err;
+
+	k_sched_lock();
+	err = bt_l2cap_chan_send_(chan, buf);
+	k_sched_unlock();
+
+	return err;
+}
+static int bt_l2cap_chan_send_(struct bt_l2cap_chan *chan, struct net_buf *buf)
 {
 	if (!buf || !chan) {
 		return -EINVAL;

Please let me know whether this works.
