Assert fires when disconnecting while data is transmitted via L2CAP

I have a nRF5340 using nrf connect v3.7.99-nc1 that is in a peripheral role. I establish a l2cap connection to the peripheral, to request a larger amount of data. That works quite ok until I close that connection from the receiving side while still some l2cap SDUs are on the flight. An asserts fires from within a thread (id 12) that I did not spawned:

(gdb) info thread
  Id   Target Id                                      Frame 
  2    Thread 536977456 (main PENDING PRIO 0)         __set_BASEPRI (basePri=0)
    at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:1316
  3    Thread 536977640 (idle UNKNOWN PRIO 15)        __ISB () at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:261
  4    Thread 536974176 (logging PENDING PRIO 14)     __set_BASEPRI (basePri=0)
    at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:1316
  5    Thread 536974360 (usbworkq PENDING PRIO 255)   __ISB () at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:260
  6    Thread 536977824 (sysworkq QUEUED PRIO 255)    __ISB () at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:260
  7    Thread 536974576 (BT LW WQ PENDING PRIO 10)    __ISB () at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:260
  8    Thread 536977192 (usbd_workq PENDING PRIO 255) __ISB () at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:260
  9    Thread 536976856 (rx_q[0] PENDING PRIO 255)    __ISB () at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:260
  10   Thread 536974792 (BT RX WQ PENDING PRIO 248)   __set_BASEPRI (basePri=0)
    at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:1316
  11   Thread 536908344 (mbox_wq #0 QUEUED PRIO 0)    __set_BASEPRI (basePri=0)
    at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:1316
* 12   Remote target                                  0x0009c604 in arch_system_halt (reason=4) at /Users/todi/bloomlife/zephyr-workspace/zephyr/kernel/fatal.c:30

The assert message is:  `ASSERTION FAIL [net_buf_simple_headroom(buf) >= len] @ WEST_TOPDIR/zephyr/lib/net_buf/buf_simple.c:341`

The call stack is as follows:

Thread 12 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Remote target]
0x0009c604 in arch_system_halt (reason=4) at /Users/todi/bloomlife/zephyr-workspace/zephyr/kernel/fatal.c:30
30		for (;;) {
(gdb) bt
#0  0x0009c604 in arch_system_halt (reason=4) at /Users/todi/bloomlife/zephyr-workspace/zephyr/kernel/fatal.c:30
#1  0x000653c8 in k_sys_fatal_error_handler (reason=4, esf=0x20035608 <sys_work_q_stack+3344>) at /Users/todi/bloomlife/zephyr-workspace/zephyr/kernel/fatal.c:44
#2  0x000655c2 in z_fatal_error (reason=4, esf=0x20035608 <sys_work_q_stack+3344>) at /Users/todi/bloomlife/zephyr-workspace/zephyr/kernel/fatal.c:119
#3  0x00022ad0 in z_arm_fatal_error (reason=4, esf=0x20035608 <sys_work_q_stack+3344>) at /Users/todi/bloomlife/zephyr-workspace/zephyr/arch/arm/core/fatal.c:86
#4  0x00084580 in z_do_kernel_oops (esf=0x20035608 <sys_work_q_stack+3344>, callee_regs=0x2)
    at /Users/todi/bloomlife/zephyr-workspace/zephyr/arch/arm/core/fatal.c:131
#5  0x00023dae in _oops () at /Users/todi/bloomlife/zephyr-workspace/zephyr/arch/arm/core/cortex_m/swap_helper.S:318
#6  <signal handler called>
#7  assert_post_action (file=0xa34e4 "WEST_TOPDIR/zephyr/lib/net_buf/buf_simple.c", line=341) at /Users/todi/bloomlife/zephyr-workspace/zephyr/lib/os/assert.c:44
#8  0x0002574c in net_buf_simple_push (buf=0x20036904 <drivers::details::_net_buf_mqtt_sn_messages_over_l2cap+12>, len=4)
    at /Users/todi/bloomlife/zephyr-workspace/zephyr/lib/net_buf/buf_simple.c:341
#9  0x00088fa8 in net_buf_push (buf=0x200368f8 <drivers::details::_net_buf_mqtt_sn_messages_over_l2cap>, len=4)
    at /Users/todi/bloomlife/zephyr-workspace/zephyr/include/zephyr/net_buf.h:2058
#10 0x0003355a in l2cap_data_pull (conn=0x200197d8 <acl_conns>, amount=4294967295, length=0x2003577c <sys_work_q_stack+3716>)
    at /Users/todi/bloomlife/zephyr-workspace/zephyr/subsys/bluetooth/host/l2cap.c:946
#11 0x00030790 in bt_conn_tx_processor () at /Users/todi/bloomlife/zephyr-workspace/zephyr/subsys/bluetooth/host/conn.c:1062
#12 0x0002c448 in tx_processor (item=0x20002a10 <tx_work>) at /Users/todi/bloomlife/zephyr-workspace/zephyr/subsys/bluetooth/host/hci_core.c:4743
#13 0x00067d14 in work_queue_main (workq_ptr=0x2001a1a0 <k_sys_work_q>, p2=0x0, p3=0x0) at /Users/todi/bloomlife/zephyr-workspace/zephyr/kernel/work.c:688
#14 0x0007e910 in z_thread_entry (entry=0x67b91 <work_queue_main>, p1=0x2001a1a0 <k_sys_work_q>, p2=0x0, p3=0x0)
    at /Users/todi/bloomlife/zephyr-workspace/zephyr/lib/os/thread_entry.c:48
#15 0xfffffffe in ?? ()

When inspecting the buffer that caused the assert, it turns out, that `len` and `size` have values that I would expect, while `data` and `__buf` are set to 0!:

#8  0x0002574c in net_buf_simple_push (buf=0x20036904 <drivers::details::_net_buf_mqtt_sn_messages_over_l2cap+12>, len=4)
    at /Users/todi/bloomlife/zephyr-workspace/zephyr/lib/net_buf/buf_simple.c:341
341		__ASSERT_NO_MSG(net_buf_simple_headroom(buf) >= len);
(gdb) p/x *buf
$1 = {data = 0x0, len = 0x49a, size = 0x1028, __buf = 0x0}

Could it be, that there is a bug in the l2cap layer for this corner case? I see no way to prevent that on my side. I there a unit test, that would cover this situation?

best regards

Torsten

Parents
  • Hi Torsten,

    When inspecting the buffer that caused the assert, it turns out, that `len` and `size` have values that I would expect, while `data` and `__buf` are set to 0!:

    Can you please elaborate on this. The assert triggered means that net_buf_simple_headroom(buf) was less than the len. If you think there should have been enough space in that buffer, then we need to investigate how the buffer got full or why this API gave wrong info. Could be stackoverflow symptoms?

  • Hi  ,

    there was for sure enough headroom for the l2cap protocol, when the l2cap message was added to the queue (using `bt_l2cap_chan_send()`), as I'm able to send thousands of l2cap SDUs without problem.

    The reason, why the calculation of the headroom results in 0 (and thus is small than len) is, because the headroom is defined to be the difference between `data` and `__buf`. As booth are 0 (which I think is most likely the reason for the failure), the difference and thus the headroom is 0.

    The pool for all messages are statically allocated and as `data` and `__buf` are the first and the last member, I would rule out some accidental overwriting. I looked into the buffer library but also did not found any single function, that sets both pointer to zero.

    As you can see from that call stack, the assert is triggered from a function that runs in a BLE stack internal thread. So there is hardly anything, I can do to prevent it.

    thanks and best regards

    Torsten

  • Hi Torsten,

    I unfortunately don't have anything to add to the discussion. I'm writing just to let you know that Susheel has unfortunately been sick, but he will return next week and continue working on this. Our apologies for the inconvenience.

    Best regards,

    Hieu

  • Wow, Torsten, seems like you have done some solid debugging already, sorry to hear about the pain caused by the tools on Mac. I am using windows and the tools work ok.

    I have not tried to reproduce this but it seems like theassertion failure occurs because the net_buf_simple_push() function in buf_simple.c is being called on an SDU that has already been marked as freed.

    Before I dive into this, I need to able to reproduce this.

  • Unfortunately, I was not able to reproduce this with a small example. In `l2cap_data_pull`, at the beginning, the next PDU is pulled out of the transit queue, using `k_fifo_peek_head`, which leaves the PDU on the queue. Later in the same function, if `last_frag` and `last_seg`, `k_fifo_get` is used to actually remove the PDU from the queue. I've a PDU is returned but not removed from the queue, the trouble begins (same PDU is pulled twice from the queue. First time it was marked as freed, second time, the assert will fire).

    So my last resort quick fix for our upcoming release will be, to change the function to use `k_fifo_get` also, when the connection state is indicating that the connection is closed. But I don't know if that has other side effects.

  • I understand your concern, the side effect might be that there could be a double freeing of PDU if you do this wrong. Consider using k_fifo_peek_head first not to affect the queue while enquiring stuff.

  • Hi  

    did you made any progress on this issue? Are you able to reproduce this?

    best regards

    Torsten

Reply Children
  • Torsten, Sorry no I did not try to reproduce this on my end.

    Normal process is that we get a minimalistic project from the users who have issues so that we can spend our time in debugging and not constructing the project and its configuration. It happened so many times that we spend huge amount of times trying to get close to the user setup, but the time we spend on constructing the environment does not guarantee that we are still able to reproduce. Best is that you give a minimalistic project for me to reproduce and i can debug and take it to the dev team.

  • Hi Susheel,

    OK, I was missing that information and that's why I also asked for unit tests. If there where unit tests, either that case could be covered by them, or it might be relatively easy to create a test for that case if similar tests already exists.

    Anyway: I will put that as a TODO on our side and see, when we are able to work further on the reproducer.

    best regards

    Torsten

  • Hi  ,

    please find the reproducer here: https://github.com/TorstenRobitzki/l2cap_reproducer

    It is bases on one of the official examples (GATT CGM). It accepts a L2CAP connection and once connected, tries to send as much 4KByte large SDU as possible.

    There is a python script, that used a HCI controller on a nrf52 eval board, to connect to the reproducer.

    Once, the HCI controller, is powered off, the connection times out and the L2CAP layer on the reproducer hits the assert.

    Let me know, when you have further questions or need help with the setup.

    best regards

    Torsten

  • This bug-fix might fix the problem:

    --- a/subsys/bluetooth/host/l2cap.c
    +++ b/subsys/bluetooth/host/l2cap.c
    @@ -959,8 +959,9 @@ struct net_buf *l2cap_data_pull(struct bt_conn *conn,
             * static channels don't have the concept of L2CAP segments.
             */
            bool last_seg = lechan->_pdu_remaining == pdu->len;
    +       bool sdu_end = ( last_frag && last_seg ) || conn->state != BT_CONN_CONNECTED;
     
    -       if (last_frag && last_seg) {
    +       if (sdu_end) {
                    LOG_DBG("last frag of last seg, dequeuing %p", pdu);
                    __maybe_unused struct net_buf *b = k_fifo_get(&lechan->tx_queue, K_NO_WAIT);
     
    @@ -968,7 +969,6 @@ struct net_buf *l2cap_data_pull(struct bt_conn *conn,
            }
     
            if (last_frag && L2CAP_LE_CID_IS_DYN(lechan->tx.cid)) {
    -               bool sdu_end = last_frag && last_seg;
     
                    LOG_DBG("adding %s callback", sdu_end ? "`sdu_sent`" : "NULL");
                    /* No user callbacks for SDUs */
    

Related