Assert fires when disconnecting while data is transmitted via L2CAP

I have a nRF5340 using nrf connect v3.7.99-nc1 that is in a peripheral role. I establish a l2cap connection to the peripheral, to request a larger amount of data. That works quite ok until I close that connection from the receiving side while still some l2cap SDUs are on the flight. An asserts fires from within a thread (id 12) that I did not spawned:

(gdb) info thread
  Id   Target Id                                      Frame 
  2    Thread 536977456 (main PENDING PRIO 0)         __set_BASEPRI (basePri=0)
    at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:1316
  3    Thread 536977640 (idle UNKNOWN PRIO 15)        __ISB () at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:261
  4    Thread 536974176 (logging PENDING PRIO 14)     __set_BASEPRI (basePri=0)
    at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:1316
  5    Thread 536974360 (usbworkq PENDING PRIO 255)   __ISB () at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:260
  6    Thread 536977824 (sysworkq QUEUED PRIO 255)    __ISB () at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:260
  7    Thread 536974576 (BT LW WQ PENDING PRIO 10)    __ISB () at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:260
  8    Thread 536977192 (usbd_workq PENDING PRIO 255) __ISB () at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:260
  9    Thread 536976856 (rx_q[0] PENDING PRIO 255)    __ISB () at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:260
  10   Thread 536974792 (BT RX WQ PENDING PRIO 248)   __set_BASEPRI (basePri=0)
    at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:1316
  11   Thread 536908344 (mbox_wq #0 QUEUED PRIO 0)    __set_BASEPRI (basePri=0)
    at /Users/todi/bloomlife/zephyr-workspace/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:1316
* 12   Remote target                                  0x0009c604 in arch_system_halt (reason=4) at /Users/todi/bloomlife/zephyr-workspace/zephyr/kernel/fatal.c:30

The assert message is:  `ASSERTION FAIL [net_buf_simple_headroom(buf) >= len] @ WEST_TOPDIR/zephyr/lib/net_buf/buf_simple.c:341`

The call stack is as follows:

Thread 12 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Remote target]
0x0009c604 in arch_system_halt (reason=4) at /Users/todi/bloomlife/zephyr-workspace/zephyr/kernel/fatal.c:30
30		for (;;) {
(gdb) bt
#0  0x0009c604 in arch_system_halt (reason=4) at /Users/todi/bloomlife/zephyr-workspace/zephyr/kernel/fatal.c:30
#1  0x000653c8 in k_sys_fatal_error_handler (reason=4, esf=0x20035608 <sys_work_q_stack+3344>) at /Users/todi/bloomlife/zephyr-workspace/zephyr/kernel/fatal.c:44
#2  0x000655c2 in z_fatal_error (reason=4, esf=0x20035608 <sys_work_q_stack+3344>) at /Users/todi/bloomlife/zephyr-workspace/zephyr/kernel/fatal.c:119
#3  0x00022ad0 in z_arm_fatal_error (reason=4, esf=0x20035608 <sys_work_q_stack+3344>) at /Users/todi/bloomlife/zephyr-workspace/zephyr/arch/arm/core/fatal.c:86
#4  0x00084580 in z_do_kernel_oops (esf=0x20035608 <sys_work_q_stack+3344>, callee_regs=0x2)
    at /Users/todi/bloomlife/zephyr-workspace/zephyr/arch/arm/core/fatal.c:131
#5  0x00023dae in _oops () at /Users/todi/bloomlife/zephyr-workspace/zephyr/arch/arm/core/cortex_m/swap_helper.S:318
#6  <signal handler called>
#7  assert_post_action (file=0xa34e4 "WEST_TOPDIR/zephyr/lib/net_buf/buf_simple.c", line=341) at /Users/todi/bloomlife/zephyr-workspace/zephyr/lib/os/assert.c:44
#8  0x0002574c in net_buf_simple_push (buf=0x20036904 <drivers::details::_net_buf_mqtt_sn_messages_over_l2cap+12>, len=4)
    at /Users/todi/bloomlife/zephyr-workspace/zephyr/lib/net_buf/buf_simple.c:341
#9  0x00088fa8 in net_buf_push (buf=0x200368f8 <drivers::details::_net_buf_mqtt_sn_messages_over_l2cap>, len=4)
    at /Users/todi/bloomlife/zephyr-workspace/zephyr/include/zephyr/net_buf.h:2058
#10 0x0003355a in l2cap_data_pull (conn=0x200197d8 <acl_conns>, amount=4294967295, length=0x2003577c <sys_work_q_stack+3716>)
    at /Users/todi/bloomlife/zephyr-workspace/zephyr/subsys/bluetooth/host/l2cap.c:946
#11 0x00030790 in bt_conn_tx_processor () at /Users/todi/bloomlife/zephyr-workspace/zephyr/subsys/bluetooth/host/conn.c:1062
#12 0x0002c448 in tx_processor (item=0x20002a10 <tx_work>) at /Users/todi/bloomlife/zephyr-workspace/zephyr/subsys/bluetooth/host/hci_core.c:4743
#13 0x00067d14 in work_queue_main (workq_ptr=0x2001a1a0 <k_sys_work_q>, p2=0x0, p3=0x0) at /Users/todi/bloomlife/zephyr-workspace/zephyr/kernel/work.c:688
#14 0x0007e910 in z_thread_entry (entry=0x67b91 <work_queue_main>, p1=0x2001a1a0 <k_sys_work_q>, p2=0x0, p3=0x0)
    at /Users/todi/bloomlife/zephyr-workspace/zephyr/lib/os/thread_entry.c:48
#15 0xfffffffe in ?? ()

When inspecting the buffer that caused the assert, it turns out, that `len` and `size` have values that I would expect, while `data` and `__buf` are set to 0!:

#8  0x0002574c in net_buf_simple_push (buf=0x20036904 <drivers::details::_net_buf_mqtt_sn_messages_over_l2cap+12>, len=4)
    at /Users/todi/bloomlife/zephyr-workspace/zephyr/lib/net_buf/buf_simple.c:341
341		__ASSERT_NO_MSG(net_buf_simple_headroom(buf) >= len);
(gdb) p/x *buf
$1 = {data = 0x0, len = 0x49a, size = 0x1028, __buf = 0x0}

Could it be, that there is a bug in the l2cap layer for this corner case? I see no way to prevent that on my side. I there a unit test, that would cover this situation?

best regards

Torsten

Parents
  • Hi Torsten,

    When inspecting the buffer that caused the assert, it turns out, that `len` and `size` have values that I would expect, while `data` and `__buf` are set to 0!:

    Can you please elaborate on this. The assert triggered means that net_buf_simple_headroom(buf) was less than the len. If you think there should have been enough space in that buffer, then we need to investigate how the buffer got full or why this API gave wrong info. Could be stackoverflow symptoms?

  • Hi  ,

    there was for sure enough headroom for the l2cap protocol, when the l2cap message was added to the queue (using `bt_l2cap_chan_send()`), as I'm able to send thousands of l2cap SDUs without problem.

    The reason, why the calculation of the headroom results in 0 (and thus is small than len) is, because the headroom is defined to be the difference between `data` and `__buf`. As booth are 0 (which I think is most likely the reason for the failure), the difference and thus the headroom is 0.

    The pool for all messages are statically allocated and as `data` and `__buf` are the first and the last member, I would rule out some accidental overwriting. I looked into the buffer library but also did not found any single function, that sets both pointer to zero.

    As you can see from that call stack, the assert is triggered from a function that runs in a BLE stack internal thread. So there is hardly anything, I can do to prevent it.

    thanks and best regards

    Torsten

  • Hi  ,

    I added this issue to our ticket system. Currently we are shortly before rolling out our product to test customers and it might take some time, before someone (most likely me) is able to pick that ticket up and to create a reproducer.

    best regards

    Torsten

  • Torsten,

    Got it, please write here when you pick this task again and able to help me to reproduce the issue. Good luck with your product test launch.

  • Hi  ,

    we discovered, that in our test environment we must expect BLE disconnects due to the harsh environment, which would then crash our application.

    I've started to write a reproducer. I had two options to go, either strip down our application to a bare minimum, that works on a eval board, or start from scratch (the CGMS from nrf/samples/bluetooth).

    Unfortunately, I toke the second road and now I'm stuck with the configuration. I get Kconfig errors from generated code. I tried to strip down the configuration to a bare minimum. However:

    rm -fr ./build && ZEPHYR_BASE=$(pwd)/../zephyr ZEPHYR_NRF_MODULE_DIR=$(pwd)/../nrf west build  -b"nrf5340dk/nrf5340/cpuapp" --build-dir ./build --sysbuild  .

    results in:

    -- west build: generating a build system
    Loading Zephyr module(s) (Zephyr base): sysbuild_default
    -- Found Python3: /opt/homebrew/opt/[email protected]/bin/python3.11 (found suitable version "3.11.7", minimum required is "3.8") found components: Interpreter 
    -- Cache files will be written to: /Users/todi/Library/Caches/zephyr
    -- Found west (found suitable version "1.2.0", minimum required is "0.14.0")
    -- Board: nrf5340dk, qualifiers: nrf5340/cpuapp
    Parsing /Users/todi/bloomlife/l2cap_reproducer/zephyr/share/sysbuild/Kconfig
    Loaded configuration '/Users/todi/bloomlife/l2cap_reproducer/l2cap_reproducer/build/_sysbuild/empty.conf'
    Merged configuration '/Users/todi/bloomlife/l2cap_reproducer/l2cap_reproducer/sysbuild.conf'
    Configuration saved to '/Users/todi/bloomlife/l2cap_reproducer/l2cap_reproducer/build/zephyr/.config'
    Kconfig header saved to '/Users/todi/bloomlife/l2cap_reproducer/l2cap_reproducer/build/_sysbuild/autoconf.h'
    -- 
       **************************************
       * Running CMake for l2cap_reproducer *
       **************************************
    
    Loading Zephyr default modules (Zephyr base).
    -- Application: /Users/todi/bloomlife/l2cap_reproducer/l2cap_reproducer
    -- CMake version: 3.28.1
    -- Found Python3: /Users/todi/zephyrproject/.venv/bin/python (found suitable version "3.12.3", minimum required is "3.8") found components: Interpreter 
    -- Cache files will be written to: /Users/todi/Library/Caches/zephyr
    -- Zephyr version: 3.7.99 (/Users/todi/bloomlife/l2cap_reproducer/zephyr)
    -- Board: nrf5340dk, qualifiers: nrf5340/cpuapp
    -- ZEPHYR_TOOLCHAIN_VARIANT not set, trying to locate Zephyr SDK
    -- Found host-tools: zephyr 0.16.4 (/Users/todi/zephyr-sdk-0.16.4)
    -- Found toolchain: zephyr 0.16.4 (/Users/todi/zephyr-sdk-0.16.4)
    -- Found Dtc: /opt/homebrew/bin/dtc (found suitable version "1.7.0", minimum required is "1.4.6") 
    -- Found BOARD.dts: /Users/todi/bloomlife/l2cap_reproducer/zephyr/boards/nordic/nrf5340dk/nrf5340dk_nrf5340_cpuapp.dts
    -- Found devicetree overlay: /Users/todi/bloomlife/l2cap_reproducer/l2cap_reproducer/app.overlay
    -- Generated zephyr.dts: /Users/todi/bloomlife/l2cap_reproducer/l2cap_reproducer/build/l2cap_reproducer/zephyr/zephyr.dts
    -- Generated devicetree_generated.h: /Users/todi/bloomlife/l2cap_reproducer/l2cap_reproducer/build/l2cap_reproducer/zephyr/include/generated/zephyr/devicetree_generated.h
    -- Including generated dts.cmake file: /Users/todi/bloomlife/l2cap_reproducer/l2cap_reproducer/build/l2cap_reproducer/zephyr/dts.cmake
    
    /Users/todi/bloomlife/l2cap_reproducer/l2cap_reproducer/build/l2cap_reproducer/zephyr/.config.sysbuild:2: warning: attempt to assign the value 'y' to the undefined symbol PARTITION_MANAGER_ENABLED
    
    /Users/todi/bloomlife/l2cap_reproducer/l2cap_reproducer/build/l2cap_reproducer/zephyr/.config.sysbuild:5: warning: attempt to assign the value 'n' to the undefined symbol BT_FAST_PAIR
    
    /Users/todi/bloomlife/l2cap_reproducer/l2cap_reproducer/build/l2cap_reproducer/zephyr/.config.sysbuild:7: warning: attempt to assign the value 'n' to the undefined symbol CHIP
    ...
    ...
    ...
    warning: NRFX_CLOCK_LFXO_TWO_STAGE_ENABLED (defined at modules/hal_nordic/nrfx/Kconfig:24) has direct dependencies NRFX_CLOCK && HAS_NRFX && 0 with value n, but is currently being y-selected by the following symbols:
     - CLOCK_CONTROL_NRF_K32SRC_XTAL (defined at drivers/clock_control/Kconfig.nrf:36), with value y, direct dependencies <choice CLOCK_CONTROL_NRF_SOURCE> (value: y), and select condition !SOC_SERIES_BSIM_NRFXX && !CLOCK_CONTROL_NRF_FORCE_ALT && <choice CLOCK_CONTROL_NRF_SOURCE> (value: y)
    
    error: Aborting due to Kconfig warnings
    

    You can find the entire code here: https://github.com/TorstenRobitzki/l2cap_reproducer

    The version of the nRF SDK is set to v2.8.0 in west.yml

    Any idea, as to how to create a small stand alone application using nrf5340 without hacking it into the nrf source code (to be able to quickly switch SDK versions)?

  • Ok, got it. Guess, I hit this old bug in build system: devzone.nordicsemi.com/.../389463

    When running `west` and `west` is in the path and installed to be a script pointing to a specific python version, then it might happen, that some required libraries are not installed for that python version that comes with west, even when you use `venv` to setup a local python environment, that contains all required libraries. In the end, some python scripts are running, failing and the failing is ignored, leading to a mess, that is hard to debug.

    Maybe, never calling west directly, but using "python -m west" instead would be a recommendation to prevent such time consuming issues.

    I will get back to you, once, I have the reproducer running.

  • Hi  ,

    I think, I figured out, why the assert is firing. When a connection gets closed, bt_conn_tx_processor() in conn.c is destroying all pending SDUs on the transmit queue (line 1059). It pulls the SDUs from the queue, using l2cap_data_pull() from l2cap.c.

    That l2cap_data_pull() function not just pulls a SDU from the queue, but also adds PDU headers if necessary (that's what causes the assert later) and only removed the SDU from the queue, if the very last fragment of the SDU was transmitted.

    So, if the SDU has multiple fragments and not all fragments where send out, l2cap_data_pull() returns a SDU, but leaves the SDU on the transmit queue. A call to destroy_and_callback() will then set the data and __buf pointers to NULL. In the second iteration of the loop in conn.c, l2cap_data_pull() will be called again and will get the very same SDU from the queue. But this time, the __buf and data are set to NULL and thus, the call to net_buf_push() will fail with an assert.

    Without that assert enable, this will probably fail later with a dereferenced null pointer.

    It took me quite some time, to figure that out. My first attempt was to set watch points on the pointer that get set to zero. Maybe the watch points are not supported by nrf53, or they have to be set explicitly? And I had to restart the debugger for every run, as restarting the debug session ("monitor reset", "continue"), triggered and an assert in GNU gdb (Zephyr SDK 0.16.4) 12.1 (/Volumes/Workspace/build/.build/arm-zephyr-eabi/src/gdb/gdb/infrun.c:5825: internal-error: finish_step_over: Assertion `ecs->event_thread->control.trap_expected' failed.)

    Now, I've tried to figure out, why my example from above is not triggering the issue. But currently, I first have to figure out, why debugging is not working any more. While the very same setup worked a few days ago, now, when I start a debug session:

    Loading section .last_section, size 0x4 lma 0x60fb4
    Start address 0x0000bec4, load size 397218
    Transfer rate: 301 KB/sec, 7944 bytes/write.
    Resetting target
    (gdb) monitor flash breakpoints = 0
    Flash breakpoints disabled
    (gdb) b conn.c:1059
    Breakpoint 1 at 0x1874c: file /Users/todi/bloomlife/l2cap_reproducer/zephyr/subsys/bluetooth/host/conn.c, line 1059.
    (gdb) c
    Continuing.
    Warning:
    Cannot insert breakpoint 1.
    Cannot access memory at address 0x1874c

    Command aborted.

    I don't know, if it is just me on Mac, but all this tooling seem so fragile and buggy and costs soooo much time.

    Anyway, maybe you can already look into this, with the information, that I found out? Maybe, there is a unit test, that should cover this situation or maybe you could modify an existing unit test to cover that situation?

    best regards and thanks in advance
    Torsten

Reply
  • Hi  ,

    I think, I figured out, why the assert is firing. When a connection gets closed, bt_conn_tx_processor() in conn.c is destroying all pending SDUs on the transmit queue (line 1059). It pulls the SDUs from the queue, using l2cap_data_pull() from l2cap.c.

    That l2cap_data_pull() function not just pulls a SDU from the queue, but also adds PDU headers if necessary (that's what causes the assert later) and only removed the SDU from the queue, if the very last fragment of the SDU was transmitted.

    So, if the SDU has multiple fragments and not all fragments where send out, l2cap_data_pull() returns a SDU, but leaves the SDU on the transmit queue. A call to destroy_and_callback() will then set the data and __buf pointers to NULL. In the second iteration of the loop in conn.c, l2cap_data_pull() will be called again and will get the very same SDU from the queue. But this time, the __buf and data are set to NULL and thus, the call to net_buf_push() will fail with an assert.

    Without that assert enable, this will probably fail later with a dereferenced null pointer.

    It took me quite some time, to figure that out. My first attempt was to set watch points on the pointer that get set to zero. Maybe the watch points are not supported by nrf53, or they have to be set explicitly? And I had to restart the debugger for every run, as restarting the debug session ("monitor reset", "continue"), triggered and an assert in GNU gdb (Zephyr SDK 0.16.4) 12.1 (/Volumes/Workspace/build/.build/arm-zephyr-eabi/src/gdb/gdb/infrun.c:5825: internal-error: finish_step_over: Assertion `ecs->event_thread->control.trap_expected' failed.)

    Now, I've tried to figure out, why my example from above is not triggering the issue. But currently, I first have to figure out, why debugging is not working any more. While the very same setup worked a few days ago, now, when I start a debug session:

    Loading section .last_section, size 0x4 lma 0x60fb4
    Start address 0x0000bec4, load size 397218
    Transfer rate: 301 KB/sec, 7944 bytes/write.
    Resetting target
    (gdb) monitor flash breakpoints = 0
    Flash breakpoints disabled
    (gdb) b conn.c:1059
    Breakpoint 1 at 0x1874c: file /Users/todi/bloomlife/l2cap_reproducer/zephyr/subsys/bluetooth/host/conn.c, line 1059.
    (gdb) c
    Continuing.
    Warning:
    Cannot insert breakpoint 1.
    Cannot access memory at address 0x1874c

    Command aborted.

    I don't know, if it is just me on Mac, but all this tooling seem so fragile and buggy and costs soooo much time.

    Anyway, maybe you can already look into this, with the information, that I found out? Maybe, there is a unit test, that should cover this situation or maybe you could modify an existing unit test to cover that situation?

    best regards and thanks in advance
    Torsten

Children
Related