connectivity_bridge: malfunction due to improper use of system work queue

After upgrading nRF Connect SDK to v2.1.0 I noticed that my modified version of connectivity bridge (when running on thingy:91 nrf52840), stops working 30-60 seconds after a BLE client is connected.

nrf9160 runs the original gnss sample, UART0.TX is connected to UART0.RX on nrf52840; connectivity_bridge forwards it to USB UART and BLE.

The changes I've made to connectivity_bridge are pretty minor: the UARTE driver from Zephyr is replaced with my own implementation, which is well tested, more reliable, easier to use and offers better performance, lower latency and CPU usage. Due to the lower latency of my UART driver, the receive call returns as soon as data is available. This generates a large number of small packets per second, that are sent over to BLE, and results in an increased probability of reproducing an existing design flaw in connectivity_bridge.

To make the case more predictable, I replaced my UART driver with a synthetic test: send a burst of 10 packets, 40 bytes each with a pause of 2 ms, then wait for 1 second, and repeat. The proof of concept code called connectivity_bridge_poc is attached bellow. To switch between synthetic text and UART: change #define UART_TEST in uart_handler.c. To enable the fix, #define BLE_FIX 1 in ble_handler.c.

connectivity_bridge_fix.7z

The following debug trace is produced, see crash-log.txt, and crash-log.png

Notice how inside the system work queue bt_send_work_handler, a warning is printed before the bt_nus_send call, followed by an info message. This is used to track when bt_nus_send is entered and exit. In the trace there is an enter event at 24.562,133, which is not followed by an exit event. At this moment, the system work queue is blocked by bt_nus_send, and as a result event handlers are no longer called to free slab buffers and continue normal communication. The following error messages indicate that the synthetic test continues to queue events with data to be sent, and when 250 events are queued, the system reboots. With the test disabled, the UART task will starve for slab buffers and stop. If the BLE client disconnects, normal operation is restored.

A solution to this issue is provided in connectivity_bridge_thingy, where a dedicated thread is defined for BLE to process bt_nus_send calls. The thread sleeps when no data needs to be send. Instead of queueing work to the system queue, a call is made to wake the processing thread, which then calls bt_send_work_handler, and bt_nus_send. There is also a variant for nrf9160dk inside connectivity_bridge_9160dk, boards packages are provided to connect nrf9160 UART1 to nrf52840.

This opens the question: why does bt_nus_send stall the system work queue if it works correctly when called from a user thread? The likely answer to this is provided in Workqueue Best Practices https://docs.zephyrproject.org/3.0.0/reference/kernel/threads/workqueue.html#workqueue-best-practices

There are certain actions that should be avoided when running in the context of a system work queue, due to the fact that not all system services are available, and it is very likely that bt_nus_send is making such calls, e.g. whenever the BLE hardware is busy and the call needs to wait. Wait and synchronous mutex/semaphore access must not be performed and this context.

Nordic Semiconductor is advised to review usb_cdc_handler.c and determine is the cdc_dtr_work_handler performs any actions that are not safe inside of a system work queue.

Advice on writing reliable code: note how connectivity bridge attempts to detach every single component: UART, USB CDC and BLE. Each component needs large buffers to avoid drops, yet the buffers are not utilised well when a large number of small packets are transferred, so data may still be dropped. The code is over complicated with data travelling several hops until it arrives at its destination. Debugging and tracing is complicated, and additionally discouraged by the fact that the BLE soft device triggers a reboot if we try to step in the debugger. With my UART driver I can receive and send in the same task and this will never drop any bytes. The driver processes all requests in the background and allows the task to send and receive more packets.

My first attempt to use an UART in nRF Connect SDK resulted in 300 lines of complex code copied from an existing sample. Where all I needed was a simple synchronous send. That's a good example of bad SDK coding practice. Most users need a simple send and don't want to be bothered to service interrupts.

connectivity_bridge has been an open issue since 2018, many fixes were attempted by Nordic engineers, yet it still remains unreliable four years later. Feel free to integrate my code in nRF SDK and fix it for good. I will be happy to see official support for nrf9160dk.

Note: the UART pin configuration for nrf52840 on nrf9160dk is intentionally crossed, to simplify the connectivity_bridge implementation, where data received from nrf9160 is sent over the same UART to USB and vice versa.
nrf52840 UART0.RX connects to nrf9160 UART1.TX
nrf52840 UART1.TX connects to nrf9160 UART1.RX
nrf52840 UART1.RX connects to USB UART.TX
nrf52840 UART0.TX connects to USB UART.RX

Parents Reply Children
No Data
Related