We are experiencing an issue in which the available number of HCI Driver ACL buffers reaches 0 during modes of operation on our device that require higher volume of traffic over BLE. This highlighted some design issues we had but the failure mode that occurred as a result was concerning. When the buffer becomes full, we're seeing that other processes in they system that are responsible for pulling data out of the buffer and freeing them up are no longer running properly. After a period of time (5-10 seconds) of running in the state of a full buffer, a piece of code in the BT RX thread issues an SVC instruction and the application ends up running the exception handler. While I understand that getting to this state highlights some errors in our design, I wouldn't expect this scenario to result in an exception if the management of buffers was being done properly. Instead, I would expect communication failures to occur, with errors bubbling up the stack so that the application becomes aware and can address or report the issue.
Below is a brief description of our configuration and the operational scenario that can result in this error.
- Our device is the peripheral device in the BLE connection and the central device is a Samsung S20 FE.
- The negotiated MTU size is 23 Bytes, allowing for 20 byte payloads. Nominally, we would negotiate 247 byte MTUs. However, in this case, the 247 byte MTU was not negotiated properly, resulting in the use of the default 23 byte MTU.
- We currently have the following ACL buffer KCONFIG settings:
CONFIG_BT_BUF_ACL_TX_COUNT=40
CONFIG_BT_BUF_ACL_RX_COUNT=6
CONFIG_BT_HCI_ACL_FLOW_CONTROL=y
CONFIG_BT_BUF_ACL_TX_SIZE=251
CONFIG_BT_BUF_ACL_RX_SIZE=251
- We are in the middle of a firmware update using a proprietary protocol, sending commands from our mobile app to our device that have 941 byte payloads. With the 23 byte MTU (20 bytes of available payload for us), it requires 48 MTUs to send the 941 bytes down. When using the BLE sniffer, it looks like these are mostly being stuffed into a single connection interval, sometimes taking 2 connection intervals to complete.
I built the application with the settings CONFIG_BT_DEBUG_HCI_DRIVER and CONFIG_NET_BUF_LOG enabled and observed that the available buffer space does regularly drop to 1 but it normally frees up as data is pulled from it and doesn't hang. I have attached a file with the output of the RTT that captures the failure case. In this case, the available buffer space drops to 0 and then never recovers. This starts at time stamp 1:56.891,174 in the file. The last message received in the RTT output is at 2:05.079,925, as the application then enters the exception handler due to the SVC instruction.
We increased the size of the ACL RX buffers from 6 to 50 and used a debug variable to monitor the minimum available buffers during the firmware update process. In this case, the lowest it drops to is 36, indicating an increase to 20 could be sufficient for us not to experience the error. But it is concerning that, if we ever did experience the error, that the unit fails the way it does, so this ticket is being written to try and understand why the exception is occurring and whether there might be something in our implementation that can be changed to avoid it.
Below is a set of questions I am looking for answers on:
- Is there a more efficient way to configure the ACL buffers so that, in the case of a 23 byte negotiated MTU, the buffer size adjusts to 27 bytes rather than the 251 configured in the KCONFIG file?
- This would be a more efficient use of RAM and would allow for a deeper buffer for the smaller payloads. We are allocating 6*251 bytes in our KCONFIG file, which would, if used more efficiently, be enough space for buffer depth of 55 when dealing with the 27 byte DLE (23 byte MTU).
- What function in the driver is responsible for freeing the ACL buffers after the data is pulled from them? My understanding is that it would occur at some point in the call stack to the callback that is defined by BT_GATT_CHARACTERISTIC when we setup our receive characteristic within the service definition using BT_GATT_SERVICE_DEFINE. However, I could find the exact location where that would occur within the call stack.
- I also could not find who was issuing the SVC instruction. In the exception handler, I was able to trace the call stack back to the section _oops of the z_arm_svc, but I don't know who is actually issuing the instruction.
- Is there anything in the stack that I can look at that would provide more information as to who might be calling it? I do know that the current thread at the time of the exception is BT RX, which is the cooperative thread used for BLE receive functionality.
Any guidance would be much appreciated. Thanks!