ACL Buffer Overflow Results in SVC Exception Triggered by BT RX Task

We are experiencing an issue in which the available number of HCI Driver ACL buffers reaches 0 during modes of operation on our device that require higher volume of traffic over BLE.  This highlighted some design issues we had but the failure mode that occurred as a result was concerning.  When the buffer becomes full, we're seeing that other processes in they system that are responsible for pulling data out of the buffer and freeing them up are no longer running properly.  After a period of time (5-10 seconds) of running in the state of a full buffer, a piece of code in the BT RX thread issues an SVC instruction and the application ends up running the exception handler.  While I understand that getting to this state highlights some errors in our design, I wouldn't expect this scenario to result in an exception if the management of buffers was being done properly.  Instead, I would expect communication failures to occur, with errors bubbling up the stack so that the application becomes aware and can address or report the issue.

Below is a brief description of our configuration and the operational scenario that can result in this error.

  • Our device is the peripheral device in the BLE connection and the central device is a Samsung S20 FE.
  • The negotiated MTU size is 23 Bytes, allowing for 20 byte payloads.  Nominally, we would negotiate 247 byte MTUs.  However, in this case, the 247 byte MTU was not negotiated properly, resulting in the use of the default 23 byte MTU.
  • We currently have the following ACL buffer KCONFIG settings:

                    CONFIG_BT_BUF_ACL_TX_COUNT=40
                    CONFIG_BT_BUF_ACL_RX_COUNT=6

                    CONFIG_BT_HCI_ACL_FLOW_CONTROL=y

                    CONFIG_BT_BUF_ACL_TX_SIZE=251
                    CONFIG_BT_BUF_ACL_RX_SIZE=251

  • We are in the middle of a firmware update using a proprietary protocol, sending commands from our mobile app to our device that have 941 byte payloads.  With the 23 byte MTU (20 bytes of available payload for us), it requires 48 MTUs to send the 941 bytes down.  When using the BLE sniffer, it looks like these are mostly being stuffed into a single connection interval, sometimes taking 2 connection intervals to complete.

I built the application with the settings CONFIG_BT_DEBUG_HCI_DRIVER and CONFIG_NET_BUF_LOG enabled and observed that the available buffer space does regularly drop to 1 but it normally frees up as data is pulled from it and doesn't hang.  I have attached a file with the output of the RTT that captures the failure case.  In this case, the available buffer space drops to 0 and then never recovers.  This starts at time stamp 1:56.891,174 in the file.  The last message received in the RTT output is at 2:05.079,925, as the application then enters the exception handler due to the SVC instruction.  

We increased the size of the ACL RX buffers from 6 to 50 and used a debug variable to monitor the minimum available buffers during the firmware update process.  In this case, the lowest it drops to is 36, indicating an increase to 20 could be sufficient for us not to experience the error.  But it is concerning that, if we ever did experience the error, that the unit fails the way it does, so this ticket is being written to try and understand why the exception is occurring and whether there might be something in our implementation that can be changed to avoid it.

Below is a set of questions I am looking for answers on:

  • Is there a more efficient way to configure the ACL buffers so that, in the case of a 23 byte negotiated MTU, the buffer size adjusts to 27 bytes rather than the 251 configured in the KCONFIG file? 
    • This would be a more efficient use of RAM and would allow for a deeper buffer for the smaller payloads.  We are allocating 6*251 bytes in our KCONFIG file, which would, if used more efficiently, be enough space for buffer depth of 55 when dealing with the 27 byte DLE (23 byte MTU).
  • What function in the driver is responsible for freeing the ACL buffers after the data is pulled from them?  My understanding is that it would occur at some point in the call stack to the callback that is defined by BT_GATT_CHARACTERISTIC when we setup our receive characteristic within the service definition using BT_GATT_SERVICE_DEFINE.  However, I could find the exact location where that would occur within the call stack.
  • I also could not find who was issuing the SVC instruction.  In the exception handler, I was able to trace the call stack back to the section _oops of the z_arm_svc, but I don't know who is actually issuing the instruction. 
    • Is there anything in the stack that I can look at that would provide more information as to who might be calling it?  I do know that the current thread at the time of the exception is BT RX, which is the cooperative thread used for BLE receive functionality.

Any guidance would be much appreciated.  Thanks!

RTT Output.txt

  • Hi!

    the BT RX thread issues an SVC instruction and the application ends up running the exception handler.

    Do you have any log output that prints the exception info?

    What version of nRF Connect SDK are you using?

    When using the BLE sniffer

    Do you have a sniffer log for when this issue occurs? Could you share the sniffer log?

  • Hello, it has been a long time since I visited on this issue but I wanted to return to it so I can close things out.  We have been running with increased MTU buffers and resolved the MTU size negotiations in our build and things are working fine.  However, I would like to understand the root cause of the issue a little better in case something similar happens in the future.  I have added responses to the questions I can answer but need a little guidance with one of them.  Please see below:

    the BT RX thread issues an SVC instruction and the application ends up running the exception handler.

    Do you have any log output that prints the exception info?

    Is there a logging configuration that I can use to enable printing of this information?  Is there a specific piece of information you are looking for?  I believe when I get into the handler, all that is known is that an SVC instruction was issued.  But if you have something I can add to try to get more information, I'm all ears.

    What version of nRF Connect SDK are you using?

    Zephyr:  v3.3.99-ncs1-rc2

    nrfxlib:  v2.4.0

    nrf:  v2.4.0

    When using the BLE sniffer

    Do you have a sniffer log for when this issue occurs? Could you share the sniffer log?

    Sniffer capture is attached.  You can see the last command is responded to at line no. 168943.  A command is sent down starting at 168944 and ending at 169038.  This command is not supposed to have a response.  But the command sent down on 169046 should and does not.  Retries are attempted every two seconds thereafter for 100 seconds.  At that point, the app that is managing the firmware update seems to get into a bad state and the retries start to come at a higher frequency.  But, the exception occurs in the 5-10 seconds after the target stops responding.

    FirmwareUpdate553b013to554 (S9).pcapng

Related