This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

nRF51422 SDK11 app_uart_fifo race condition and fix

Yet another app_uart_fifo question. I'm using SDK 11 with uVision. It is my expectation that, if using HW flow control and a FIFO to RX, no received characters should be dropped at all, under any conditions.

I've noticed errors thrown (UART driver error, FIFO error) and characters lost in the following scenario:

  • nRF51DK board
  • serial configured in HW flow control mode, 115200 baud, connected to Segger chip (default config) on pins 11 (RX) 8 (RTS) 10 (TX) 9 (CTS). This would be the default setup, except a higher bitrate.
  • no softdevice used
  • using RX buffer of 32 bytes, but larger values only delay the issue.

The app flow is as follows:

  • from uart handler when data is available I'm setting a flag which is atomically checked from main
  • from main I'm reading the UART. When I have exactly 20 bytes, I'm printing them out (just for debugging, issue happens even if I don't print, just sleep from time to time). Then main goes to WFE() for the next 20B packet.

The input data is as follows: 1234567890ABCDEFGHIJ repeated 150 times (20Bx150=3KB), pasted in a terminal all at once as a long line no separators.

I've scoped it and I know bytes are properly being sent from the Segger bridge, RTS works correctly. Further more, if I count the uart driver RX interrupts, I get 3000 so all bytes actually arrive in the uart interrupt handler.

Digging for a few days I found a race condition in app_uart_fifo.c between uart_event_handler(NRF_DRV_UART_EVT_RX_DONE) and app_uart_get: The ISR handler can interrupt the get function which runs from main at a bad time, resulting in lost bytes and potentially double nrf_drv_uart_rx. This in turn can lead to UART HW FIFO overflow and/or software FIFO overflow.

The main issue in SDK11 is that, in app_uart_get we check for length and read a byte in a non-atomic way. It is a pretty obvious race once you see it.

I've checked SDK12 and while I think it improves the reliability I don't think it resolves the issue 100%.And I don't like it calls the uart app handler for every byte received. I think SDK11, which calls the app uart handler only when queue is empty is very elegant and efficient and preferable to SDK12 implementation.

I'm proposing the following patch which introduces two new functions:

  • in app_fifo.c: uint32_t app_fifo_read_is_full(app_fifo_t * p_fifo, uint8_t * p_byte, uint32_t* p_sz, bool* p_full) - this reads an array of bytes and atomically checks for full queue and moves the read pointer. The critical section is about 800ns.
  • in app_uart_fifo.c: uint32_t app_uart_get_array(uint8_t* p_byte, uint32_t* in_out_size) which uses the first function to read and test for full queue; if queue was full at the time of read, it schedules a new hardware rx.

Since I'm doing the memcpy's outside the critical section it is expected that app_fifo_read_is_full called on a fifo of size N may return <N bytes and is_full = true, if more characters arrive after or during the memcopy and before length was checked. This should not be an issue since application can easily compensate by reading again.

The above functions are only safe to be called from main. I've stressed the link using the attached project over a few hours transferring 10MB+ with no issues.

Here is a link with the project (sorry, couldn't figure how to properly embed code in here) www.dropbox.com/.../app_uart_TEST.zip

Thank you and please excuse the long post.

  • Hi shadow,

    Could you elaborate a little bit more on how you detect the error ? Please correct me if I'm wrong but in app_uart_get() we basically pop one byte out from the fifo and in uart_event_handler(NRF_DRV_UART_EVT_RX_DONE) we add one byte into the fifo. So even if the event handler interrupt the app_uart_get() it shouldn't cause a byte missing or duplicating bytes ? Not 100% sure there is a race condition here.

    One thing we tried to fix in the later SDK is to pop the byte out before checking if it's full or not as we may end up receiving byte before we get the byte out at the end of the function.

  • Bui, sorry if what follows is not super-clear as I said it took me a few days to go thru this.

    I have source and examples to demonstrate each point if that helps.

    First, the condition to start a read is wrong in app_uart_get. I will prove with an example: Let's have a FIFO with N-2 elements. Another character arrives and uart_event_handler(NRF_DRV_UART_EVT_RX_DONE) gets called. Since it adds one element, FIFO now has N-1 elements, the ISR calls nrf_drv_uart_rx.

    app_uart_get gets called, FIFO_LENGTH(m_rx_fifo) == m_rx_fifo.buf_size_mask is true so app_uart_get also calls nrf_drv_uart_rx. Now, if the timing of next byte is just right, we will receive 2 characters and error out with APP_UART_FIFO_ERROR. This is pretty easy to reproduce with a short 32B RX queue and say 100 chars pasted at once into the serial terminal.

    I tried fixing app_uart_get by only starting a new RX when FIFO_LENGTH(m_rx_fifo) > m_rx_fifo.buf_size_mask but of course then we're scheduling a RX before reading the full FIFO, potentially overflowing it before we get to read from it. There is no getting around the fact: we first must read the FIFO before we schedule RX.

    So I modified app_uart_get to read FIFO first and then check for length and schedule RX if length is N-1 (we just read the last element).

    However this does not fix the issue, as next example proves:

    Having a FIFO with N-1 elements. RX is active waiting for the next byte. app_uart_get gets called, dequeues 1B (we're at N-2 bytes in FIFO) and execution is interrupted by ISR delivering the N-1'th byte. ISR also schedules the next RX. Execution is resumed however now FIFO_LENGTH = N-1 which makes it look like we just read the last element and app_uart_get schedules RX. Therefore we again have 2 RX'es scheduled and we may receive 2 bytes, errorin out on the FIFO overflow.

    Proposed solution is to make fifo read-and-test-if-it-was-full an atomic operation.

    Since reading now involves a critical section, I just made a get array for ehnhanced performance.

    PS this html editor is horrible - no newlines!

  • Thanks for the explanation. I agree on the part that we should read the byte out of the buffer before we start RX again. This was the change in the newer SDK as I mentioned above. In the newer SDK we also do not read the number of bytes in the buffer inside app_uart_get() but we use a global variable that being set inside uart_event_handler(NRF_DRV_UART_EVT_RX_DONE) and the fifo is full (m_rx_ovf).

    Doing this we avoid scheduling RX twice when we have only one slot as you described.

    However we may still have a race condition situation when m_rx_ovf is not set when we read it, but then it 's set when we are getting the byte from fifo out. This will result in that we don't schedule RX when we should. But, this can be solved if we keep calling app_uart_get() until the buffer is empty.

  • Thanks. I'll take a closer look at SDK12 implementation. I didn't like that SDK12 calls the application uart callback for every byte received. That is IMHO not needed so will also look if that can be improved to only call the callback if the FIFO is empty, on first received byte.

  • I'm not 100% certain on why we changed the event handling. I assume, in some application we don't want to handle every time there is a byte in the buffer but rather wait til there is a number of byte in the buffer to extract. Then having an event for each of the byte would be useful.

Related