Beware that this post is related to an SDK in maintenance mode
More Info: Consider nRF Connect SDK for new designs
This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Bug or timing issue in nrf_libuarte_async.c?

We are using the Libuarte - advanced UARTE driver and sometimes see crashes in the uart_evt_handler code in nrf_libuarte_async.c when it gets a NRF_LIBUARTE_DRV_EVT_RX_DATA event with no data in the buffer because the sub_rx_count index is pointing just past the end of the buffer.  I.e.- 

p_libuarte->p_ctrl_blk->sub_rx_count == p_evt->data.rxtx.length
We are receiving a few messages per second of various sizes from dozens to a few hundred bytes.  This error occurs after from 30 minutes to a couple of hours of running.  I have a theory about how it is happening and think that perhaps the library code needs a small fix.
The advanced driver is configured to use a TIMER to count the bytes received, and an RTC as a retriggerable timer to watch for timeouts since the last byte was received.  The nrf_libuarte_async.c has separate interrupt handlers for the RTC timeout and for the UART ENDRX event that is triggered when the DMA buffer is full.  We have the RTC timeout set to a very short time (200 microseconds).  What I think is happening is that the timeout is being triggered just as a new byte is being received due to a short gap in the packet.  We should probably set the timeout longer, but I think there is a problem that can show up regardless of the timeout if a new byte comes in at just the right time relative to the timeout.  Here is the sequence of what I think is happening:
- A short gap between bytes of a packet causes the TIMER to generate the timeout interrupt
- At the same time another byte is coming in, just late enough that it doesn't stop the timeout from being triggered, but early enough that by the time the interrupt handler reaches the 
nrf_libuarte_async_timeout_handler line that reads the TIMER's CC the count has increment by one extra byte.
- The new byte gets written to RAM by EasyDMA.  If it has filled the buffer, it triggers another interrupt for the ENDRX event, but it is pending since it has the same priority as the RTC interrupt already being serviced.
- The nrf_libuarte_async_timeout_handler processes all of the newly received data, including the byte that squeezed in at the last moment (between the timeout and the processing of it).  Then it bumps the sub_rx_count index to just beyond the received data.  But because the buffer was filled, the sub_rx_count is just past the end of the buffer.
- The nrf_libuarte_async_timeout_handler finishes, and when it returns, the pending interrupt to handle the ENDRX event calls the uart_evt_handler.  This sees the 
NRF_LIBUARTE_DRV_EVT_RX_DATA event and compares the sub_rx_count to the size of the buffer to see how many bytes were recently received before reaching the buffer boundary.  But because sub_rx_count was bumped up by the timeout handler, it sees that there is no data left to process.  The code flags this as an error and then stops due to the APP_ERROR_CHECK_BOOL(false) on line 276.  We see this happen roughly once per hour (30 minutes to 2 hours+)

I don't understand the reason to throw an error for no data, and it seems like this rare timing alignment can trigger the error when there really is no problem - just a gap about length of the timeout.  (Much shorter or longer than the timeout would be fine.  But the host we are talking to can have variable gaps between bytes sometimes.)  It seems to me that the check for zero data on a NRF_LIBUARTE_DRV_EVT_RX_DATA is not needed, and in fact can intermittently trigger when a problem does not exist.

Is there a way around this without editing the nrf_libuarte_async.c library code?

If it matters, here are some more details:
- NRF5 SDK 17.0.2
- nRF52833
- 115200 baud (so each byte takes about 87 microseconds)
timeout_us = 200
- RTC, TIMER and UARTE interrupt priorities all set to 6
Parents
  • Hi,

    I have forwarded this to the libuarte developer.

  • Followup:  We commented out the 

    APP_ERROR_CHECK_BOOL(false);
    on line 275 of nrf_libuarte_async.c and our application hasn't had a since communications error since then (3 weeks).  Before that it was happening every couple of hours.
  • Hi,

    I got some feedback from the developer:

    The scenario you describe is supposed to be handled by libuarte. TIMER used for counting RXRDY events is latched/captured by PPI from timeout COMPARE event. UARTE is configured to higher interrupt priority than TIMER or RTC so when both happens at once UARTE will be handled first. Timeout handler is disabling UARTE interrupt while processing timeout to avoid interruption by UARTE.

    There is only one think that comes to my mind. Timeout event is triggered by RXRDY event from UART which is generated when byte received, ENDRX event is generated when byte is copied into RAM so it happens few cycles later and may be delayed if there are other masters using RAM (bluetooth?). If it is delayed then handling of timeout may precede ENDRX handling and that is not handled by libuarte. In our tests we don't use other bus masters so we don't see this issue.

    In order to confirm that, I would recommend to add small delay (nrf_delay_us(5)) in `nrf_libuarte_async_timeout_handler` at the beginning before NRFX_IRQ_DISABLE((IRQn_Type)NRFX_IRQ_NUMBER_GET(p_libuarte->p_libuarte->uarte)); this will allow UARTE interrupt to preempt when ENDRX event occurs.

Reply
  • Hi,

    I got some feedback from the developer:

    The scenario you describe is supposed to be handled by libuarte. TIMER used for counting RXRDY events is latched/captured by PPI from timeout COMPARE event. UARTE is configured to higher interrupt priority than TIMER or RTC so when both happens at once UARTE will be handled first. Timeout handler is disabling UARTE interrupt while processing timeout to avoid interruption by UARTE.

    There is only one think that comes to my mind. Timeout event is triggered by RXRDY event from UART which is generated when byte received, ENDRX event is generated when byte is copied into RAM so it happens few cycles later and may be delayed if there are other masters using RAM (bluetooth?). If it is delayed then handling of timeout may precede ENDRX handling and that is not handled by libuarte. In our tests we don't use other bus masters so we don't see this issue.

    In order to confirm that, I would recommend to add small delay (nrf_delay_us(5)) in `nrf_libuarte_async_timeout_handler` at the beginning before NRFX_IRQ_DISABLE((IRQn_Type)NRFX_IRQ_NUMBER_GET(p_libuarte->p_libuarte->uarte)); this will allow UARTE interrupt to preempt when ENDRX event occurs.

Children
No Data
Related