This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Zephyr uarte library missing UART_RX_BUF_RELEASED events

I run newest nRF53 chip as of writing with 1.6.0 nRF Connect SDK from nRF Connect Toolchain manager 3.6.1 using SES 5.50c.

I use a array of 5 rx-buffers (Increased from 3 to figure out what was going on) and handle UART_RX_BUF_REQUEST and UART_RX_BUF_RELEASED events in the registered callback.
Most of the time it runs fine but during a continuous test I notice that at random intervalls it may skip a UART_RX_BUF_RELEASED event which leads my application to detect a request of using a buffer that is not yet freed. If these buffers had been allocated dynamically this would have caused a memory leak. Instead I catch it and log it while actually allowing the reassigning of already assigned buffer.

Since I use the buffers round-robin and process data (transfer to other threads) during any UART_RX_RDY events I do not miss data but this behavior is worrying.
Is there a race-condition in the library?

Parents
  • Hi Kyrre,

    Sounds very much like a race condition. I need to understand more about the chain of events that happened before digging into the code. Could you please help me understand with these questions.

    1. Your application seems to be tracking which buffer is free and not instead of relying solely on these events. Is it possible that the application tracking logic might have a race condition? Could you rule that out?
    2. Could you provide me the chain of UART events that are received by your application (a log file with a way to see EVENTS/buffer mapping would help).
    3. What other application peripheral interrupts are enabled and being used when your application detected that a request arrived for a buffer that is not yet free?
  • My application checks the buffer reference and compares to it's list of buffers to identify the buffer number being freed.
    The buffers are allocated statically so these values/pointers never change.

    Buf FREE x means it has identified the referenced buffer as buffer number x from 0 to 4.

    It always responds to a buffer request with the next buffer in succession and starts the system by setting up the enable rx call using buffer 0. The Buf 1 request event has been logged inbetween something else during startup and thus was missing in the log but happened correctly.

    Since I run a total of 5 buffers to debug this issue I see it usually always frees a buffer immediately once full, however this does not appear to always happen which leads me to believe the problem is either in the hardware or in the library during the handling of the events. I am thinking that there may exist a short timeframe where a flag may be set by hardware buy is cleared by software before or after the execution of the rx events callback etc.

  • Thanks for the detailed explanation. 

    Kyrre Aalerud said:
    My application checks the buffer reference and compares to it's list of buffers to identify the buffer number being freed.

     This above explanation from you tells me that this most likely is a race condition in the uart driver where it missed this released event totally. I will create an internal ticket for the bug.

    It would be interesting info from you if you can confirm if the buffer with the missed released event had its buffer fully/partially filled? and also if the data in that buffer was relavant? Also would it be possible for you to give us a minimalistic project to reproduce this as this seems very timing related. This information will help the engineers focus on the code sections which are which cause this behavior. 

    Once again Thanks a lot for such a detail explanation and your experiments. But for now, since you are not doing dynamic allocation on request events and able to manage the round robin buffers in a good way (and also not losing data due to this bug) we hope that this bug will be fixed in the next releases.

  • for your reference, the internal ticket for the low level driver team to look at this is NCSDK-10801

  • Very rarely I do get events where an odd number of bytes are received. My test uses 4 byte blocks and we use timeout in the driver to create rx events in a timely fashion at the same time as we have a timeout for parsing data in 4 byte blocks. When the odd number of bytes arrive the first bytes are blocks of 4 and valid data. The last byte is garbage and since it is alone within the rx event timeline the next rx event is a normal 4xN byte event again and arrives so late that our timeout is triggered. The incomplete packet is then discarded and the new incoming data is parsed in it's place. The data arriving are ACK's to commands sent to a STSPIN32 based custom brushless motor-controller and it is very important to have reliable communications to it. We have had to use both 2 layers of timing (1 to get timely data parsing, 1 to secure resynchronization of datastream) and packet validation to have this communications work. The length of serial traces are approx 15mm.

    [01:32:09.484,832] <inf> hy5_uart: RX EVENT: rx.len=12 @ 5529484
    [01:32:09.484,863] <err> hy5_uart: Buf FREE: 1 array: 1 0 0 1 1
    [01:32:09.484,893] <err> hy5_uart: Buf REQ:  3 array: 1 1 0 1 1
    [01:32:09.484,924] <inf> hy5_uart: ACK 134 @ 181190162
    [01:32:09.484,924] <inf> hy5_uart: ACK 135 @ 181190162
    [01:32:09.484,954] <inf> hy5_uart: ACK 136 @ 181190162
    [01:32:09.488,922] <inf> hy5_uart: RX EVENT: rx.len=21 @ 5529488
    [01:32:09.488,983] <inf> hy5_uart: ACK 137 @ 181190295
    [01:32:09.488,983] <inf> hy5_uart: ACK 138 @ 181190295
    [01:32:09.488,983] <inf> hy5_uart: ACK 139 @ 181190295
    [01:32:09.489,013] <inf> hy5_uart: ACK 140 @ 181190295
    [01:32:09.489,013] <inf> hy5_uart: ACK 141 @ 181190296
    [01:32:09.688,323] <inf> hy5_uart: RX EVENT: rx.len=4 @ 5529688
    [01:32:09.688,354] <err> hy5_uart: RX Timeout. 5529688-5529488=200. Discarded incomplete data of 1 byte.
    

Reply
  • Very rarely I do get events where an odd number of bytes are received. My test uses 4 byte blocks and we use timeout in the driver to create rx events in a timely fashion at the same time as we have a timeout for parsing data in 4 byte blocks. When the odd number of bytes arrive the first bytes are blocks of 4 and valid data. The last byte is garbage and since it is alone within the rx event timeline the next rx event is a normal 4xN byte event again and arrives so late that our timeout is triggered. The incomplete packet is then discarded and the new incoming data is parsed in it's place. The data arriving are ACK's to commands sent to a STSPIN32 based custom brushless motor-controller and it is very important to have reliable communications to it. We have had to use both 2 layers of timing (1 to get timely data parsing, 1 to secure resynchronization of datastream) and packet validation to have this communications work. The length of serial traces are approx 15mm.

    [01:32:09.484,832] <inf> hy5_uart: RX EVENT: rx.len=12 @ 5529484
    [01:32:09.484,863] <err> hy5_uart: Buf FREE: 1 array: 1 0 0 1 1
    [01:32:09.484,893] <err> hy5_uart: Buf REQ:  3 array: 1 1 0 1 1
    [01:32:09.484,924] <inf> hy5_uart: ACK 134 @ 181190162
    [01:32:09.484,924] <inf> hy5_uart: ACK 135 @ 181190162
    [01:32:09.484,954] <inf> hy5_uart: ACK 136 @ 181190162
    [01:32:09.488,922] <inf> hy5_uart: RX EVENT: rx.len=21 @ 5529488
    [01:32:09.488,983] <inf> hy5_uart: ACK 137 @ 181190295
    [01:32:09.488,983] <inf> hy5_uart: ACK 138 @ 181190295
    [01:32:09.488,983] <inf> hy5_uart: ACK 139 @ 181190295
    [01:32:09.489,013] <inf> hy5_uart: ACK 140 @ 181190295
    [01:32:09.489,013] <inf> hy5_uart: ACK 141 @ 181190296
    [01:32:09.688,323] <inf> hy5_uart: RX EVENT: rx.len=4 @ 5529688
    [01:32:09.688,354] <err> hy5_uart: RX Timeout. 5529688-5529488=200. Discarded incomplete data of 1 byte.
    

Children
  • Thank you for this data. This is very useful for us to understand. Any chances that you can share your project that we can run on our DK, will make it easy for us to reproduce and debug. If we try to replicate it we might end up spending a lot of time searching for that timing critical event which triggers this.

  • Anything update plz?

    We met the same issue, and yet unable to determinate the reason.

    Currently,

    I found the normal function called chain when uarte rx receiving bytes:

    isr(rx started) -> callback(new buffer rsp) -> isr(end rx) -> callback(buffer release)

    PS: leave out the rx ready event cause it comes quiet frequently and randomly.

    When we met this issue, we found isr(end rx) is not generate(missing endrx event or endrx interrupt mask), thus callback(buffer release) was not execute. i.e. isr(rx started) comes twice.

Related