Zephyr UART Async receive drops a single character

I'm using the async API for UART and I have noticed that, shortly after initialisation, I receive a single character with the UART_RX_READY event. This character is 0xff - and I suspect that's just the GPIO lines changing state. This character is easy to ignore, however it seems to break the internal state of the RX API. All future calls will have an offset of +1, but the total length (of all API requests in aggregate) will be correct. This causes us to ignore the first character in our buffer, and take one more than we should (not out-of-bounds, but stale/uninitialised memory).

I suspect we're getting an interrupt that we received data during the initialisation process, and that's throwing us off.

To fix this, I wait for the first UART_RX_READY event, call uart_rx_disable(), and then wait for the UART_RX_DISABLED message. Once that's done, I re-enable with uart_rx_enable() and the system seems to work reliably. I've got four of these devices in the field reading a couple thousand octets per day, and so far it seems to be working just fine!

I'm using SDK v2.2.0, but didn't see any relevant changes in the changelog.

Here's some partial code which has the complete TX/RX path:

static int setup(void) {
    int rc;

    LOG_DBG("Initialising serial tunnel...");

    rc = gpio_pin_configure_dt(&tunnel_power_pin_spec, GPIO_OUTPUT_INACTIVE);
    if (rc < 0)
    {
        LOG_ERR("Failed to configure power control pin.");
        return rc;
    }

    rc = gpio_pin_configure_dt(&tunnel_power_led_spec, GPIO_OUTPUT_INACTIVE);
    if (rc < 0)
    {
        LOG_ERR("Failed to configure power LED.");
        return rc;
    }

    if (!device_is_ready(tunnel_port))
    {
        LOG_ERR("Serial tunnel device is not ready!");
        return -ENODEV;
    }

    rc = uart_callback_set(tunnel_port, uart_callback, NULL);
    if (rc < 0)
    {
        LOG_ERR("Failed to set UART callback.");
        return rc;
    }

    /* Set up the initial receive buffer. */
    rx_buf1.is_in_use = true;
    rx_buf2.is_in_use = false;
    memset(rx_buf1.buffer, '?', sizeof(rx_buf1.buffer));
    memset(rx_buf2.buffer, '?', sizeof(rx_buf2.buffer));
    rc = uart_rx_enable(tunnel_port, rx_buf1.buffer, sizeof(rx_buf1.buffer), CONFIG_OPITO_SERIAL_TUNNEL_RECEIVE_TIMEOUT_US);
    if (rc < 0) {
        LOG_ERR("Failed to enable UART RX.");
        return rc;
    }

    return 0;
}


static void uart_callback(const struct device *dev, struct uart_event *evt, void *user_data) {
    /* There's a strange bug with the DMA backing this async UART driver.
     * We get a single octet on boot (likely caused by a wiggle on the UART
     * line), and that introduces an off-by-one error that never goes away.
     * The fix seems to be resetting the UART after we get that character.
     */
    static bool destroyed_first_message = false;

    switch (evt->type) {
    case UART_RX_RDY:
    {
        if (!destroyed_first_message) {
            uart_rx_disable(dev);
            return;
        }

        struct opito_serial_tunnel_module_event *msg = new_opito_serial_tunnel_module_event();

        const uint8_t *ptr = &evt->data.rx.buf[evt->data.rx.offset];
        memcpy(msg->data.data_raw.buffer, ptr, evt->data.rx.len);
        msg->data.data_raw.length = evt->data.rx.len;
        msg->data.data_raw.timestamp = k_uptime_get();
        msg->type = OPITO_EVT_SERIAL_TUNNEL_DEVICE_DATA_RX;
        APP_EVENT_SUBMIT(msg);
        break;
    }

    case UART_RX_BUF_REQUEST:
    {
        if (!rx_buf1.is_in_use) {
            rx_buf1.is_in_use = true;
            uart_rx_buf_rsp(dev, rx_buf1.buffer, sizeof(rx_buf1.buffer) - 1);
        } else if (!rx_buf2.is_in_use) {
            rx_buf2.is_in_use = true;
            uart_rx_buf_rsp(dev, rx_buf2.buffer, sizeof(rx_buf2.buffer) - 1);
        }

        /* If we didn't manage to hand over a buffer, we'll get the disabled event. */
        break;
    }

    case UART_RX_BUF_RELEASED:
        struct rx_buf *buf = CONTAINER_OF(evt->data.rx_buf.buf, struct rx_buf, buffer);
        buf->is_in_use = false;
        break;

    case UART_RX_DISABLED:
        rx_buf1.is_in_use = true;
        rx_buf2.is_in_use = false;
        uart_rx_enable(tunnel_port, rx_buf1.buffer, sizeof(rx_buf1.buffer), CONFIG_OPITO_SERIAL_TUNNEL_RECEIVE_TIMEOUT_US);

        if (!destroyed_first_message) {
            destroyed_first_message = true;
        } else {
            /* This is buffer overrun - as we weren't able to pass another buffer to the UART! */
            SEND_ERROR(opito_serial_tunnel, OPITO_EVT_SERIAL_TUNNEL_ERROR, -ENOBUFS);
        }
        break;

    default:
        break;
    }
}
Let me know if I've missed anything! For now, our workaround is fine - but it seems like a bug that will bite again in future :)

Parents
  • Hi dan_opito,

    I cannot see anything obviously standing out in your code just by reading, so I guess I will need to try to reproduce it. Do you observe this issue on a nRF9160?

    Meanwhile, the Bluetooth: Peripheral UART sample handles UART event somewhat like your workaround, so I wonder if it is how it is supposed to work. 

    I don't have a nRF9160 DK ready and also ran out of time today, but I will continue to look into this tomorrow.

    By the way, are your RX Buffers something like this?

    struct rx_buf
    {
    	bool is_in_use;
    	uint8_t buffer[TEST_BUFFER_SIZE];
    };
    
    struct rx_buf rx_buf1 = {0};
    struct rx_buf rx_buf2 = {0};

    Hieu

  • Yes, sorry, this is on our nRF9160 board. We also have an nRF91 Thingy - not sure we have the DK but I think this is likely to be hardware independant. In case not, here's the device tree overlay for it:

    &uart2 {
    	status = "okay";
    	current-speed = < 4800 >;
    
    	pinctrl-0 = <&tunnel_default>;
    	pinctrl-1 = <&tunnel_sleep>;
    	pinctrl-names = "default", "sleep";
    };
    
    &pinctrl {
    	tunnel_default: tunnel_default {
    		group1 {
    			psels = <NRF_PSEL(UART_TX, 0, 15)>,
    				    <NRF_PSEL(UART_RX, 0, 13)>;
    			bias-pull-up;
    			// psels = <NRF_PSEL(UART_TX, 0, 24)>,
    			// 	    <NRF_PSEL(UART_RX, 0, 23)>;
    		};
    	};
    
    	tunnel_sleep: tunnel_sleep {
    		group1 {
    			psels = <NRF_PSEL(UART_TX, 0, 15)>,
    					<NRF_PSEL(UART_RX, 0, 13)>;
    			// psels = <NRF_PSEL(UART_TX, 0, 24)>,
    			// 	    <NRF_PSEL(UART_RX, 0, 23)>;
    			bias-pull-up;
    			low-power-enable;
    		};
    	};
    };

    You've figured out the RX buffer structure - here's my definition for completeness:

    /* Receive buffer structure to manage our data from the device. */
    /* NOTE: CONFIG_OPITO_SERIAL_TUNNEL_RECEIVE_BUFFER_LENGTH is set to 256 */
    
    struct rx_buf {
        /* Holds the received data. */
        uint8_t buffer[CONFIG_OPITO_SERIAL_TUNNEL_RECEIVE_BUFFER_LENGTH];
        /* True when the UART driver is using the buffer. */
        bool is_in_use;
    };
    
    /* Our ping-pong buffers for receiving data from the device. */
    static struct rx_buf rx_buf1;
    static struct rx_buf rx_buf2;

    The Bluetooth example appears to be disabling once it receives a full packet - but that very well could mask the issue. When reproducing, I'd suggest treating the data as a never-ending stream (rather than packetised transport) - in which case you would never disable the buffer.

    It may well be that I've misunderstood how it's supposed to work - but I feel like disabling UART RX is opening ourselves to dropped characters?

  • Thank you for the information. I too don't think this issue is exclusive to hardware.

    I couldn't get to reproducing it as I had planned, so I don't have anything meaningful to feedback to you yet. My apology. I will continue at it and update you as soon as I find anything.

Reply Children
  • Hi dan_opito,

    Sorry for the wait. I ended up getting sick from last weekend and have been only able to work for 1-3 hours each day since.

    I was able to create a test setup with my 2x nRF52840 DKs. I produced an issue, however I am not sure if it is the same issue. My issue happens when a buffer is full and the second buffer is switched in. Then I see that the buffer is written correctly, but the offset and length are 1 short in some situations.

    This happen regardless of whether the last or first byte transmitted is 0xFF or not.

    Interestingly, no byte is lost... However, it does lead to an issue that one byte is forever detected late by one transmission the entire process, until, for again unclear reasons, the state of the buffer becomes correct again.

    I will have to discuss this issue to our internal team for analysis. However, as I am on sick leave, that will have to be next week. My apology for this inconvenience.

    If the issue I described here is not the same as yours, could you please help me create a minimal setup to reproduce it?

    Here is the project I used to observe the issue above. It is based on your code, but I made some renames as I debug it. You could create your issue reproducing project based on it.

    In this project, the application logs to UART0 and RTT, and the communication that is tested is over UART1.

    c309490_uart_char_230621_02.zip

    Edit: Fix corrupted attachment. 

    Hieu

  • P.s: I notice that in handling of UART_RX_BUF_REQUEST, you provide the buffers one byte shorter than their actual length. Is there a particular reason for this?

        case UART_RX_BUF_REQUEST:
        {
            if (!rx_buf1.is_in_use) {
                rx_buf1.is_in_use = true;
                uart_rx_buf_rsp(dev, rx_buf1.buffer, sizeof(rx_buf1.buffer) - 1);
            } else if (!rx_buf2.is_in_use) {
                rx_buf2.is_in_use = true;
                uart_rx_buf_rsp(dev, rx_buf2.buffer, sizeof(rx_buf2.buffer) - 1);
            }
    
            /* If we didn't manage to hand over a buffer, we'll get the disabled event. */
            break;
        }

  • No issue with any delays from our side. The work around is working really well and I haven't noticed any issues. We're not streaming lots of data - but we do have a few devices reporting every 10 minutes that have remained working consistently.

    WRT the off-by-one on the buffer size - that's likely left over from me trying to figure out why I was always missing an octet.

    Your description above sounds like you may have reproduced the issue. I'm working on another product which will use the same UART and there will be 50 or so of those deployed over the next couple of months - plenty of test vectors and these ones will have higher bandwidth I believe.

    If you find a better fix, or if you find the underlying issue, I can push updates out to our devices after a bit of testing in our lab. Thanks for your efforts on this bug!

  • Hi dan_opito,

    It has been a while, so I want to give you an update. I have reported the issue internally. However, as developers are away on summer vacations, there has not been any feedback yet.

    On that note, due to the same reasons, we are currently understaffed, and further response will be delayed. My apology for the inconvenience.

  • No problem, Hieu. I appreciate you taking the time to look into this bug for me. In the meantime, my workaround is working great - we've got this on a couple of different products and I haven't noticed any issues.

Related