Zephyr UART Async receive drops a single character

I'm using the async API for UART and I have noticed that, shortly after initialisation, I receive a single character with the UART_RX_READY event. This character is 0xff - and I suspect that's just the GPIO lines changing state. This character is easy to ignore, however it seems to break the internal state of the RX API. All future calls will have an offset of +1, but the total length (of all API requests in aggregate) will be correct. This causes us to ignore the first character in our buffer, and take one more than we should (not out-of-bounds, but stale/uninitialised memory).

I suspect we're getting an interrupt that we received data during the initialisation process, and that's throwing us off.

To fix this, I wait for the first UART_RX_READY event, call uart_rx_disable(), and then wait for the UART_RX_DISABLED message. Once that's done, I re-enable with uart_rx_enable() and the system seems to work reliably. I've got four of these devices in the field reading a couple thousand octets per day, and so far it seems to be working just fine!

I'm using SDK v2.2.0, but didn't see any relevant changes in the changelog.

Here's some partial code which has the complete TX/RX path:

static int setup(void) {
    int rc;

    LOG_DBG("Initialising serial tunnel...");

    rc = gpio_pin_configure_dt(&tunnel_power_pin_spec, GPIO_OUTPUT_INACTIVE);
    if (rc < 0)
    {
        LOG_ERR("Failed to configure power control pin.");
        return rc;
    }

    rc = gpio_pin_configure_dt(&tunnel_power_led_spec, GPIO_OUTPUT_INACTIVE);
    if (rc < 0)
    {
        LOG_ERR("Failed to configure power LED.");
        return rc;
    }

    if (!device_is_ready(tunnel_port))
    {
        LOG_ERR("Serial tunnel device is not ready!");
        return -ENODEV;
    }

    rc = uart_callback_set(tunnel_port, uart_callback, NULL);
    if (rc < 0)
    {
        LOG_ERR("Failed to set UART callback.");
        return rc;
    }

    /* Set up the initial receive buffer. */
    rx_buf1.is_in_use = true;
    rx_buf2.is_in_use = false;
    memset(rx_buf1.buffer, '?', sizeof(rx_buf1.buffer));
    memset(rx_buf2.buffer, '?', sizeof(rx_buf2.buffer));
    rc = uart_rx_enable(tunnel_port, rx_buf1.buffer, sizeof(rx_buf1.buffer), CONFIG_OPITO_SERIAL_TUNNEL_RECEIVE_TIMEOUT_US);
    if (rc < 0) {
        LOG_ERR("Failed to enable UART RX.");
        return rc;
    }

    return 0;
}


static void uart_callback(const struct device *dev, struct uart_event *evt, void *user_data) {
    /* There's a strange bug with the DMA backing this async UART driver.
     * We get a single octet on boot (likely caused by a wiggle on the UART
     * line), and that introduces an off-by-one error that never goes away.
     * The fix seems to be resetting the UART after we get that character.
     */
    static bool destroyed_first_message = false;

    switch (evt->type) {
    case UART_RX_RDY:
    {
        if (!destroyed_first_message) {
            uart_rx_disable(dev);
            return;
        }

        struct opito_serial_tunnel_module_event *msg = new_opito_serial_tunnel_module_event();

        const uint8_t *ptr = &evt->data.rx.buf[evt->data.rx.offset];
        memcpy(msg->data.data_raw.buffer, ptr, evt->data.rx.len);
        msg->data.data_raw.length = evt->data.rx.len;
        msg->data.data_raw.timestamp = k_uptime_get();
        msg->type = OPITO_EVT_SERIAL_TUNNEL_DEVICE_DATA_RX;
        APP_EVENT_SUBMIT(msg);
        break;
    }

    case UART_RX_BUF_REQUEST:
    {
        if (!rx_buf1.is_in_use) {
            rx_buf1.is_in_use = true;
            uart_rx_buf_rsp(dev, rx_buf1.buffer, sizeof(rx_buf1.buffer) - 1);
        } else if (!rx_buf2.is_in_use) {
            rx_buf2.is_in_use = true;
            uart_rx_buf_rsp(dev, rx_buf2.buffer, sizeof(rx_buf2.buffer) - 1);
        }

        /* If we didn't manage to hand over a buffer, we'll get the disabled event. */
        break;
    }

    case UART_RX_BUF_RELEASED:
        struct rx_buf *buf = CONTAINER_OF(evt->data.rx_buf.buf, struct rx_buf, buffer);
        buf->is_in_use = false;
        break;

    case UART_RX_DISABLED:
        rx_buf1.is_in_use = true;
        rx_buf2.is_in_use = false;
        uart_rx_enable(tunnel_port, rx_buf1.buffer, sizeof(rx_buf1.buffer), CONFIG_OPITO_SERIAL_TUNNEL_RECEIVE_TIMEOUT_US);

        if (!destroyed_first_message) {
            destroyed_first_message = true;
        } else {
            /* This is buffer overrun - as we weren't able to pass another buffer to the UART! */
            SEND_ERROR(opito_serial_tunnel, OPITO_EVT_SERIAL_TUNNEL_ERROR, -ENOBUFS);
        }
        break;

    default:
        break;
    }
}
Let me know if I've missed anything! For now, our workaround is fine - but it seems like a bug that will bite again in future :)

  • No issue with any delays from our side. The work around is working really well and I haven't noticed any issues. We're not streaming lots of data - but we do have a few devices reporting every 10 minutes that have remained working consistently.

    WRT the off-by-one on the buffer size - that's likely left over from me trying to figure out why I was always missing an octet.

    Your description above sounds like you may have reproduced the issue. I'm working on another product which will use the same UART and there will be 50 or so of those deployed over the next couple of months - plenty of test vectors and these ones will have higher bandwidth I believe.

    If you find a better fix, or if you find the underlying issue, I can push updates out to our devices after a bit of testing in our lab. Thanks for your efforts on this bug!

  • Hi dan_opito,

    It has been a while, so I want to give you an update. I have reported the issue internally. However, as developers are away on summer vacations, there has not been any feedback yet.

    On that note, due to the same reasons, we are currently understaffed, and further response will be delayed. My apology for the inconvenience.

  • No problem, Hieu. I appreciate you taking the time to look into this bug for me. In the meantime, my workaround is working great - we've got this on a couple of different products and I haven't noticed any issues.

  • I think I've finally seen this issue in the field, even with my fix/hack. we have a hundred or so products in the field that all use this uart method - and I noticed the other day that exactly one of these has stopped working. Upon closer inspection, it is sending the correct number of characters but with an offset of two.

    The code in my initial message is still what I am using today - and until I spotted this device (which died two days ago) things have been working well.

    Can I get an update on where this is at?

  • Hi Dan,

    I am sorry to hear that the issue showed up again.

    We have registered it internally, but on DevZone, I cannot comment on the progress or the expected fix date.

    If this is critical, then could you please have a word with your local Nordic Regional Sales Manager?

Related