
Libuarte_async bug(s) - missing data / wrong buffer returned.

Hi,

We've been having problems trying to implement a high-speed uart communication between the nRF52832 and an STM32.

We switched to libuarte so we can use DMA even though our communication protocol has variable-sized packets. Unfortunately, after several seconds of working perfectly, we suddenly start seeing errors. In between good new data blocks, we suddenly get one block that contains old data. So either the data was never written by DMA, or it was written somewhere other than where we are told to read from. It affects only the amount of data that NRF_LIBUARTE_ASYNC_EVT_RX_DATA reported. It looks as if the timeout triggered before the data was actually copied by the DMA, or something along those lines.

We've reduced the transmission speed to 115200, and the error still occurs. We've used a logic analyzer to check if the data is really being transferred - it definitely is, but the nordic is not getting it. We've tried it with SDK 15.0, 15.3 and 16.0, the problem remains.

Since our codebase is way too complex to try and post something helpful here, we tried recreating the issue with two nRF52840 DKs.

Unfortunately, we're not getting the same errors, but are getting another error in the first RX event. First, right after we initialize libuarte_async we get NRF_LIBUARTE_ASYNC_EVT_ERROR errorSrc 0 - this started with SDK 16.0, it wasn't occurring when using SDK 15.x with ported libuarte. Then we get a bunch of skipped bytes reported (check the attached project):

<info> app: Rx 128@x20004B84
<error> app: RX: expected x65, got x39 instead
<error> app: RX: expected x3A, got x3D instead
<error> app: RX: expected x3E, got x41 instead
<error> app: RX: expected x42, got x45 instead
<error> app: RX: expected x46, got x49 instead
<error> app: RX: expected x4A, got x4D instead

Note: We're using arm-none-eabi-gcc v4.9.3 to build the test project, targeting the nRF52840 chip on the development kits. The two kits are connected to each other and run the same firmware, both doing RX and TX using libuarte.

ble_app_libuarte_test.tar.gz

  • I am not aware of any issue, but using the latest SDK is generally recommended.

    My only input is that I would recommend increasing .timeout_us from 100 to 300 (or possibly more for testing). The reasoning is that, depending on other interrupts that may be occurring (causing the UART interrupt to be left pending), you want to make sure that you can always handle the UART interrupt in time before the next one occurs.

    "errorSrc 0" is very odd; maybe you can set a breakpoint on the error and double-check the value. Making sure RXD is always pulled up is also worth checking.

    Best regards,
    Kenneth

  • My only input is that I would recommend increasing .timeout_us from 100 to 300 (or possibly more for testing). The reasoning is that, depending on other interrupts that may be occurring (causing the UART interrupt to be left pending), you want to make sure that you can always handle the UART interrupt in time before the next one occurs.

    So, after more testing, this did indeed solve or work around our issue in the main project. We will set it to 1ms to be safe.

    However, we fail to understand exactly why this causes a problem, and we think it should be possible to fix this in your libuarte implementation. I don't see a reason why a delayed interrupt handler should cause libuarte to give us old data, or why an interrupt would need to be handled before the next one could occur. After all, these are not UART interrupts but timer/counter interrupts, right? None of these should do anything to the UART, which should happily continue receiving.

    To be honest, I did not fully understand the implementation of libuarte, as I have not spent enough time with PPI to really know what you are doing there, and there are very few comments, so I can't see exactly where things are going wrong. I'm assuming these should be your design goals:

    • The counter that counts incoming bytes should not be stopped by the timeout, but only be reset by the buffer full event, or actually it should just reset whenever it reaches 128 (or whatever the buffer size is)
    • The timeout interrupt handler should just read the counter, check how much was copied last time, and tell us where and how much is new in the buffer.
    • The buffer full interrupt should do exactly the same, except it doesn't need to read the counter. Also, it should prepare the next buffer to write to. 
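
    The three goals above can be sketched as a small model. This is only an illustration of the idea, not libuarte's actual implementation: rx_tracker_t, rx_chunk_t and rx_tracker_next are hypothetical names, the byte counter is a plain variable standing in for a hardware counter, and the 128-byte buffer size is just the example value used above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative size; must be a power of two so that the modulo below
 * stays correct when the 32-bit counters wrap around. */
#define RX_BUF_SIZE 128u

typedef struct {
    uint32_t rx_count;  /* free-running count of received bytes (never stopped) */
    uint32_t reported;  /* bytes already handed to the application */
} rx_tracker_t;

typedef struct {
    uint32_t buf_seq;   /* sequence number of the buffer holding the data */
    uint32_t offset;    /* start of the new data inside that buffer */
    uint32_t len;       /* number of new bytes; 0 means nothing to report */
} rx_chunk_t;

/* Shared by the timeout handler and the buffer-full handler.
 * Everything is derived from the two counters, so the result does not
 * depend on which interrupt fired or in which order they are served. */
static rx_chunk_t rx_tracker_next(rx_tracker_t *t)
{
    assert(t != NULL);

    uint32_t pending = t->rx_count - t->reported;   /* unsigned math is wrap-safe */
    uint32_t offset  = t->reported % RX_BUF_SIZE;
    uint32_t room    = RX_BUF_SIZE - offset;        /* bytes left in this buffer */
    uint32_t len     = (pending < room) ? pending : room;

    rx_chunk_t chunk = { t->reported / RX_BUF_SIZE, offset, len };
    t->reported += len;  /* a chunk never crosses a buffer boundary */
    return chunk;
}
```

    A handler would call rx_tracker_next() until it returns a length of 0. If a buffer filled up while the handler was delayed, the first call returns the rest of the old buffer and the next call returns the start of the new one.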

    My assumption is that things start going wrong if we're getting several timeout interrupts, as well as a buffer full interrupt before the first one can be handled, and that this causes one of the handlers to read data from the wrong buffer.

    As far as we can see, the data we're seeing in a case like this is several hundred bytes old, so it seems like the counter is read correctly, but the data is read from a previously used buffer instead of the new one. (If it were using the wrong counter value and reading from the right buffer, the data would be less than 128 bytes old.)

    Basically my question to Nordic is: does libuarte correctly handle the case where one or more timeout interrupts and buffer full interrupts are delayed by the softdevice, and in the worst case the buffer full interrupt is handled before the timeout interrupt?

    And that last part raises the question: if two interrupts have the same priority, is it guaranteed that they are handled in the order they occurred, or do peripherals in the same priority group have an internal priority (e.g. timer0 always being handled before timer1)? Does the DMA buffer full interrupt have a higher priority than the timeout interrupt?

    My worry is that our workaround of setting such a long timeout only makes this case much rarer, so we're not seeing it during testing, but that it might still occur at some point after all.

  • Hi,

    If my assumption is anywhere near correct, it would require that the incoming data from the STM32 is not back to back, but may actually have sporadic delays of tens of us between bytes. If that is the case, I think you could get two timeouts of 100us almost back to back. Do you have any indication that this may be happening when you look at the data from the STM32?
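
    To put "tens of us" in perspective, here is a minimal sketch of the arithmetic. It assumes 8N1 framing (10 bits per byte on the wire) and assumes that the timeout restarts on every received byte; the function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Time one UART frame occupies on the wire, in nanoseconds.
 * Assumes 8N1: 1 start bit + 8 data bits + 1 stop bit = 10 bits. */
static uint32_t uart_frame_time_ns(uint32_t baud)
{
    assert(baud > 0);
    return (uint32_t)((10ULL * 1000000000ULL) / baud);
}

/* Largest idle gap (ns) between two bytes that does NOT yet let the
 * timeout expire, assuming the timeout restarts on every received byte.
 * Returns 0 if the timeout is not longer than a single frame. */
static uint32_t idle_margin_ns(uint32_t baud, uint32_t timeout_us)
{
    uint64_t timeout_ns = (uint64_t)timeout_us * 1000ULL;
    uint32_t frame_ns   = uart_frame_time_ns(baud);
    return (timeout_ns > frame_ns) ? (uint32_t)(timeout_ns - frame_ns) : 0u;
}
```

    Under these assumptions, one byte at 115200 baud takes about 86.8 us, so a 100 us timeout leaves only about 13 us of slack between bytes, while 300 us leaves a little over 213 us.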

    Kenneth

  • Yes, this is quite possible. While testing, we did see some cases where the data was transferred in smaller blocks:

    Rx 17 -> 955
    Rx 13 -> 972
    Rx 16 -> 985
    Rx 16 -> 1001
    Rx 7 -> 1017
    Rx 6 -> 0
    Rx 20 -> 6
    Rx 22 -> 26
    Rx 15 -> 48
    Rx 21 -> 63
    Rx 17 -> 84
    Rx 13 -> 101
    Rx 14 -> 114
    Rx 11 -> 128
    Rx 13 -> 139
    Rx 13 -> 152
    Rx 13 -> 165
    Rx 15 -> 178
    Rx 25 -> 193
    Rx 13 -> 218
    Rx 13 -> 231
    Rx 12 -> 244
    Rx 3 -> 256
    Rx 18 -> 259
    Rx 17 -> 277
    Rx 16 -> 294
    Rx 17 -> 310
    Rx 15 -> 327

    Even though one packet that we transmit is 255 bytes. 
    We are transmitting everything available using DMA on the STM as well, so this case only occurs while our firmware is filling the TX buffer at exactly the same time that DMA is transferring the data. Since we fill the buffer with data we are getting from USB, it is possible that their interrupts are delayed and that there are pauses between those small 3-25 byte chunks of data, and there is a chance they're longer than 100us.

    So yes, it's quite possible that there are almost back to back timeouts of 100us, but I think that libuarte should be able to handle that too, even if both back to back timeout handlers are delayed. 
    Let's just take the first three chunks, and say there was a 150us delay between each pair, and that the timeout handler only started in the middle of the 16 byte chunk.

    I think the counter should just keep counting, so when the handler reads it, it would be at 17+13+8=38 bytes. This is what it should tell us in the callback; we copy that data and tell it to free it. So it moves the pointer by 38 bytes. Immediately afterwards the second handler is called and reads the counter, which is still at 38 or maybe 39 bytes by now, so it returns 0 or 1 byte, then frees that 1 byte and exits.

    I don't see a reason why the second interrupt handler should somehow fail...
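
    The behaviour I would expect from the two delayed handlers can be written down in a few lines. This is a sketch of the expectation only, not of what libuarte does internally; report_new_bytes and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timeout handler body: report whatever arrived since the
 * previous report, then mark it as consumed.
 *   counter_now  value of the free-running byte counter when the handler runs
 *   consumed     number of bytes the application has already copied and freed
 * Returns the number of new bytes to pass to the callback. */
static uint32_t report_new_bytes(uint32_t counter_now, uint32_t *consumed)
{
    assert(consumed != NULL);

    uint32_t fresh = counter_now - *consumed;  /* unsigned math survives counter wrap */
    *consumed = counter_now;                   /* application copied and freed these bytes */
    return fresh;
}
```

    With the numbers from the example, the first delayed handler sees the counter at 17 + 13 + 8 and reports all of those bytes. The second handler runs right after it, sees at most one more byte, and reports 0 or 1. Neither call needs to know how many timeouts were pending.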
