BLE data relay latency issues

Hello,

I am in the process of writing my bachelor's thesis about using BLE to relay data over multiple hops.

In my test setup, I have five nRF52840s that connect to each other and form a linear "network". I can then send data to the first relay, and it gets relayed through all the nRFs to the last one.

The problem is that I'm currently trying to evaluate the latency of this system and have hit a roadblock. I cannot figure out why the latency behaves the way it does:

Sometimes I get low latency results of about 100 ms, and sometimes very high results of more than 1 s. This is all with the exact same setup, sometimes even without disconnecting between the tests.

The strange thing is that the lows and highs are not evenly distributed. Instead, the latency seems to stay low for a while, then gradually gets better or worse, and then the pattern repeats (although it's really hard to say exactly how it varies, and it sometimes gets worse very suddenly too).

I tried this with various connection parameters but always observed the same behaviour. It could also be a bug in my software, but I don't think so, though I can't be sure.

The payload I'm testing with is 1.5 kB in size, and I'm feeding the data from an Android phone into the first relay.

I would greatly appreciate it if someone could share their thoughts on this, as I'm out of ideas on what to try next.

Thanks,

Jonas

  • Hi Jonas,

    Can you say more about how you transfer the data?

    • Is it as notifications or another method?
    • What is the connection interval?
    • What is the event length?
    • How long are the packets you use?
    • Are multiple packets typically sent per connection event?

    Also, do you have a sniffer trace of this, covering both the good case (where you have low latency) and the bad case (where you have high latency)?

    Without knowing more, my gut feeling is that you are using fairly long connection intervals and event lengths, and sending multiple packets per connection event. Is that so? If so, packet loss could explain what you are seeing: a lost packet ends the connection event, and the retransmission happens in the next connection event.
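The reasoning above can be put into rough numbers. The sketch below is a simplified model, not a description of how any particular controller schedules traffic: it assumes a fixed number of packets fits into each connection event and that every event cut short by a lost packet costs one additional connection interval. The function names and the example figures are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Connection events needed to move `packets` link-layer packets when at
 * most `packets_per_event` of them fit into one connection event. */
uint32_t events_needed(uint32_t packets, uint32_t packets_per_event)
{
    assert(packets_per_event > 0);
    return (packets + packets_per_event - 1) / packets_per_event;
}

/* Rough per-hop transfer time in milliseconds.
 * Assumption: each event that is ended early by a lost packet pushes the
 * remaining data back by exactly one connection interval. */
uint32_t hop_time_ms(uint32_t packets, uint32_t packets_per_event,
                     uint32_t interrupted_events, uint32_t interval_ms)
{
    uint32_t events = events_needed(packets, packets_per_event);
    return (events + interrupted_events) * interval_ms;
}
```

For example, with a 90 ms interval, `hop_time_ms(7, 6, 0, 90)` gives 180 ms for one hop, while three interrupted events (`hop_time_ms(7, 6, 3, 90)`) give 450 ms. Over several hops such delays add up.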

    Einar 

  • latencytest_5relays.zip

    In the attached ZIP file you will find four different Wireshark captures and a PulseView capture.

    The Wireshark logs show the connections between all the relays, and the PulseView log shows the latency / propagation delay of each relay (rising edge = start of receiving the test payload, falling edge = test payload complete).

    The Wireshark logs and the PulseView trace show the same test, except that the first time the payload was sent it was not logged with PulseView. From the second time on, PulseView and Wireshark are in sync.

    If you look at packet no. 10436 in relay3_4 and packet no. 11460 in relay2_3 (which should be payload #62 if I didn't miscount), I think I can see exactly what you suspected: the relay2_3 transmission was fast according to PulseView, and I cannot see any retransmissions. The relay3_4 transmission, however, was slow, and I can see multiple retransmissions. I would very much appreciate it if you could verify my findings, though, as I'm not really experienced with BLE sniffing.

    If all of the above is correct, then I think packet loss is indeed the cause of my latency. This makes sense now that I think more about it: I observed the latency getting worse and worse over time and then better and better again, which I would now explain as the connection intervals drifting towards each other until they spontaneously synchronise and latency is at its worst, after which it gradually improves. Very interesting!

    One question remains, though: is there a way to optimize this behaviour and reduce the amount of packet loss in that situation?
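To get a feeling for how slowly such a drift pattern would repeat, here is a minimal sketch. It assumes two links with the same nominal connection interval whose timing is driven by two different clocks, and it only computes how long it takes for one schedule to slide by a full interval relative to the other. The function name and the 20 ppm figure used below are illustrative assumptions, not measured values from this setup.

```c
#include <assert.h>

/* Time in seconds for two periodic schedules with the same nominal
 * interval to drift by one full interval relative to each other.
 * relative_drift_ppm is the clock frequency difference in parts per
 * million (an assumed value, to be replaced by a measured one). */
double drift_period_s(double interval_ms, double relative_drift_ppm)
{
    assert(relative_drift_ppm > 0.0);
    return (interval_ms / 1000.0) / (relative_drift_ppm * 1e-6);
}
```

With a 90 ms interval and an assumed 20 ppm difference, `drift_period_s(90.0, 20.0)` is about 4500 s, so under this model the two schedules would line up again only after roughly 75 minutes. If the observed good/bad cycle is much shorter or longer than what this gives for realistic clock tolerances, drift alone is probably not the whole explanation.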

    Kind regards,

    Jonas

  • Hi Jonas,

    You can never avoid all packet loss, particularly not in a congested RF environment. Bluetooth LE changes channel every connection event, though, so unless you have a very high number of devices that send a lot of data I would not expect that to be the main problem. Also, with your connection interval being less than 100 ms, I do not see why losing a few packets would suddenly lead to a 1 second latency. Perhaps there is another issue in your setup, or with how you measure the latency?

  • According to the Wireshark logs, there are so many lost packets that it sometimes adds up to about a second, but I'm not sure about this, as lowering the transmit power and even separating the devices into different rooms didn't really change anything. It's extremely frustrating. Sometimes I can measure a hundred times without any problem, and then suddenly the delay goes way up. I don't understand it.

  • Are you in an apartment block or similar, where you have neighbours heavily using a microwave oven? That creates a lot of noise in the entire 2.4 GHz ISM band. To check whether this really is the case, perhaps you could try repeating the experiment somewhere with less 2.4 GHz traffic?

  • I have done some further research, and I now know that the big delay comes from the time between calling bt_gatt_write_without_response_cb and the successful transmission of the data.

    [00:05:30.102,783] <inf> bluart: write_blocking: transmitting 1019 bytes
    [00:05:30.103,118] <inf> bluart: write_blocking: transmitting 5 bytes
    [00:05:30.103,302] <inf> bluart: write_blocking: transmitting 556 bytes
    [00:05:30.425,659] <inf> bluart: complete_test: TX complete
    [00:05:30.426,177] <inf> bluart: complete_test: TX complete
    [00:05:30.646,484] <inf> bluart: complete_test: TX complete
    

    As you can see in this log, there is a total of more than 500 ms between calling the write function and the execution of the TX complete callback. This is still with a 90 ms connection interval, and I changed the code so that every relay waits for the full payload and only relays it once it is fully received, to limit the concurrent on-air traffic.

    The question now is: why is there such a long time between calling write and getting the complete callback? To me this still looks like a transmission issue, which would confirm my earlier finding that it is due to interference. Do you think so too? Or do you know of any other causes that could produce a delay there?
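The delay can be read directly off the log timestamps. The helper below is a small sketch that assumes the timestamp format seen in the log above, "[hh:mm:ss.mmm,uuu]", and converts it to microseconds so that two lines can be subtracted. The function name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Parse a log timestamp of the form "[hh:mm:ss.mmm,uuu]" at the start of
 * a line and return it in microseconds, or -1 if the line does not start
 * with such a timestamp. */
int64_t log_ts_to_us(const char *line)
{
    unsigned int h, m, s, ms, us;

    assert(line != NULL);
    if (sscanf(line, "[%u:%u:%u.%u,%u]", &h, &m, &s, &ms, &us) != 5) {
        return -1;
    }
    return ((((int64_t)h * 60 + m) * 60 + s) * 1000 + ms) * 1000 + us;
}
```

Applied to the log above, the first write call (30.102,783) and the last TX complete (30.646,484) are 543,701 µs apart, which is about six 90 ms intervals. The first TX complete (30.425,659) arrives 322,876 µs after the first write call.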

  • I managed to log another long delay and capture it with Wireshark at the same time:

    CONFIG_BT_PERIPHERAL_PREF_MIN_INT=90
    CONFIG_BT_PERIPHERAL_PREF_MAX_INT=90
    CONFIG_BT_PERIPHERAL_PREF_LATENCY=0
    CONFIG_BT_CTLR_DATA_LENGTH_MAX=251
    CONFIG_BT_PERIPHERAL_PREF_TIMEOUT=500
    CONFIG_BT_L2CAP_TX_MTU=1024
    # MTU + 4
    CONFIG_BT_BUF_ACL_RX_SIZE=1028

    Can you spot anything abnormal? To me this still looks like interference... but again, I'm not a pro at this.
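One thing that may be worth double-checking in the configuration above is the unit of the values. Zephyr's Kconfig help describes the CONFIG_BT_PERIPHERAL_PREF_MIN_INT / MAX_INT options in units of 1.25 ms and CONFIG_BT_PERIPHERAL_PREF_TIMEOUT in units of 10 ms (please verify this against the Zephyr version in use). The sketch below only does the unit conversion; the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Connection interval: units of 1.25 ms -> microseconds.
 * The Bluetooth specification allows 6..3200 units (7.5 ms to 4 s). */
uint32_t conn_interval_units_to_us(uint16_t units)
{
    assert(units >= 6 && units <= 3200);
    return (uint32_t)units * 1250u;
}

/* Connection interval: milliseconds -> units of 1.25 ms (rounded down). */
uint16_t conn_interval_ms_to_units(uint32_t ms)
{
    return (uint16_t)((ms * 1000u) / 1250u);
}

/* Supervision timeout: units of 10 ms -> milliseconds. */
uint32_t supervision_timeout_units_to_ms(uint16_t units)
{
    return (uint32_t)units * 10u;
}
```

Under that assumption, a value of 90 corresponds to 112.5 ms rather than 90 ms, a 90 ms interval would be a value of 72, and a timeout value of 500 corresponds to 5 s. The interval actually in use is best confirmed in the sniffer trace, since the central decides the final connection parameters.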

    Also, why are there so many small fragments? I put 1 kB of data into the Zephyr write function and have set the data length max to 251, so why are there such small packets?
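As a baseline for the fragment question, the number of link-layer packets to expect for one write can be estimated as follows. This is a minimal sketch that assumes a Write Without Response (3-byte ATT header: opcode plus attribute handle) carried in a single L2CAP PDU (4-byte header) and split into link-layer payloads of a given maximum size. The function name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define L2CAP_HDR_LEN         4u /* L2CAP length + channel ID */
#define ATT_WRITE_CMD_HDR_LEN 3u /* ATT opcode + attribute handle */

/* Link-layer packets needed for one Write Without Response carrying
 * value_len bytes, when each packet holds at most ll_payload_max bytes. */
uint32_t ll_fragments(uint32_t value_len, uint32_t ll_payload_max)
{
    assert(ll_payload_max > 0);
    uint32_t l2cap_pdu = value_len + ATT_WRITE_CMD_HDR_LEN + L2CAP_HDR_LEN;
    return (l2cap_pdu + ll_payload_max - 1) / ll_payload_max;
}
```

For the 1019-byte write in the log, `ll_fragments(1019, 251)` gives 5 packets, whereas a 27-byte limit (`ll_fragments(1019, 27)`) would give 38. If the sniffer shows many more and much smaller fragments than the 251-byte case predicts, some limit along the path is probably smaller than 251 bytes. Candidates worth checking are the negotiated data length on that link and the TX-side ACL buffer size (CONFIG_BT_BUF_ACL_TX_SIZE), but that is an assumption and not something confirmed in this thread.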
