NCS Mesh: Relay function delays response messages

We are facing the problem that when the relay feature is enabled on a mesh device (running nRF Connect SDK firmware), the device's responses are delayed by several seconds.

To reproduce the problem, we used the light example project from NCS v1.8.0 and generated additional traffic using another device, which transmits an unsegmented packet every 200ms. In between that traffic, TTL packets are transmitted, and the responses to them are delayed.

If the relay feature is off, everything seems to work as expected.

  • Hi,

    Could you elaborate on what you mean by "enabling the relay feature"? Bluetooth Mesh uses message relaying to send messages from device to device, so this "feature" will always be present when using the protocol: https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/nrf/ug_bt_mesh_concepts.html#relays

    The flooding-based message relay causes a lot of redundant traffic that may impact the throughput and reliability of the network, so if you're generating additional traffic and/or flooding the frequency with too much data, there will be some delay.

    Kind regards,
    Andreas

  • Hi Andreas, 

    We are aware that relaying messages is a common task in Bluetooth mesh. But you can enable and disable this feature on a device to reduce the flooding.

    Our observation was as follows:

    In a setup with multiple devices, none of which is relaying (all are in direct range of each other), we increased the traffic to approximately 5 messages/second by simply forcing one of the devices to send out a message every 200ms (NOTE: no device is subscribed to these messages, so they should be handled on the network layer only).

    A simple GET request to a device that supports the DTT-server model is answered almost immediately (30-100ms delay), which is fine.

    If we activate the relay feature on exactly the device that we poll (which means that it now forwards one message every 200ms), then the STATUS sent in response to a simple GET request is delayed more and more until the delay reaches about 8 seconds.

    I cannot imagine that this is the standard behavior. Relaying a message every 200ms should never have such a drastic impact.

  • Hi,

    Thank you for elaborating on this and explaining the setup a bit more. I will look into this and discuss these numbers with the Mesh team. I will get back to you as soon as we land on anything conclusive or if we need more information.

    Kind regards,
    Andreas

  • Hi,

    The first thing we want you to check is whether you observe the same behavior with NCS v2.2.0. In older versions there might be some delays depending on the traffic.

    Kind regards,
    Andreas

  • Hi Andreas,

    We have compared various NCS versions. The newer the version, the better the results, but even in NCS v2.2.0 there are still some delays.

    Summary of the results (same setup as described before; delays are measured by sending a Config Default TTL Get message and receiving the Config Default TTL Status message):

    NCS v1.8.0: delay of up to 8 seconds between the Get and Status messages. The delay increases over a period of 25 seconds after enabling the relay feature, then stays constant at 8 seconds.

    NCS v2.1.0: delay of approximately 1.3 seconds, plus some missing STATUS messages (no reply at all).

    NCS v2.2.0: reduced delay, but varying randomly between 30ms (immediate) and 900ms; the most frequent values are in the range from 100ms to 350ms.

    NCS v2.2.0 / double traffic (1 message every 100ms): the delay increases to a range of about 2-3 seconds. It also seems that an occasional STATUS message gets lost (no reply), even if the relay is deactivated (but we have to investigate this a bit more; maybe our scanning device is missing something).

    NCS v2.1.0 / double traffic (1 message every 100ms): no change in behavior.

    FYI: we use the following network parameters:

    network transmit and relay retransmit: 1 retransmission after 40ms

    Kind regards

  • urieder said:
    We have compared various NCS versions. The newer the version, the better the results, but even in NCS v2.2.0 there are still some delays.

    Thank you for sharing the results for the different versions.

    One more thing that came up when discussing your results just now was the Publish Retransmit Count (typically set to 1 retransmit, meaning each message's contents are sent a total of two times, i.e. as two separate messages). It could be that a buffer containing outbound packets fills up due to a high retransmit count, causing a longer and longer delay until the buffer is full, at which point you get a constant delay (but see some packet loss).

    Can you check how large this number is configured to be in your setup, and change it if it's too high?
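
    To illustrate the kind of behavior this hypothesis would produce, here is a toy sketch of a FIFO outbound buffer. It is not how the Zephyr advertiser is implemented, and every number in it (the 240ms time to send one packet with its retransmissions, the 20 buffer slots) is a made-up assumption rather than a measured value. It only shows that when packets arrive slightly faster than they can be sent, the waiting time of a newly queued packet grows steadily and then levels off once the buffer is full and packets start being dropped.

```python
from collections import deque


def simulate_relay_queue(arrival_ms, service_ms, buffer_slots, duration_ms):
    """Toy FIFO model: one packet arrives every arrival_ms, sending takes service_ms.

    Returns (samples, dropped), where samples is a list of
    (arrival time, waiting time before transmission starts) for accepted packets.
    """
    pending = deque()  # finish times of packets still queued or being sent
    samples = []
    dropped = 0
    busy_until = 0     # time at which the transmitter becomes free
    for t in range(0, duration_ms, arrival_ms):
        # Forget packets whose transmission has already finished.
        while pending and pending[0] <= t:
            pending.popleft()
        if len(pending) >= buffer_slots:
            dropped += 1  # buffer full: packet is lost
            continue
        start = max(t, busy_until)
        busy_until = start + service_ms
        pending.append(busy_until)
        samples.append((t, start - t))
    return samples, dropped


# Hypothetical numbers: a packet every 200 ms, 240 ms needed to send each one.
samples, dropped = simulate_relay_queue(200, 240, 20, 60_000)
for t, wait in samples[::25]:
    print(f"t={t / 1000:5.1f} s  wait={wait} ms")
print("dropped packets:", dropped)
```

    With a service time below the arrival interval the same model shows no waiting and no drops, which would match the behavior seen with the relay switched off.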

    Kind regards,
    Andreas

  • Hi,

    As mentioned at the end of my last report, we typically use 1 retransmission, sometimes 2, but not more. From a theoretical point of view, 1 retransmission (40ms delay) or 2 retransmissions (20ms delay) should be fine and should not cause any buffer to fill up.

    Just to be clear, Publish Retransmit Count (4.2.2.6 in the Mesh Profile spec) is not the same as Network Retransmit Count (4.2.19.1 in the Mesh Profile spec) and Relay Retransmit Count (4.2.20.1 in the Mesh Profile spec). Publish Retransmit Count is on the access layer, and the other two retransmit counts are on the network/transport layer.

    Network retransmit determines the number of times a message is sent. If the count is 1, the sender sends the message 2 times. With relay retransmit count = 1, the relay node relays the message two times, sending 2 messages from the relay node to the receiver. With publish retransmit count = 1, the entire procedure is repeated one more time, effectively doubling the number of messages sent.
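
    As a rough way to check the multiplication described above, the counts can be written out as a small sketch. The function name and the `relays` parameter are made up for illustration, and the sketch assumes each relay hears every publication and relays it once per publication (repeated copies of the same network PDU are assumed to be filtered out).

```python
def transmissions_per_publication(network_transmit_count,
                                  relay_retransmit_count,
                                  publish_retransmit_count,
                                  relays=1):
    """Return (packets sent by the sender, packets sent by the relays)
    for one published message, given the three retransmit counts."""
    # A publish retransmit count of N repeats the whole procedure N more times.
    publications = publish_retransmit_count + 1
    from_sender = publications * (network_transmit_count + 1)
    from_relays = publications * relays * (relay_retransmit_count + 1)
    return from_sender, from_relays


# Network transmit = 1, relay retransmit = 1, no publish retransmit.
print(transmissions_per_publication(1, 1, 0))
# Same, but with publish retransmit count = 1: everything doubles.
print(transmissions_per_publication(1, 1, 1))
```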

    If you have a higher Publish Retransmit Count than you expect, you will be pushing, if not exceeding, the limit of what the throughput can handle, and you will most likely flood the buffer. There is also a throughput limit on how many messages you can send per 10s sliding window, which is 100 network PDUs per window (Mesh Profile spec v1.0.1 sec 2.3.9.4).

    Edit: In addition, per 3.7.4.1: "Due to limited bandwidth available that is shared among all nodes and other Bluetooth devices, it is important to observe the volume of traffic a node is originating. A node should originate less than 100 Lower Transport PDUs in a moving 10-second window." This means that if you have a 100ms publish interval and the publish retransmit count is larger than 0, the node is capped at sending one message every 200ms (since every message leads to two messages on 2 lower transport PDUs), or must not send a publish retransmit at all.
    Edit end.
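
    The arithmetic behind that cap can be sketched as follows. This is only a back-of-the-envelope helper with a made-up name; it assumes unsegmented messages (one Lower Transport PDU per message) and that each publish retransmission counts as one more originated PDU.

```python
def originated_pdus_per_window(publish_interval_ms,
                               publish_retransmit_count,
                               window_ms=10_000):
    """Lower Transport PDUs a periodically publishing node originates per window."""
    publications = window_ms // publish_interval_ms
    return publications * (publish_retransmit_count + 1)


LIMIT = 100  # figure quoted from the spec text above

for interval_ms, retransmits in [(100, 0), (100, 1), (200, 1)]:
    pdus = originated_pdus_per_window(interval_ms, retransmits)
    print(f"{interval_ms} ms interval, {retransmits} publish retransmit(s): "
          f"{pdus} PDUs per 10 s (limit {LIMIT})")
```

    A 100ms interval with one publish retransmit comes out at twice the quoted figure, while a 200ms interval with one retransmit, or a 100ms interval with none, lands right at it.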

    There are options to increase the throughput by using advertisement extension features such as extended advertising and multiple advertisement sets, but we will have to look into how to do that for a setup that does not use relaying if that ever becomes a requirement. For a solution using relaying, the following should allow you to increase the throughput with the two mentioned extension features:

    Enable CONFIG_BT_EXT_ADV

    Enable CONFIG_BT_MESH_ADV_EXT

    Increase CONFIG_BT_EXT_ADV_MAX_ADV_SETS

    Increase CONFIG_BT_MESH_RELAY_ADV_SETS

    Kind regards,
    Andreas
