NCS Mesh: Relay function delays response messages

We are facing a problem: when the relay feature is enabled on a mesh device (running nRF Connect SDK firmware), the device's responses are delayed by several seconds.

To reproduce the problem, we used the light example project from NCS v1.8.0 and generated additional traffic with another device that transmits an unsegmented packet every 200 ms. In between that traffic, TTL packets are transmitted, and the responses to them are delayed.

If the relay feature is off, everything seems to work as expected.

  • Hi,

    Could you elaborate on what you mean by "enabling the relay feature"? Bluetooth Mesh uses message relaying to send messages from device to device, so this "feature" will always be present when using the protocol: https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/nrf/ug_bt_mesh_concepts.html#relays

    The flooding based message relay will cause a lot of redundant traffic that may impact the throughput and reliability of the network, so if you're generating additional traffic and/or flood the frequency with too much data there will be some delay.

    Kind regards,
    Andreas

  • Hi Andreas, 

    we are aware that relaying messages is a common task in Bluetooth mesh. But you can enable and disable this feature on a device to reduce the flooding.

    Our observation was as follows:

    In a setup with multiple devices, none of which is relaying (all are in direct range of each other), we increased the traffic to approximately 5 messages/second by simply forcing one of the devices to send out a message every 200 ms (NOTE: no device is subscribed to these messages, so they should be handled on the network layer only).

    A simple GET request to a device that supports the DTT-server model is answered immediately with nearly no delay (30-100ms) which is fine.

    If we activate the relay feature on exactly the device that we poll (which means that it now forwards one message every 200 ms), then the STATUS sent in response to a simple GET request is delayed more and more until the delay reaches about 8 seconds.

    I cannot imagine that this is the standard behavior. Relaying a message every 200 ms should never have such a crazy impact.

  • urieder said:
    we have compared various NCS versions. The newer the version, the better the results, but even in the NCS v2.2.0 there are still some delays.

    Thank you for sharing the results for the different versions

    One more thing that came up when discussing your results just now was the Publish Retransmit Count (typically set to 1 retransmit, meaning each message's contents are sent a total of two times, i.e. as two separate messages). It could be that a buffer containing outbound packets fills up due to a high retransmit number, causing a longer and longer delay until the buffer is full, at which point you get a constant delay (but see some packet loss).

    Can you check how large this number is configured to be in your setup, and lower it if it's too high?
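
    The buffer hypothesis above can be pictured with a toy queue model. This is only a sketch under assumptions: the function name `backlog_after_ms` and all numeric values used with it are hypothetical and chosen for illustration; they are not measurements of the NCS mesh stack, and the model does not claim to describe how the stack actually schedules advertisements.

    ```c
    #include <stdint.h>

    /*
     * Toy model (assumption, not the NCS implementation): packets arrive every
     * arrival_ms, each one occupies the radio for service_ms, and the outbound
     * buffer holds at most `capacity` packets. Returns the queued work in ms
     * right after the given number of arrivals, which approximates how long a
     * newly queued response would wait.
     */
    static inline uint32_t backlog_after_ms(uint32_t arrival_ms, uint32_t service_ms,
                                            uint32_t capacity, uint32_t arrivals)
    {
        uint32_t backlog_ms = 0U;

        for (uint32_t i = 0U; i < arrivals; i++) {
            /* The radio drains the queue during the gap since the last arrival. */
            if (i > 0U) {
                backlog_ms = (backlog_ms > arrival_ms) ? (backlog_ms - arrival_ms) : 0U;
            }
            /* Accept the new packet only if a buffer slot is free; otherwise drop it. */
            if (backlog_ms + service_ms <= capacity * service_ms) {
                backlog_ms += service_ms;
            }
        }
        return backlog_ms;
    }
    ```

    In this model, if each packet takes less time to send than the gap between arrivals, the backlog stays flat. If it takes longer, the backlog grows with every arrival until the buffer is full, after which the delay stays roughly constant and packets start being dropped.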

    Kind regards,
    Andreas

  • Hi,

    as mentioned at the end of my last report: we typically use 1 retransmission, sometimes 2, but not more. From a theoretical point of view, 1 retransmission (40 ms delay) or 2 retransmissions (20 ms delay) should be fine and should not cause any buffer to fill up.

  • Just to be clear, Publish Retransmit Count (4.2.2.6 in Mesh profile spec) is not the same as Network Retransmit Count (4.2.19.1 in Mesh profile spec) and Relay Retransmit Count (4.2.20.1 in Mesh profile spec). Publish Retransmit Count is on the access layer, and the other two retransmit counts are on the network/transport layer.

    Network retransmit is used to decide the number of times you should send a message. If the count is 1, then you will send 2 messages from the sender. In the case where you have relay retransmit count = 1, you will relay the message two times, sending 2 messages from the relay node to the receiver. For publish retransmit count = 1 you will repeat the entire procedure one more time, effectively doubling the number of messages sent.

    If you have a higher Publish Retransmit Count than you expect, you will be pushing, if not exceeding, the limit of what the throughput can handle, and you will most likely flood the buffer. There is also a throughput limit on how many messages you can send per 10 s sliding window, which is 100 network PDUs in a window (Mesh profile spec v1.0.1 sec 2.3.9.4).
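
    The way the three counts multiply can be sketched as below. The helper names are hypothetical and not part of any NCS or Zephyr API, and the relay figure assumes the relay hears at least one copy of each repetition of the publication.

    ```c
    #include <stdint.h>

    /* Packets the originating node puts on air for one publication. */
    static inline uint32_t sender_tx_per_publication(uint8_t network_retransmit_count,
                                                     uint8_t publish_retransmit_count)
    {
        /* Each count is "extra" transmissions, so add 1 for the original. */
        return ((uint32_t)publish_retransmit_count + 1U) *
               ((uint32_t)network_retransmit_count + 1U);
    }

    /* Packets one relay node puts on air for the same publication. */
    static inline uint32_t relay_tx_per_publication(uint8_t relay_retransmit_count,
                                                    uint8_t publish_retransmit_count)
    {
        /* The relay repeats its relaying for every repetition of the publication. */
        return ((uint32_t)publish_retransmit_count + 1U) *
               ((uint32_t)relay_retransmit_count + 1U);
    }
    ```

    With network and relay retransmit counts of 1 and a publish retransmit count of 0, both helpers give 2. Raising the publish retransmit count to 1 gives 4 in both cases, which is the doubling described above.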

    Edit: In addition, per 3.7.4.1: "Due to limited bandwidth available that is shared among all nodes and other Bluetooth devices, it is important to observe the volume of traffic a node is originating. A node should originate less than 100 Lower Transport PDUs in a moving 10-second window." This means that if you have a 100 ms publish interval and the publish retransmit count is larger than 0, the node is capped at sending one message every 200 ms (since every message then results in two messages, i.e. 2 lower transport PDUs), or it must not send a publish retransmit at all.
    Edit end.
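
    The arithmetic behind that cap can be sketched as follows. The helper `pdus_per_window` is hypothetical, and it assumes unsegmented messages, so that each transmission of a publication counts as exactly one lower transport PDU.

    ```c
    #include <stdint.h>

    #define WINDOW_MS 10000U /* the moving 10-second window from the spec quote */

    /*
     * Lower transport PDUs originated in one window for a periodic publication.
     * Assumption: unsegmented messages, one PDU per (re)transmitted publication.
     * An interval of 0 is treated here as "no periodic publication".
     */
    static inline uint32_t pdus_per_window(uint32_t publish_interval_ms,
                                           uint32_t publish_retransmit_count)
    {
        if (publish_interval_ms == 0U) {
            return 0U;
        }
        uint32_t publications = WINDOW_MS / publish_interval_ms;
        return publications * (publish_retransmit_count + 1U);
    }
    ```

    For a 100 ms interval with one publish retransmit this gives 200 PDUs per window, double the figure quoted from the spec. Either slowing down to a 200 ms interval or dropping the publish retransmit brings it back to 100.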

    There are options to increase the throughput by using advertisement extension features such as extended advertising and multiple advertisement sets, but we will have to look into how to do that for a setup that does not use relaying if that ever becomes a requirement. For a solution using relaying, the following should allow you to increase the throughput with the two mentioned extension features. 

    Enable CONFIG_BT_EXT_ADV

    Enable CONFIG_BT_MESH_ADV_EXT

    Increase CONFIG_BT_EXT_ADV_MAX_ADV_SETS

    Increase CONFIG_BT_MESH_RELAY_ADV_SETS

    Kind regards,
    Andreas

  • Tnx Andreas, we will try the mentioned features.

    Nevertheless it seems that the behavior of NCS 2.2.0 is a lot better.

    Unfortunately we have already qualified the stack based on NCS 1.8.0.

    Will Nordic provide a fix for NCS 1.8.0 resulting in a behavior similar to NCS 2.2.0?

    How can such problems and fixes be handled, since customers who have already qualified an older NCS version are not allowed to use a newer NCS without a new qualification? Is there a solution or way of working to handle such issues without needing re-qualification for a new NCS? Or do I have an incorrect understanding of NCS usage?

    Kind regards, Ulf

  • I've asked around to see if I can find someone who can answer the questions about fixing this in 1.8.0 and about the qualification procedure. I'll get back to you with an answer as soon as I get a conclusive answer, hopefully within the day/tomorrow.

    Kind regards,
    Andreas

  • Hi again,

    So the conclusion so far from the discussion I've had with my colleagues is the following:

    urieder said:
    Will Nordic provide a fix for NCS 1.8.0 resulting in a behavior similar to NCS 2.2.0?

    I am still waiting for confirmation from the product manager for Mesh on this, but initial talks about this question do not look promising with regard to adding that to 1.8.0. Nonetheless, I believe that it will be faster to shift to the newest release and generate a new BT qualification than it will be to investigate the changes that might cause the throughput issues you're seeing and patch them into an NCS v1.8.x.

    But as mentioned, I am still waiting for confirmation on this.

    urieder said:
    How can such problems and fixes be handled, since customers who have already qualified an older NCS version are not allowed to use a newer NCS without a new qualification? Is there a solution or way of working to handle such issues without needing re-qualification for a new NCS? Or do I have an incorrect understanding of NCS usage?

    In terms of the qualification process, unfortunately for this case, any customer moving from 1.8.0 to 2.2.0 would have to generate a new qualification listing. This is because the qualification process for Bluetooth is based on the selection of features in a checklist (namely the ICS). When you add a new feature or features to an existing design, this automatically triggers the need for a new qualification (removing a feature does not). So in this case, since quite a few features changed between 1.8.0 and 2.2.0, a new listing would be required.

    Generally speaking, if a 1.8.x release with the same features as 1.8.0 and only some bug fixes had been made, it would not require a new QDID.

    I will be out of office from this afternoon until Monday next week, but I will try to pass on any confirmation I get regarding the first question if it arrives before then.

    Kind regards,
    Andreas