
Need help to improve Bluetooth Mesh throughput

I am currently using two nRF52832 SDK boards to implement text message exchange between them via BLE Mesh.

I based my development on the Light Switch example included with Mesh SDK 1.0.0.

One of the boards is based on the Light Switch Client and the other on the Light Switch Server.

I added two opcodes to the original _SET, _GET, _SET_UNRELIABLE and _STATUS opcodes: "SEND_COMMAND" and "SEND_RESPONSE". I avoided building a custom model for now so that I would not run into any provisioning issues. The main purpose is to test the throughput and latency and see whether BLE Mesh is usable for a new product we are developing.

The client board is programmed to send a text string of about 15 characters. The opcode "SEND_COMMAND" is assigned to message.opcode, together with a data structure that contains the text string and its length, and access_model_publish() is called to send out the packet. The command is published to a group address so that every server board receives it. (Currently I only have 1 server connected, but the plan is for all servers subscribed to the group to receive the command in future.)
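For illustration, the command payload described above might be packed as in the sketch below. The layout (one length byte followed by the text, without a terminating NUL), the names pack_command() and access_payload_len(), the CMD_MAX_TEXT_LEN limit and the 3-byte opcode size are all assumptions, not the code in the attached project; 3 is used because vendor-model opcodes are 3 octets in the Mesh specification. Under these assumptions a 15-character command comes to 19 bytes of access payload, which is above the 11-byte limit for an unsegmented message, so it would be sent segmented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical limits: adjust to the real model definition. */
#define CMD_MAX_TEXT_LEN  32u  /* assumed maximum text length */
#define VENDOR_OPCODE_LEN 3u   /* vendor-model opcodes are 3 octets */

/* Packs text into buf as [length byte][text bytes] (no NUL terminator).
 * Returns the number of parameter bytes written, or 0 if it does not fit. */
static size_t pack_command(uint8_t *buf, size_t buf_size, const char *text)
{
    assert(buf != NULL && text != NULL);
    size_t len = strlen(text);
    if (len > CMD_MAX_TEXT_LEN || len + 1u > buf_size) {
        return 0;
    }
    buf[0] = (uint8_t)len;
    memcpy(&buf[1], text, len);
    return len + 1u;
}

/* Total access-layer payload: opcode plus the packed parameters. */
static size_t access_payload_len(size_t param_len)
{
    return VENDOR_OPCODE_LEN + param_len;
}
```

The packed buffer and its length would then be referenced from the message passed to access_model_publish().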

Upon receiving the packet, the server board decodes it and sends a response with the opcode "SEND_RESPONSE" and a response string of about 20 characters back to the client. The server also sends its response using access_model_publish().

So far the application code on both the client and the server works well when the client is triggered once a second to send the command string, and the client receives the response string from the server most of the time.

However, when I increase the frequency of sending commands, the communication seems to choke. The client can barely send and receive more than 2 command/response packets per second before it starts to choke and lose communication. Once the client stops sending commands, some of the previous packets can be seen going through, but they can take a few seconds to complete.

Has anyone experienced the same kind of throughput issue? Are there any parameters that I can adjust to increase the throughput? I read somewhere on this forum that a node may be able to send messages at up to 24 Hz (I assume this means the client could send up to 24 packets a second), but currently I cannot send and receive faster than 2 packets a second.

Hopefully someone who has been through the same experience can point me to possible solutions.

Thank you!


Susheel Nuguru over 7 years ago

Hi Alecontrol,

Sorry for the delayed answer. Can you please upload the source files so that we can reproduce this issue?
aleccontrol over 7 years ago in reply to Susheel Nuguru

simple_on_off.zip

Thanks for your reply.

I have attached the modified "simple_on_off_client" and "simple_on_off_server" files used for testing.

Client

To test the command/response, add the following code fragment to the main.c in the simple_on_off_client project:

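   extern uint32_t send_command_client(simple_on_off_client_t * p_client, uint8_t *command, uint16_t length);

   /* Send the 4-byte string "Test" to the group (the terminating NUL is not sent). */
   /* Assumes `status` is already declared as uint32_t in the calling function. */
   uint8_t command[] = "Test";
   status = send_command_client(&m_clients[GROUP_CLIENT_INDEX], command, 4);

Note: I have edited the length to 4 (it was 9 when first posted), since the command string now only contains 4 bytes.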

Server

The simple_on_off_server has been modified so that it simply copies the message it receives and echoes it back to the client as a response packet.

The client runs the function handle_response_cb() when it receives the response from the server, and you can log the received response there. In our program we actually send the message out over the UART, which requires linking with the app_uart files, so you can simply comment out the call to com1_strout().

We would be grateful if you could share your findings and suggested solutions.
leonwj over 7 years ago in reply to aleccontrol

Hello aleccontrol,

I just re-downloaded both the v1.0 and v1.0.1 Mesh SDKs from the Nordic website, and in both of them the PROV_STATE_CONFIG_PUBLICATION_HEALTH: case sets the publish resolution to ACCESS_PUBLISH_RESOLUTION_10S. (Possibly you are confusing it with the PROV_STATE_CONFIG_PUBLICATION_ONOFF: case, which is directly below it in the code.) So I'm not sure that setting pertains to your issue.

In any case, I went ahead and implemented the simple_on_off model you attached within our environment. As you suggested, we commented out the UART com1_strout() functionality and simply echoed back the received command string along with a counter value, so that we could iterate through the responses from each server node. We set the send_command_client(&m_clients[GROUP_CLIENT_INDEX], command, 4); call to trigger every time we received a heartbeat message from any of the 3 provisioned nodes.

So in this scenario, the 3 server nodes send out a heartbeat message at the specified interval (we tested 1 s and 0.1 s), which triggers the client node to send the "Test" command to the 3 servers, each of which then sends back an echoed response.

In short, we are not seeing the same throughput issue that you reported. I have attached the RTT log file from the 100 ms (0.1 s) test run, which we analyzed over a 5-minute (300 s) period. In that log we see significant mesh network throughput (albeit only a simple echo of the "Test" string with a counter value returned). We do, however, see approximately 25% packet loss at that message frequency.

Regards,

rtt-log-100ms-cmd-resp[3-svrs].zip
aleccontrol over 7 years ago in reply to leonwj

Hi Leonwj,

You are right that I was looking at the wrong case statement, so yes, the following statement is indeed the default for case PROV_STATE_CONFIG_PUBLICATION_HEALTH:

   pubstate.publish_period.step_res = ACCESS_PUBLISH_RESOLUTION_10S;

Thank you very much for testing the command/response program. I think the large percentage of packet loss at the server is probably why the communication appears to choke when we attempt to send faster than 2 packets a second.

In M2M communication, the client typically sends a command packet and waits for a response packet from the server before sending the next command. If the server fails to send a response, the client times out after some time (in our case the timeout is 1 second) and only then sends the next command packet.

If 25% of consecutive commands are lost, as in your test, the client fails to receive 1 in 4 responses. The client therefore cannot sustain smooth command/response communication with the server every 100 ms, because it constantly has to wait for the timeout before it can send the next packet. As a result, the communication appears to stop frequently.
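The effect of the timeout on throughput can be estimated with a simple expected-value model. Assume each exchange independently fails with probability p; a successful exchange takes one round trip RTT, and a failed one costs the full timeout T before the next command is sent. The average time per attempt is then (1 - p) * RTT + p * T. The sketch below computes this; the independence assumption and the 100 ms round trip used in the note after it are illustrative, not measured values.

```c
#include <assert.h>

/* Expected time per exchange attempt for a stop-and-wait client, in seconds.
 * p_fail:    probability that an exchange gets no response (assumed independent)
 * rtt_s:     round-trip time of a successful exchange
 * timeout_s: time waited before giving up and sending the next command */
static double expected_exchange_time_s(double p_fail, double rtt_s, double timeout_s)
{
    assert(p_fail >= 0.0 && p_fail <= 1.0);
    return (1.0 - p_fail) * rtt_s + p_fail * timeout_s;
}

/* Expected number of successful exchanges per second. */
static double expected_success_rate_hz(double p_fail, double rtt_s, double timeout_s)
{
    return (1.0 - p_fail) / expected_exchange_time_s(p_fail, rtt_s, timeout_s);
}
```

With p = 0.25, RTT = 0.1 s and T = 1 s, this gives 0.325 s per attempt and roughly 2.3 successful exchanges per second, against 10 per second with no losses, which suggests the timeout alone could account for much of the slowdown.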

What is the reason for such high packet loss, and is there any way to minimize it? The client and the server are less than 1 meter apart, and I only have 1 client and 1 server communicating. There were no other Bluetooth devices communicating nearby. What would be the maximum command/response exchange rate that keeps packet loss at or below 1%?

Thanks again for your expert advice!
leonwj over 7 years ago in reply to aleccontrol
Hello aleccontrol,

Apologies for the belated response to your latest comment...

A few things I wanted to point out regarding the packet loss statistics I presented:

1. The stats came from using the model you uploaded without any changes, so we would need to review the model before attributing any loss to the underlying mesh.

2. We implemented the cmd/resp mechanism in the heartbeat callback, which may not be the most appropriate place for it. For example, with the heartbeat period set to 100 ms for each server, whenever a server sends a heartbeat, the client immediately publishes a "Test" command to the group (i.e. 3 servers), each of which then sends back a response. So every 1/10th of a second we have:

a. 3 heartbeat messages (server(s) -> client)

b. 1 command message published to the group (client -> server(s))

c. 3 response messages sent to the client (server(s) -> client)

Given that there shouldn't be any relaying taking place (the nodes are in close proximity, and message caching should prevent repeats), that means 70 messages every second across the mesh. Also, we haven't done any timeline analysis to determine whether there are specific points in our basic test where packet loss starts, or whether the loss was uniform.
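The 70 messages per second figure follows from counting per 100 ms period: 3 heartbeats, 1 group command and 3 responses. The sketch below reproduces that count and, as a rough indicator of radio load, converts it to advertising packets under two assumptions that are not from the test setup: every message fits in a single unsegmented network PDU, and each transmission goes out on all three advertising channels (37, 38 and 39). The transmit_count parameter is a stand-in for the configurable number of times each network PDU is transmitted.

```c
#include <assert.h>

/* Messages per second in the heartbeat-triggered test: per period,
 * n_servers heartbeats + 1 group command + n_servers responses. */
static unsigned messages_per_second(unsigned n_servers, unsigned period_ms)
{
    assert(period_ms > 0u && 1000u % period_ms == 0u);
    unsigned per_period = n_servers + 1u + n_servers;
    return per_period * (1000u / period_ms);
}

/* Advertising packets per second, assuming one unsegmented network PDU
 * per message, sent transmit_count times on 3 advertising channels. */
static unsigned adv_packets_per_second(unsigned msgs_per_s, unsigned transmit_count)
{
    return msgs_per_s * transmit_count * 3u;
}
```

Retransmissions and segmented messages multiply this packet count further.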

I do recall reading a vendor-produced whitepaper in which circa 900 mesh devices were deployed over a 2,000 sq m area, with a target latency of 300 ms for message delivery under traffic configurations of low (~150 bps), medium (~1 kbps) and high (~3 kbps). I grabbed a screenshot of the results at the time (see below), which showed that at least 99.1% of messages were delivered successfully within 300 ms latency under high traffic (Enhanced configuration) in both sparse and dense deployments. As I recall, Enhanced meant that source message retransmissions and advertising randomization were performed, whereas Baseline was a standard mesh implementation. Also note that this wasn't Nordic's mesh stack! (If you are interested, let me know and I'll see if I can dig out the link to the whitepaper.)

I guess the general takeaways from the above would be:

that we shouldn't base any firm design decisions on the packet loss figures I provided (since the configuration was not optimized or validated)

the High traffic w/ Dense deployment Baseline figure (69.2%, from the table) does reflect a similar level of degradation to what we saw in our basic test

there are Enhanced optimizations that can be implemented to increase mesh reliability and throughput

the results reflect fairly well on Bluetooth mesh when compared to results I've seen for Zigbee, Z-Wave etc. with lower device numbers (the caveat being that this is only one specific configuration under test)
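To see why source retransmissions (part of the Enhanced configuration described above) help, consider an idealized model in which each transmission is lost independently with probability p; a message sent n times is then lost only if all n copies are lost, with probability p^n. The sketch below finds the smallest n that meets a target loss; residual_loss() and transmissions_needed() are hypothetical helpers. Real losses are often correlated (collisions, congestion), and every extra copy adds air time, so this model is optimistic: a back-of-the-envelope check, not a prediction.

```c
#include <assert.h>

/* Probability that all n independent transmissions are lost. */
static double residual_loss(double p_loss, unsigned n)
{
    assert(p_loss >= 0.0 && p_loss <= 1.0);
    double r = 1.0;
    for (unsigned i = 0; i < n; i++) {
        r *= p_loss;
    }
    return r;
}

/* Smallest number of transmissions (up to max_n) that brings the residual
 * loss to or below target; returns 0 if max_n transmissions are not enough. */
static unsigned transmissions_needed(double p_loss, double target, unsigned max_n)
{
    for (unsigned n = 1; n <= max_n; n++) {
        if (residual_loss(p_loss, n) <= target) {
            return n;
        }
    }
    return 0;
}
```

With p = 0.25, three transmissions leave about 1.6% residual loss and four leave about 0.4%, so under this model the 1% target asked about earlier would need four copies of each message. A command/response exchange also needs both messages to arrive.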

I hope the above helps.

Regards,
leonwj over 7 years ago in reply to leonwj

Additionally, for yourself and anyone else interested in the throughput expectations of Bluetooth Mesh, you might want to listen to the following podcast/webinar, in which Szymon Slupik, Chair of the Bluetooth SIG Mesh Working Group, explains why he believes Bluetooth Mesh meets the required performance levels (among other benefits) for both industrial and home use.

(Please also note that I'm not affiliated with Szymon and/or Mr. Beacon in any way; I have simply followed the mesh specification process since FRD 0.7.)

Regards,
aleccontrol over 7 years ago in reply to leonwj

Hi Leonwj,

I appreciate that you took the time to reply and provide some useful statistics. I agree that your test case was pretty intense, since you used 3 servers to trigger the client to send a command every 100 ms, so the airwaves were probably quite congested.

In our case the client simply sends out a command (the command frequency is entirely controlled by the client) and then waits to receive the response from a single server. Only when the client has received the response will it send another command. This is a common method in industrial control, where the master (client) sends a command to a slave (server) and waits for the slave to return a response before continuing.
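The master/slave pattern described above is a stop-and-wait protocol. The sketch below shows one way to structure it on the client. It is hypothetical (saw_client_t and the saw_* functions are not Mesh SDK APIs) and assumes the server echoes back a sequence number, like the counter in the earlier test, so that a late response to a timed-out command is not mistaken for the answer to the current one. saw_on_tick() would be called from a periodic timer and saw_on_response() from the response handler, such as handle_response_cb().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stop-and-wait client state (not a Mesh SDK type). */
typedef struct {
    bool     waiting;      /* a command is outstanding */
    uint8_t  seq;          /* sequence number of the outstanding command */
    uint32_t deadline_ms;  /* time at which the outstanding command times out */
    uint32_t timeout_ms;   /* response timeout, e.g. 1000 or 2000 */
    uint32_t timeouts;     /* number of commands that got no response */
} saw_client_t;

/* Returns true if a new command may be sent now and marks it outstanding;
 * the caller would then publish the command carrying c->seq. */
static bool saw_try_send(saw_client_t *c, uint32_t now_ms)
{
    assert(c != NULL);
    if (c->waiting) {
        return false;
    }
    c->seq++;
    c->waiting = true;
    c->deadline_ms = now_ms + c->timeout_ms;
    return true;
}

/* Call from the response handler; responses to older commands are ignored. */
static void saw_on_response(saw_client_t *c, uint8_t seq)
{
    if (c->waiting && seq == c->seq) {
        c->waiting = false;
    }
}

/* Call from a periodic timer; frees the client once the deadline passes.
 * The signed difference keeps the comparison correct across counter wrap. */
static void saw_on_tick(saw_client_t *c, uint32_t now_ms)
{
    if (c->waiting && (int32_t)(now_ms - c->deadline_ms) >= 0) {
        c->waiting = false;
        c->timeouts++;
    }
}
```

After a timeout the client is free to send the next command, matching the behaviour described above; whether it should instead retry the same command is an application decision.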

Would you be able to test such a setup and determine the expected latency of a complete command/response exchange? We are interested in knowing how many command/response exchanges can realistically be accomplished over BLE Mesh, to determine whether we could use BLE Mesh as a communication medium.

What we experienced is that the client cannot seem to sustain continuous command/response exchanges with the server without frequent errors. One in every few commands sent by the client gets no response from the server, so the client blocks until the timeout (set at 2 seconds). In our test case we are sending just a few bytes (below the 11-byte threshold at which segmentation and reassembly, SAR, kicks in), but in real life the client and server will send and receive many more bytes per exchange, so it is important that latency and packet loss are kept to a minimum.
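The 11-byte threshold comes from the lower transport layer: an unsegmented access message carries at most 15 octets of upper transport PDU, including a 4-octet TransMIC, leaving 11 octets for opcode plus parameters. Larger messages are split into segments of 12 upper-transport octets each, up to 32 segments. The sketch below counts the network PDUs an access payload needs, assuming a 4-octet TransMIC; network_pdus_needed() is a hypothetical helper, and the SDK may impose a lower maximum than the specification.

```c
#include <assert.h>
#include <stddef.h>

/* Lower-transport limits for access messages, assuming a 4-octet TransMIC. */
#define TRANSMIC_LEN             4u
#define UNSEG_MAX_ACCESS_PAYLOAD 11u  /* 15-octet upper transport PDU - TransMIC */
#define SEG_PAYLOAD              12u  /* upper transport octets per segment */
#define MAX_SEGMENTS             32u

/* Network PDUs needed for an access payload (opcode + parameters) of
 * access_len bytes: 1 if it fits unsegmented, otherwise the segment count.
 * Returns 0 if the payload exceeds the 32-segment limit. */
static unsigned network_pdus_needed(size_t access_len)
{
    assert(access_len > 0u);  /* every access message has an opcode */
    if (access_len <= UNSEG_MAX_ACCESS_PAYLOAD) {
        return 1u;
    }
    size_t segments = (access_len + TRANSMIC_LEN + SEG_PAYLOAD - 1u) / SEG_PAYLOAD;
    return segments <= MAX_SEGMENTS ? (unsigned)segments : 0u;
}
```

For example, a 3-byte vendor opcode plus 100 bytes of data needs 9 segments, each of which is a separate radio packet that has to arrive.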

In our application the client may occasionally need to retrieve, say, 2000 bytes of data from a server. What would be the best way to achieve this in the shortest possible time? In traditional communication via UART, the client sends a command packet telling the server to return up to, say, 100 bytes per response packet, so it takes about 20 successful exchanges to retrieve 2000 bytes. How long would you expect it to take to retrieve this amount of data?
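One way to compare chunk sizes is to count round trips and response PDUs. The sketch below assumes a 3-byte vendor opcode, a 4-octet TransMIC and the specification's 380-byte maximum access payload; plan_bulk_read(), response_pdus() and bulk_plan_t are hypothetical. It deliberately does not estimate time, since that depends on segment spacing, retransmissions and loss, none of which were measured in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define VENDOR_OPCODE_LEN        3u
#define TRANSMIC_LEN             4u
#define UNSEG_MAX_ACCESS_PAYLOAD 11u
#define SEG_PAYLOAD              12u
#define MAX_ACCESS_PAYLOAD       380u  /* 32 segments * 12 - 4-octet TransMIC */

typedef struct {
    unsigned exchanges;   /* command/response round trips */
    unsigned total_pdus;  /* network PDUs across all responses */
} bulk_plan_t;

/* Network PDUs for one response carrying params data bytes. */
static unsigned response_pdus(size_t params)
{
    size_t access_len = VENDOR_OPCODE_LEN + params;
    if (access_len <= UNSEG_MAX_ACCESS_PAYLOAD) {
        return 1u;
    }
    return (unsigned)((access_len + TRANSMIC_LEN + SEG_PAYLOAD - 1u) / SEG_PAYLOAD);
}

/* Plans reading total bytes with up to chunk data bytes per response. */
static bulk_plan_t plan_bulk_read(size_t total, size_t chunk)
{
    assert(chunk > 0u && VENDOR_OPCODE_LEN + chunk <= MAX_ACCESS_PAYLOAD);
    size_t full = total / chunk;
    size_t rest = total % chunk;
    bulk_plan_t plan;
    plan.exchanges = (unsigned)(full + (rest ? 1u : 0u));
    plan.total_pdus = (unsigned)(full * response_pdus(chunk)
                                 + (rest ? response_pdus(rest) : 0u));
    return plan;
}
```

Under these assumptions, 20 exchanges of 100 bytes and 6 exchanges of up to 377 bytes move roughly the same number of response packets (180 versus 171), so larger chunks mainly save command packets and round-trip waits rather than air time.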

Thank you very much!