Mesh back and forth seems to break connection

Hi,

We have a customer with two CoAP hosts and several CoAP clients in the form of wireless sensors. The sensors are paired to a single host. The pairing is done at the application level: the sensor discovers the network IP address of the host that is in pairing mode. All the devices have the same PAN ID and network key.

Recently we have seen a scenario where some sensors seemingly stopped communicating with the paired host. Looking at the RSSI graphs, we wondered whether this is caused by a sensor constantly swinging back and forth between the two hosts (one host acting as a router). We don't have access to the CLI of the hosts, as this is a remote site. What we do see is the RSSI reported back by the sensor, which is its RSSI towards the router/leader it is directly connected to at that time.

Any ideas?

Cheers,

Kaushalya

0 kaushalyasat over 1 year ago in reply to Edvin
Hi Edvin,

Edvin said:
I was thinking about these. What error logs do you refer to?

int coap_send_request(enum coap_method method, const struct sockaddr *addr,
                      const char *const *uri_path_options, uint8_t *payload,
                      uint16_t payload_size, coap_reply_t reply_cb)
{
    int ret;
    struct coap_packet request;
    uint8_t buf[MAX_COAP_MSG_LEN];

    ret = coap_init_request(method, COAP_TYPE_NON_CON, uri_path_options,
                            payload, payload_size, &request, buf);
    if (ret < 0) {
        LOG_ERR("CoAP init failed: %d", errno); // <---------------- ERROR LOG
        goto end;
    }

    if (reply_cb != NULL) {
        coap_set_response_callback(&request, reply_cb);
    }

    ret = coap_send_message(addr, &request);
    if (ret < 0) {
        LOG_ERR("Transmission failed: %d", errno); // <---------------- ERROR LOG
        goto end;
    }

end:
    return ret;
}

I was referring to the error logs marked above. I don't see any of these errors in my case, so I assume that coap_send_request() executes without any error. Am I correct?

Edvin said:
That would be if send_sensor_update() is not called. Do you have something indicating whether or not these are called at the time when the devices become unavailable?

I don't think this is the case, as I can see this log message continuously from a disconnected SED.

LOG_INF ("ZS %d, RSSI %d, LQI %d, LQO %d, FW %04x", the_sensor_device->zoneState, RSSI, linkQalIn, linkQualOut, FWRevNum);

So it seems like my application code gets called continuously, but data is not being sent from that point onwards.

When we looked at the console of the host, we couldn't see the log message for data received from these disconnected sensors. The disconnection could originate in:

1. sensor thread stack

2. host thread stack

3. host application

Do you see any other ways?

Cheers,

Kaushalya

0 Edvin over 1 year ago in reply to kaushalyasat

kaushalyasat said:
I was referring to the error logs marked above. I don't see any of these errors in my case, so I assume that coap_send_request() executes without any error. Am I correct?

Yes, provided that error messages from that file are printed at all. You can test this by adding "LOG_ERR("Test");" to see whether these error messages are visible in the log.

kaushalyasat said:
I don't think this is the case, as I can see this log message continuously from a disconnected SED.

Where are those from? What triggers these?

It would be more interesting to continuously see the return value from coap_send_message(), or whatever function you use to send, at the time of the disconnection.

Try adding prints of the return value of the function that doesn't work (regardless of whether it is 0 or something else).

Does it:

1: Print that it returns 0 even though it is disconnected?

2: Stop printing altogether?

3: Print that it returns something else than 0?
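To make those three cases easy to tell apart in a log, the return value can be printed unconditionally together with a running sequence number, so that a gap or a halt in the sequence shows case 2. This is only a minimal sketch: `classify_send_ret()`, `log_send_ret()` and the enum names are hypothetical helpers, not part of any SDK, and `printf` stands in for whatever logging macro the application uses.

```c
#include <stdio.h>

/* Hypothetical categories for the value returned by the send function. */
enum send_ret_class {
    SEND_RET_ZERO,      /* returned exactly 0 */
    SEND_RET_POSITIVE,  /* returned > 0, e.g. a byte count */
    SEND_RET_NEGATIVE   /* returned < 0, i.e. an error code */
};

/* Map a raw return value onto one of the categories above. */
static enum send_ret_class classify_send_ret(int ret)
{
    if (ret < 0) {
        return SEND_RET_NEGATIVE;
    }
    if (ret == 0) {
        return SEND_RET_ZERO;
    }
    return SEND_RET_POSITIVE;
}

/* Print every result, success or not. The sequence number makes it visible
 * in the log if printing stops or skips entries. */
static enum send_ret_class log_send_ret(unsigned long seq, int ret)
{
    static const char *const names[] = { "zero", "positive", "negative" };
    enum send_ret_class cls = classify_send_ret(ret);

    printf("tx #%lu ret=%d (%s)\n", seq, ret, names[cls]);
    return cls;
}
```

Calling `log_send_ret(seq++, ret)` right after the send call would then leave one line per attempt in the log.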

BR,

Edvin

0 kaushalyasat over 1 year ago in reply to Edvin
Edvin said:
Yes, provided that error messages from that file are printed at all. You can test this by adding "LOG_ERR("Test");" to see whether these error messages are visible in the log.

Yes, I can see the LOG_ERR output, both from my application level and from coap_send_request(). I tested it from the console and also from RTT Viewer.

Edvin said:
Where are those from? What triggers these?

This log message is printed by send_sensor_update() in coap_client_utils.c, just before it calls coap_send_request(). So we know the flow is working up to that point.

Edvin said:
It would be more interesting to continuously see the return value from coap_send_message()

Agreed. Unfortunately fw Rev 1.1.1.0, which I sent to you first, doesn't have that, my bad. The latest fw shows it, and we have also implemented a noinit memory section where we keep the last value returned by coap_send_request(). In this section we also maintain counters for failed and successful tx. So far we haven't seen any failures, but again it might take months before that happens.
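For readers following along, a noinit diagnostics record of this kind could look roughly like the sketch below. It is not our actual firmware code: the struct layout, the names `tx_diag`, `tx_diag_init()` and `tx_diag_record()`, and the magic value are all made up for illustration. It assumes the `__noinit` section tag provided by Zephyr; the fallback define only exists so that the snippet also compiles on a host, where the data is of course not retained.

```c
#include <stdbool.h>
#include <stdint.h>

/* On a Zephyr build __noinit places the variable in a section that is not
 * zeroed at boot. The empty fallback is only for host compilation. */
#ifndef __noinit
#define __noinit
#endif

/* Arbitrary marker (assumption) used to tell retained data from random RAM
 * contents after a cold power-up. */
#define TX_DIAG_MAGIC 0x54584447u

struct tx_diag {
    uint32_t magic;      /* TX_DIAG_MAGIC when the record is valid */
    int32_t  last_ret;   /* last value returned by the send function */
    uint32_t tx_ok;      /* sends that returned >= 0 */
    uint32_t tx_failed;  /* sends that returned < 0 */
};

static __noinit struct tx_diag tx_diag;

/* Call once at boot. Returns true if a valid record survived the reset,
 * false if the record had to be initialised from scratch. */
static bool tx_diag_init(void)
{
    if (tx_diag.magic == TX_DIAG_MAGIC) {
        return true;
    }
    tx_diag.magic = TX_DIAG_MAGIC;
    tx_diag.last_ret = 0;
    tx_diag.tx_ok = 0;
    tx_diag.tx_failed = 0;
    return false;
}

/* Call after every send attempt with the raw return value. */
static void tx_diag_record(int ret)
{
    tx_diag.last_ret = (int32_t)ret;
    if (ret < 0) {
        tx_diag.tx_failed++;
    } else {
        tx_diag.tx_ok++;
    }
}
```

The magic value matters because noinit RAM holds arbitrary data after a real power loss, so the counters should only be trusted when the marker is intact.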

Edvin said:
Try adding prints of the return value of the function that doesn't work (regardless of whether it is 0 or something else).

It is done in the latest fw. We are waiting for any sensor to go into this mode again. Currently we get 47 as the return value, which I think is the number of bytes sent(?)

Edvin said:
Print that it returns 0 even though it is disconnected?

It prints whatever is returned from coap_send_request(), as follows. We haven't seen it return 0, as there is always a network to connect to in the lab. Also, if it is not connected, send_sensor_update() wouldn't get called.

int ret;
. . .
ret = coap_send_request(COAP_METHOD_PUT,
                        (const struct sockaddr *)&unique_local_addr,
                        sensor_option, payload, sizeof(payload), NULL);
LOG_INF("ZS %d, RSSI %d, LQI %d, LQO %d, FW %04x RET: %d, RLOC: %04x",
        the_sensor_device->zoneState, RSSI, linkQalIn, linkQualOut,
        FWRevNum, ret, rloc);

Edvin said:
Stop printing altogether?

Do you mean that the printing suddenly stops without any reason? We haven't seen anything like that.

Edvin said:
Print that it returns something else than 0?

Yeah, we always see 47 so far.

0 Edvin over 1 year ago in reply to kaushalyasat

kaushalyasat said:
Yeah, we always see 47 so far.

Ok, so even when you do not see that the messages are being received, you can see that it prints the return value (+)47, which is the length of the packet that you are sending?

kaushalyasat said:
Yes, I can see the LOG_ERR output, both from my application level and from coap_send_request(). I tested it from the console and also from RTT Viewer.

Ok, good. That means that there is not some sort of config that disables that logging instance.

kaushalyasat said:
In this section we also maintain counters for failed and successful tx. So far we haven't seen any failures, but again it might take months before that happens.

So this means that so far you have not confirmed that the tx function returns 47 while the issue is ongoing? Or have you confirmed this?

Best regards,

Edvin

0 kaushalyasat over 1 year ago in reply to Edvin

Hi Edvin,

Edvin said:
Ok, so even when you do not see that the messages are being received, you can see that it prints the return value (+)47, which is the length of the packet that you are sending?

Yes. Yesterday we had the first sensor fall off 'with child supervision'. I couldn't see any messages regarding child supervision in these sensors, though. But I can verify the following:

1. The sensors are transmitting, and the data is sent to the parent successfully. Beyond the parent, I couldn't trace the packets any further, as they cannot be decrypted.

2. The intended destination doesn't receive these packets at the application level.

So I think the issue is in losing an FTD-to-FTD (router-to-router) connection in multi-hop scenarios.
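A positive return value only says that the local stack accepted the datagram. With COAP_TYPE_NON_CON and a NULL reply callback, the sensor never learns whether the host actually received it. One way a sensor could notice this state by itself is to count sends that were never answered, as in the sketch below. It is a hypothetical illustration, not something we run today: it assumes the host replies to each update and that a reply callback is passed instead of NULL, and the names `reply_watchdog`, `watchdog_on_send()`, `watchdog_on_reply()`, `watchdog_link_suspect()` and the threshold are invented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: number of consecutive unanswered sends after which
 * the path to the host is considered suspect. */
#define MAX_UNANSWERED 5u

struct reply_watchdog {
    uint32_t unanswered;  /* sends since the last reply from the host */
};

/* Call after every send attempt, whatever the send function returned. */
static void watchdog_on_send(struct reply_watchdog *w)
{
    w->unanswered++;
}

/* Call from the reply callback when the host has answered. */
static void watchdog_on_reply(struct reply_watchdog *w)
{
    w->unanswered = 0;
}

/* True once too many sends in a row have gone unanswered. */
static bool watchdog_link_suspect(const struct reply_watchdog *w)
{
    return w->unanswered >= MAX_UNANSWERED;
}
```

What to do once `watchdog_link_suspect()` becomes true (log it, store it in retained memory, or trigger some recovery action) is a separate decision that this sketch leaves open.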

Edvin said:
So this means that so far you have not confirmed that the tx function returns 47 while the issue is ongoing? Or have you confirmed this?

I can confirm that 47 is returned while a sensor, or more precisely the system, is in this state. As I mentioned, I don't think this is related to the sensor (SED). It may well be an issue in the router-to-router hop.

I am now researching how MLE works. What happens when a router loses its connection with another router? I think it will try to find another path. So my question is: why can't a path be found to the destination router/leader when one was established earlier? The only change that could have happened is that some routers may have been power cycled or switched off. But I have verified that the sensors can connect to the destination even when all the other routers are powered down.

Cheers,

Kaushalya
