Mesh back and forth seems to break connection

Hi,

We have a customer with two CoAP hosts and several CoAP clients in the form of wireless sensors. Each sensor is paired to a single host. The pairing is done at the application level, where the sensor discovers the network IP address of the host that is in pairing mode. All the devices share the same PAN ID and network key.

Recently we have seen a scenario where some sensors seemingly stopped communicating with the paired host. Looking at the RSSI graphs, we wondered whether this is caused by a sensor constantly swinging back and forth between the two hosts (one host acting as a router). We don't have access to the CLI of the hosts, as this is a remote site. We do see the RSSI reported back by the sensor; this is its RSSI towards the router/leader it is directly connected to at the time.

Any ideas?

Cheers,

Kaushalya

  • Hello,

    Can you please try to capture a sniffer trace using the nRF Sniffer for 802.15.4?

    What do you mean by "swinging back and forth between two hosts"? Do you mean that it (an End Device, I assume) keeps changing between two routers?

    Does it ever re-enter the network, or does it disconnect completely?

    Do the nodes move around (physically)? Or are the nodes more or less stationary?

    Best regards,

    Edvin

  • Hi Edvin,

    I am building another FW in which the return values get logged. Unfortunately, we have to wait until this problem crops up again, since we don't know how to recreate it.

    Is there a way to rearrange Wireshark packets so that we can push an MLE packet up? Then the subsequent entries could get the extended address from that.

    Also, as I mentioned earlier, we don't see any other error messages in my RTT viewer. When I look into 'coap_send_request (...)', it calls the 'coap_init_request (...)' and 'coap_send_message(...)' functions. Both of these have error logs printed if something goes wrong. So can we assume that no errors were reported during packet transmission?

    How do we reset the network? What APIs should I use?

    Cheers,

    Kaushalya

  • Hello Kaushalya,

    I don't think it is possible to rearrange the packets like this, no. If anything, I think you would have to edit the raw .pcapng file.

    kaushalyasat said:

    Also, as I mentioned earlier, we don't see any other error messages in my RTT viewer. When I look into 'coap_send_request (...)', it calls the 'coap_init_request (...)' and 'coap_send_message(...)' functions. Both of these have error logs printed if something goes wrong. So can we assume that no errors were reported during packet transmission?

    That depends on the implementation. What error messages did you see? Are they printed from within the implementation of coap_send_message()? Or from the function that checked their return value?

    You also need to consider the possibility that for some reason these functions aren't called at all, due to some other error.

    Best regards,

    Edvin

  • Hi Edvin,

    I don't think it is possible to rearrange the packets like this, no.

    As I mentioned, one of the Nordic engineers suggested something like that in this thread: nRF Sniffer integration for 802.15.4 in a python script (Pcap file problems)

    Also, I can see that in Wireshark you can time-shift packets around like this.

    I tried doing this, but couldn't see any effect. I am not sure my operation was correct, so I will leave this for a Wireshark guru to comment on.

    What error messages did you see?

    Do you mean error messages I saw in the RTT viewer? I didn't see any.

    Are they printed from within the implementation of coap_send_message()?

    Yes. If you look at the coap_send_request (...) function in coap_utils.c, you can see that a failure of coap_init_request (...) is logged from within it, and a failing return from coap_send_message (...) is logged as well. I didn't go any deeper, as I got stuck in z_impl_zsock_sendto (...) in sockets.c.

    You also need to consider the possibility that for some reason these functions aren't called at all, due to some other error.

    static void send_sensor_update (struct k_work *item) {
    	/* ... */
        
    	LOG_INF ("ZS %d, RSSI %d, LQI %d, LQO %d, FW %04x", the_sensor_device->zoneState, RSSI, linkQalIn, linkQualOut, FWRevNum);
    
    	memcpy (&payload[1], myExtAddr.m8, 8);
    	memcpy (&payload[9], myEUI64.m8, 8);
    
    	payload[17] = ((the_sensor_device->temp)>>8) & 0xff;
    	payload[18] = (the_sensor_device->temp) & 0x00ff;
    
    	payload[19] = ((the_sensor_device->vbat)>>8) & 0xff;
    	payload[20] = (the_sensor_device->vbat) & 0x00ff;
    	payload[21] = RSSI;
    	payload[22] = linkQalIn;
    	payload[23] = linkQualOut;
    	payload[24] = the_sensor_device->zoneState;
    	payload[25] = (uint8_t)(FWRevNum >> 8);
    	payload[26] = (uint8_t)(FWRevNum & 0x00ff);
    
    	ARG_UNUSED(item);
    
    	if (net_ipv6_is_addr_unspecified (&unique_local_addr.sin6_addr)) {
    		LOG_WRN("Peer address not set. Activate 'provisioning' option on the server side");
    		return;
    	}
    
    	coap_send_request(COAP_METHOD_PUT, (const struct sockaddr *)&unique_local_addr, sensor_option, payload, sizeof(payload), NULL);

    This is the code section from the log print to coap_send_request (...) in my code. Can you think of any way that coap_send_request (...) may not have been called? If the IPv6 address were missing, I would get a LOG_WRN, which I don't get.
    Cheers,
    Kaushalya

  • kaushalyasat said:
    When I look into 'coap_send_request (...)', it calls the 'coap_init_request (...)' and 'coap_send_message(...)' functions. Both of these have error logs printed if something goes wrong.

    I was thinking about these. What error logs do you refer to?

    kaushalyasat said:
    Can you think of any ways that coap_send_request (..) may not have been called?

    That would be if send_sensor_update() is not called. Do you have something indicating whether or not these are called at the time when the devices become unavailable?

    I am sorry, but we are several months into this, and I am not quite sure what we are discussing anymore. You have some devices in a remote area that you do not have physical access to, and some of them drop out from time to time, right?

    Perhaps you can try to replicate this in a local setup where you have access to your devices? Reset the entire network, and start sniffing before you start your devices, so that the sniffer can capture everything from the beginning. Then it should be able to pick up and resolve all the short addresses.

    When you detect the issue, look into the log from that particular device. Does it say anything when trying to call coap_send_request()? Any error messages? coap_send_request() also returns a value indicating how it went: 0 on success, and a negative number on failure. Try printing something in the log in the cases where it returns < 0. What does it return when it fails?
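
The return-value check suggested above could look roughly like the following. This is a minimal, standalone sketch rather than the actual firmware: `report_send_result()` is a hypothetical helper, and `printf()` stands in for the Zephyr `LOG_ERR()`/`LOG_INF()` macros so that the snippet compiles outside the SDK. It only relies on what is stated above, namely that a negative return value means failure.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: print the value returned by coap_send_request()
 * on every call, not only on failure, so the log shows what the function
 * returned at the moment a device drops out.
 * printf() is a stand-in for LOG_ERR()/LOG_INF().
 */
static bool report_send_result(int ret)
{
	if (ret < 0) {
		printf("coap_send_request failed: %d\n", ret);
		return false;
	}

	printf("coap_send_request returned: %d\n", ret);
	return true;
}
```

In the application it would simply wrap the existing call, for example `report_send_result(coap_send_request(...))`. Note that the request is built with COAP_TYPE_NON_CON, so a non-negative return value most likely only shows that the packet was handed to the stack, not that the host received it.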

    Best regards,

    Edvin

  • Hi Edvin,

    I was thinking about these. What error logs do you refer to?

    int coap_send_request(enum coap_method method, 
    					  const struct sockaddr *addr,
    		      		  const char *const *uri_path_options, 
    					  uint8_t *payload,
    		      		  uint16_t payload_size, 
    					  coap_reply_t reply_cb)
    {
    	int ret;
    	struct coap_packet request;
    	uint8_t buf[MAX_COAP_MSG_LEN];
    
    	ret = coap_init_request(method, COAP_TYPE_NON_CON, uri_path_options,
    				payload, payload_size, &request, buf);
    	if (ret < 0) {
    		LOG_ERR ("CoAP init failed: %d", errno); // <---------------- ERROR LOG
    		goto end;
    	}
    
    	if (reply_cb != NULL) {
    		coap_set_response_callback(&request, reply_cb);
    	}
    
    	ret = coap_send_message(addr, &request);
    	if (ret < 0) {
    		LOG_ERR("Transmission failed: %d", errno);  // <---------------- ERROR LOG
    		goto end;
    	}
    
    end:
    	return ret;
    }

    I was referring to the error logs as marked above. I don't see any of these errors in my case. So I assume that 'coap_send_request ()' executes without any error. Am I correct?

    That would be if send_sensor_update() is not called. Do you have something indicating whether or not these are called at the time when the devices become unavailable?

    I don't think this is the case, as I can see this log message continuously from a disconnected SED.

    LOG_INF ("ZS %d, RSSI %d, LQI %d, LQO %d, FW %04x", the_sensor_device->zoneState, RSSI, linkQalIn, linkQualOut, FWRevNum);

    So it seems that my application code keeps getting called, but data is no longer being sent from that point onwards.

    When we look at the console of the host, we can't see the log message for data received from these disconnected sensors. The disconnection could originate in:

    1. the sensor's Thread stack

    2. the host's Thread stack

    3. the host application

    Do you see any other ways?

    Cheers,

    Kaushalya

  • kaushalyasat said:
    I was referring to the error logs as marked above. I don't see any of these errors in my case. So I assume that 'coap_send_request ()' executes without any error. Am I correct?

    Yes, provided that error messages from that file are printed at all. You can test this by adding "LOG_ERR("Test");" to see whether these error messages are visible in the log.

    kaushalyasat said:
    I don't think this is the case, as I can see this log message continuously from a disconnected SED.

    Where are those from? What triggers them?

    It would be more interesting to continuously see the return value from coap_send_message(), or whatever function you use to send, at the time of the disconnection.

    Try adding prints of the return value of the function that doesn't work (regardless of whether it is 0 or something else). 

    Does it:

    1: Print that it returns 0 even though it is disconnected?

    2: Stop printing altogether?

    3: Print that it returns something other than 0?

    BR,

    Edvin

  • Yes, provided that error messages from that file are printed at all. You can test this by adding "LOG_ERR("Test");" to see whether these error messages are visible in the log.

    Yes, I can see the LOG_ERR output, both from my application level and from coap_send_request(). I tested it on the console and in the RTT viewer.

    Where are those from? What triggers them?

    This log message is printed by send_sensor_update () in coap_client_utils.c, just before calling coap_send_request (). So we know the flow is working up to that point.

    It would be more interesting to continuously see the return value from coap_send_message()

    Agreed. Unfortunately FW Rev 1.1.1.0, which I sent to you first, doesn't have that - my bad. The latest FW logs it, and we have also implemented a noinit memory section where we keep the last value returned by coap_send_request(). In this section we also maintain counters for failed and successful transmissions. So far we haven't seen any failures, but again it might take months before that happens.
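
The noinit record described above is roughly along these lines. This is a simplified, standalone sketch and not the actual firmware: the names `tx_stats`, `tx_stats_init()`, `tx_stats_record()` and the marker value are made up for illustration, and placing the object in a noinit section (for example with Zephyr's `__noinit` attribute, which is an assumption here) is left out so that the snippet compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TX_STATS_MAGIC 0x54585354u /* arbitrary marker, ASCII "TXST" */

/* Hypothetical record kept in RAM that is not cleared on a warm reset. */
struct tx_stats {
	uint32_t magic;     /* equals TX_STATS_MAGIC once initialised */
	uint32_t tx_ok;     /* sends that returned >= 0 */
	uint32_t tx_failed; /* sends that returned < 0 */
	int32_t last_ret;   /* most recent return value */
};

/* No-init RAM holds arbitrary data after power-on, so the counters are
 * cleared only when the marker is missing; a valid record is preserved.
 */
static void tx_stats_init(struct tx_stats *s)
{
	assert(s != NULL);

	if (s->magic != TX_STATS_MAGIC) {
		s->magic = TX_STATS_MAGIC;
		s->tx_ok = 0;
		s->tx_failed = 0;
		s->last_ret = 0;
	}
}

/* Store the latest return value and bump the matching counter. */
static void tx_stats_record(struct tx_stats *s, int ret)
{
	assert(s != NULL);

	s->last_ret = ret;
	if (ret < 0) {
		s->tx_failed++;
	} else {
		s->tx_ok++;
	}
}
```

`tx_stats_init()` would be called once at boot and `tx_stats_record()` after every coap_send_request() call, so the counters can still be read out after a device has been reset.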

    Try adding prints of the return value of the function that doesn't work (regardless of whether it is 0 or something else). 

    It is done in the latest FW. We are waiting for any sensor to go into this state again. Currently we get 47 as the return value, which I think is the number of bytes sent(?)

    Print that it returns 0 even though it is disconnected?

    It prints whatever is returned from coap_send_request(), as follows. We haven't seen it return 0, as there is always a network to connect to in the lab. Also, if it is not connected, send_sensor_update () wouldn't get called.

    int ret;
    /* ... */
    ret = coap_send_request(COAP_METHOD_PUT, (const struct sockaddr *)&unique_local_addr, sensor_option, payload, sizeof(payload), NULL);
    LOG_INF ("ZS %d, RSSI %d, LQI %d, LQO %d, FW %04x RET: %d, RLOC: %04x", the_sensor_device->zoneState, RSSI, linkQalIn, linkQualOut, FWRevNum, ret, rloc);

    Stop printing altogether?

    Do you mean the printing suddenly stopping without any reason? We haven't seen anything like that.

    Print that it returns something other than 0?

    Yeah, we always see 47 so far.
