Mesh back and forth seems to break connection

Hi,

We have one customer with two CoAP hosts and some CoAP clients in the form of wireless sensors. Each sensor is paired to a single host. The pairing is done at the application level, where the sensor discovers the network IP of the host during pairing. All the devices have the same PAN ID and network key.

Recently we have seen a scenario where some sensors seemingly stopped communicating with the paired host. Looking at the RSSI graphs, we wondered whether this is caused by a sensor constantly swinging back and forth between two hosts (one host acting as a router). We don't have access to the CLI of the hosts, as this is a remote site. We do see the sensor RSSI reported back; this is its RSSI with the router/leader it is directly connected to at the time.

Any ideas?

Cheers,

Kaushalya

  • Hello,

    Can you please try to capture a sniffer trace using the nRF Sniffer for 802.15.4?

    What do you mean by "swinging back and forth between two hosts"? Do you mean that it (an End Device, I assume) keeps changing between two routers?

    Does it ever re-enter the network, or does it disconnect completely?

    Do the nodes move around (physically)? Or are the nodes more or less stationary?

    Best regards,

    Edvin

  • kaushalyasat said:
    I was referring to the error logs as marked above. I don't see any of these errors in my case. So I assume that 'coap_send_request ()' executes without any error. Am I correct?

    If you see error messages printed from that file in general, yes. You can test this by adding "LOG_ERR("Test");" to see if these error messages are visible in the log at all.

    kaushalyasat said:
    I don't think this is the case, as I can see this log message continuously from a disconnected SED.

    Where are those from? What triggers these?

    It would be more interesting to continuously see the return value from coap_send_message(), or whatever function you use to send, around the time of the disconnection.

    Try adding prints of the return value of the function that doesn't work (regardless of whether it is 0 or something else). 

    Does it:

    1: Print that it returns 0 even though it is disconnected?

    2: Stop printing altogether?

    3: Print that it returns something else than 0?

    BR,

    Edvin

  • If you see error messages printed from that file in general, yes. You can test this by adding "LOG_ERR("Test");" to see if these error messages are visible in the log at all.

    Yes I can see the LOG_ERR, from both my application level and also from coap_send_request(). I tested it from console and also RTT viewer.

    Where are those from? What triggers these?

    This log message is printed by send_sensor_update () in coap_client_utils.c, just before calling coap_send_request (). So we know the flow is working up to that point.

    It would be more interesting to continuously see the return value from coap_send_message()

    Agreed. Unfortunately fw Rev 1.1.1.0, which I sent to you first, doesn't have that - my bad. The latest fw prints it, and we have also implemented a noinit memory section where we keep the last value returned by coap_send_request(). In this section we also maintain counters for failed and successful tx. So far we haven't seen any failures, but again it might take months before that happens.
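A minimal sketch of such a persistent tx record, assuming a layout that is not taken from the actual firmware: the struct, the magic value, and the function names are all hypothetical. On target the struct would live in a noinit section (for example via Zephyr's `__noinit` attribute) so that it survives a warm reset; the magic word is there because noinit RAM holds garbage after a cold boot. A negative return value is counted as a failure, anything else as handed to the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker: present only after the record has been initialised. */
#define TX_STATS_MAGIC 0x54585354u

struct tx_stats {
	uint32_t magic;      /* TX_STATS_MAGIC when the contents are valid */
	uint32_t ok_count;   /* sends that returned >= 0 */
	uint32_t fail_count; /* sends that returned < 0 */
	int32_t last_ret;    /* most recent return value */
};

/* Reset the record only if it does not carry a valid magic word,
 * so counters from before a warm reset are preserved.
 */
void tx_stats_init(struct tx_stats *s)
{
	assert(s != NULL);
	if (s->magic != TX_STATS_MAGIC) {
		s->magic = TX_STATS_MAGIC;
		s->ok_count = 0;
		s->fail_count = 0;
		s->last_ret = 0;
	}
}

/* Record the value returned by the send function. */
void tx_stats_record(struct tx_stats *s, int ret)
{
	assert(s != NULL);
	s->last_ret = ret;
	if (ret < 0) {
		s->fail_count++;
	} else {
		s->ok_count++;
	}
}
```

Note that a non-negative return value only says the packet was accepted by the local stack, not that the destination received it, so `ok_count` can keep growing while the data is being lost further along the path.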

    Try adding prints of the return value of the function that doesn't work (regardless of whether it is 0 or something else). 

    It is done in the latest fw. We are waiting for a sensor to go into this mode again. Currently we get 47 as the return value, which I think is the number of bytes sent(?)

    Print that it returns 0 even though it is disconnected?

    It prints whatever is returned from coap_send_request(), as follows. We haven't seen it return 0, as there is always a network to connect to in the lab. Also, if it is not connected, send_sensor_update () wouldn't get called.

    int ret;
    .
    .
    .
    ret = coap_send_request(COAP_METHOD_PUT, (const struct sockaddr *)&unique_local_addr, sensor_option, payload, sizeof(payload), NULL);
    LOG_INF ("ZS %d, RSSI %d, LQI %d, LQO %d, FW %04x RET: %d, RLOC: %04x", the_sensor_device->zoneState, RSSI, linkQalIn, linkQualOut, FWRevNum, ret, rloc);

    Stop printing altogether?

    Do you mean that the printing suddenly stops without any reason? We haven't seen anything like that.

    Print that it returns something else than 0?

    Yeah, we always see 47 so far.

  • kaushalyasat said:
    Yeah, we always see 47 so far.

    Ok, so even when you do not see that the messages are being received, you can see that it prints the return value (+)47, which is the length of the packet that you are sending?

    kaushalyasat said:
    Yes I can see the LOG_ERR, from both my application level and also from coap_send_request(). I tested it from console and also RTT viewer.

    Ok, good. That means that there is not some sort of config that disables that logging instance.

    kaushalyasat said:
    In this section we also maintain counters for failed and successful tx. So far we haven't seen any failures, but again it might take months before that happens.

    So this means that so far you have not confirmed that the tx function returns 47 while the issue is ongoing? Or have you confirmed this?

    Best regards,

    Edvin

  • Hi Edvin,

    Ok, so even when you do not see that the messages are being received, you can see that it prints the return value (+)47, which is the length of the packet that you are sending?

    Yes. Yesterday we had the first sensor fall off 'with child supervision'. I couldn't see any messages regarding child supervision though in these sensors. But I can verify the following.

    1. The sensors are transmitting and the data is sent to the parent successfully. After the parent, I couldn't trace the packets any further, as they cannot be decrypted.

    2. The intended destination doesn't receive these packets at the application level.

    So I think the issue is in losing an FTD to FTD (router to router) connection in multi-hop scenarios.

    So this means that so far you have not confirmed that the tx function returns 47 while the issue is ongoing? Or have you confirmed this?

    I can confirm that 47 is returned while a sensor, or more precisely the system, is in this state. As I mentioned, I don't think this is related to the sensor (SED). This may well be an issue in the router to router hop.

    I am now researching how MLE works. What happens when a router loses its connection with another router? I think it will try to find another path. Now my question is why a path cannot be found to the destination router/leader when one was established earlier. The only change that would happen is that some routers may be power cycled or powered off. But I have verified that the sensors could connect to the destination even with all the other routers powered down.

    Cheers,

    Kaushalya

  • Hello Kaushalya,

    So in theory, if we take out the child sensors, and the two routers send messages back and forth, we could see the same behavior? Do you agree? 

    Do you have a state changed callback in your parent's application?

    If you look at e.g. the ncs\nrf\samples\openthread\coap_server sample, in coap_server.c, you can see the on_thread_state_changed() callback.

    Does that trigger when the nodes fall out?

    BR,

    Edvin

  • Hi Edvin,

    So in theory, if we take out the child sensors, and the two routers send messages back and forth, we could see the same behavior? Do you agree? 

    Do you mean shutting down the child sensor? But if we remove the child sensor, we lose the origin of the data packets. After that the parent and the next routers wouldn't get any data packets anyway. So the behavior is not the same, is it, though the end result is the same? Moreover, in this instance, the child sensor is successfully sending the packets.

    Do you have a state changed callback in your parent's application?

    Yes we have, as our application is derived from the same example. This is what we have.

    static void on_thread_state_changed(uint32_t flags, void *context)
    {
    	struct openthread_context *ot_context = context;
    
    	if (flags & OT_CHANGED_THREAD_ROLE) {
    		switch (otThreadGetDeviceRole(ot_context->instance)) {
    		case OT_DEVICE_ROLE_CHILD:
    		case OT_DEVICE_ROLE_ROUTER:
    		case OT_DEVICE_ROLE_LEADER:
    			dk_set_led_on(OT_CONNECTION_LED);
    			break;
    
    		case OT_DEVICE_ROLE_DISABLED:
    		case OT_DEVICE_ROLE_DETACHED:
    		default:
    			dk_set_led_off(OT_CONNECTION_LED);
    			deactivate_provisionig();
    			break;
    		}
    	}
    }

    We don't do anything special here. Also, in the production hosts, we don't populate this LED.

    Does that trigger when the nodes fall out?

    I am afraid we don't know, as we only see this issue after it has happened. Are you suggesting a role change from 'router' to 'disabled' or similar? I will check the role of the host once I detect this again. I will populate the LED mentioned above as well.

    I am trying to understand what the MLE does in the following scenario.

    1. sensor  -> parent router -> destination router - good

    2. sensor -> parent router  -x- destination router - bad

    Q1. If both routers support child supervision, does the supervision packet originate from the destination router or the parent router?

    Q2. Sensor wouldn't know scenario 2 has happened?

    Q3. I guess when scenario 2 happens the parent router would try to find another path via MLE? In our case we have about 10 routers all over the lab. One or two may be power cycled or shut down, but not all, and they are never moved from one place to another. Also, when I shut down all the other routers, the sensors all connected back to the destination router. That proves there was at least one path to the parent router. (These sensors were about 10-50 cm away from the parent router.)

    Cheers,

    Kaushalya


  • kaushalyasat said:
    But if we remove the child sensor, we lose the origin of the data packets. After that the parent and the next routers wouldn't get any data packets anyway. So the behavior is not the same, is it, though the end result is the same?

    Of course, but I am saying that if you were to generate dummy data at the parent and send it to the destination, in theory we could see the same, because the packets are lost between the parent router and the destination router. I am just thinking of possible ways to easier reproduce the issue.

    kaushalyasat said:
    We don't do anything special here. Also, in the production hosts, we don't populate this LED.

    Is it possible to add some logging here, to see if there is a state change? It doesn't necessarily change to detached, so printing the actual state change can be helpful, if this is the case.
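A minimal sketch of how the actual role transition could be turned into a log line, under the assumption that the previous role is remembered between callbacks. The enum is a local stand-in that mirrors the numeric order of OpenThread's otDeviceRole so the sketch builds without the OpenThread headers; the `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Local stand-in for otDeviceRole (same numeric order), for illustration only. */
enum sketch_role {
	SKETCH_ROLE_DISABLED = 0,
	SKETCH_ROLE_DETACHED = 1,
	SKETCH_ROLE_CHILD = 2,
	SKETCH_ROLE_ROUTER = 3,
	SKETCH_ROLE_LEADER = 4,
};

/* Map a role to a short printable name. */
const char *sketch_role_name(enum sketch_role role)
{
	switch (role) {
	case SKETCH_ROLE_DISABLED: return "disabled";
	case SKETCH_ROLE_DETACHED: return "detached";
	case SKETCH_ROLE_CHILD:    return "child";
	case SKETCH_ROLE_ROUTER:   return "router";
	case SKETCH_ROLE_LEADER:   return "leader";
	}
	return "unknown";
}

/* Write "role: <old> -> <new>" into buf.
 * Returns 1 if the role changed, 0 if it did not (buf is then empty).
 */
int sketch_format_role_change(enum sketch_role prev, enum sketch_role now,
			      char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	if (prev == now) {
		buf[0] = '\0';
		return 0;
	}
	snprintf(buf, len, "role: %s -> %s",
		 sketch_role_name(prev), sketch_role_name(now));
	return 1;
}
```

In the state changed callback, the previous role could be kept in a static variable and the formatted string passed to LOG_INF whenever the function returns 1. That way a router to child transition shows up in the log, not only a change to detached.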

    kaushalyasat said:
    Q1. If both routers support child supervision, does the supervision packet originate from the destination router or the parent router?

    I don't understand this question. And I don't understand how child supervision is relevant if the packet is lost between two nodes that are not children.

    Q2: Depends. Is it a message that is supposed to be acknowledged or not? It wouldn't know that the scenario 2 happened, but it can detect a missing acknowledgement. The routers should be able to detect it, though. Because all messages should be acknowledged in the 802.15.4 layer.

    Q3: Could it be that the packet path between the source and destination was through a router that was suddenly powered off? I am not 100% sure of the details, but OpenThread is not known for being very fast at determining that nodes are gone and enabling new routes.

    Best regards,

    Edvin

  • Hi Edvin,

    Is it possible to add some logging here, to see if there is a state change? It doesn't necessarily change to detached, so printing the actual state change can be helpful, if this is the case.

    I saw something interesting a couple of days ago with an FTD on my desk. Normally when it powers up it connects to the network and changes its state to router/leader. But this time it stayed on as a child! Now, if a router that is actively routing packets to another router suddenly changes state to a child, I guess that could explain the behavior we are discussing here.

    I guess an FTD will remain a child as long as no child connects to it, am I correct? On the other hand, I have seen many instances where a router's child table is empty but it remains a router. What conditions determine the transition from child to router and vice versa?

    I don't understand this question. And I don't understand how child supervision is relevant if the packet is lost between two nodes that are not children.

    Sorry, what I was trying to get at is that we implemented child supervision to prevent an SED from disconnecting from a destination router, but maybe it is not a solution here.

    Is it a message that is supposed to be acknowledged or not?

    I can see that in the 'coap_send_request (...)' function, the type is hard coded to non-confirmable (COAP_TYPE_NON_CON). I don't know if there is a way to request confirmation using any API.

    Suppose I clone this 'coap_send_request (...)' and modify it to send a confirmable packet, then I would need to pass a callback with it to know whether the ack was received or not?

    Could it be that the packet path between the source and destination was through a router that was suddenly powered off?

    This is possible. Our understanding is that even if it takes time to renegotiate new paths, it should be ok, as we are sending temperature info every 30 sec. Even if the reroute takes a couple of minutes, it should be ok. But the issue is that once the sensor's (SED) data is lost, it remains so for days.

    Cheers,

    Kaushalya

  • Hi Edvin,

    I don't know if this is relevant for this issue, but today I saw a parent router going into some error state. I was provisioning and un-provisioning an SED from an FTD for a new fw feature, and all of a sudden I noticed that the prov reply sent by the FTD did not reach the SED. When I looked into the Wireshark log, I could see the parent router not sending the reply back. Also, when I had a look at TTM, it showed this parent router grayed out and the ext address field was empty. After a while it 'auto-healed' by itself and my provisioning packets are now received by the SED. I wonder if this could be the error state the router ends up in when we can't get SED data.

    What do you think? 

    Following is a pcap of when the provisioning was working and then stopped. The provisioning router is 0xb000, the provisioning SED is 0x9423 and the parent router is 0x9400.


    May-30-prov success and fail.pcapng

    I have also created a separate thread to show the TTM findings. An OpenThread router shown as grayed out in TTM  

    Cheers,

    Kaushalya

  • Hello Kaushalya,

    kaushalyasat said:
    Normally when it powers up it connects to the network and changes its state to router/leader. But this time it stayed on as a child!

    As you can see in the OpenThread documentation, the network will strive to keep the number of routers below 15, if it is feasible. This is in order to be able to expand the network in any direction (physically) when needed. Therefore, if there are no devices that need to attach to your new device, and you already have a large number of routers, there is no need to promote that device to a router.

    kaushalyasat said:
    I guess an FTD will remain a child as long as no child connects to it, am I correct?

    Yes, as long as there is a decent coverage of routers in the area of the new child. 

    kaushalyasat said:

    I can see that in the 'coap_send_request (...)' function, the type is hard coded to non-confirmable (COAP_TYPE_NON_CON). I don't know if there is a way to request confirmation using any API.

    Suppose I clone this 'coap_send_request (...)' and modify it to send a confirmable packet, then I would need to pass a callback with it to know whether the ack was received or not?

    If there are too few routers in the network, it will promote some of them to routers, so that the network is ready to accept new child nodes.

    kaushalyasat said:
    This is possible. Our understanding is that even if it takes time to renegotiate new paths, it should be ok, as we are sending temperature info every 30 sec. Even if the reroute takes a couple of minutes, it should be ok. But the issue is that once the sensor's (SED) data is lost, it remains so for days.

    Child supervision will not prevent nodes from dropping out of the network. It will just make it easier to detect whether it has happened or not.

    ...

    I am surprised that it doesn't take into account whether you provide a callback function or not, and that you cannot specify whether the coap_send_request() should be acked or not. But you are right, you can clone this function and tell it to use COAP_TYPE_CON. Then you can also provide a coap_reply_t reply_cb, which will be called when the reply arrives. I believe the node will automatically retransmit messages that are not acked, but you can use this while debugging.
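One way to use the reply callback while debugging is to count consecutive requests that never got a reply, so the sensor itself can notice that its data has stopped arriving end to end. The following is only a sketch of that counting logic: the struct and function names are hypothetical, the threshold is an arbitrary choice, and what to do once it trips (log, store a flag, reattach) is left to the application.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Tracks consecutive confirmable requests that got no reply. */
struct reply_watchdog {
	uint32_t misses;    /* consecutive requests without a reply */
	uint32_t threshold; /* misses at which the watchdog trips */
};

void reply_watchdog_init(struct reply_watchdog *w, uint32_t threshold)
{
	assert(w != NULL && threshold > 0);
	w->misses = 0;
	w->threshold = threshold;
}

/* Call from the reply callback: any reply clears the miss counter. */
void reply_watchdog_on_reply(struct reply_watchdog *w)
{
	assert(w != NULL);
	w->misses = 0;
}

/* Call when a request has given up waiting for its reply.
 * Returns true once the threshold of consecutive misses is reached.
 */
bool reply_watchdog_on_timeout(struct reply_watchdog *w)
{
	assert(w != NULL);
	if (w->misses < UINT32_MAX) {
		w->misses++;
	}
	return w->misses >= w->threshold;
}
```

With a report every 30 seconds, a threshold of, say, 10 would flag the condition after about five minutes of silence, which is far shorter than the days of lost data described in this thread.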

    ...

    I agree that it shouldn't happen. If the connection to the parent is lost (because it is powered off), then the child should try to reconnect to the network through some other node. 

    kaushalyasat said:
    After a while it 'auto-healed' by itself and my provisioning packets are now received by the SED.

    I would say that this is the expected behavior (from all other than the router that suddenly disconnected), don't you agree? 

    You don't happen to have any logs from the router that suddenly greyed out?

    Best regards,

    Edvin
