Mesh back and forth seems to break connection

Hi,

We have one customer with two CoAP hosts and some CoAP clients in the form of wireless sensors. The sensors are paired to a single host. The pairing is done at the application level, where the sensor discovers the network IP of the host during pairing. All the devices have the same PAN ID and network key.

Recently we have seen a scenario where some sensors seemingly stopped communicating with their paired host. Looking at the RSSI graphs, we wondered whether this is caused by a sensor constantly swinging back and forth between the two hosts (one host acting as a router). We don't have access to the CLI of the hosts, as this is a remote site. We do see the RSSI reported back by the sensor; this is its RSSI towards the router/leader it is directly connected to at the time.

Any ideas?

Cheers,

Kaushalya

  • Hello,

    Can you please try to capture a sniffer trace using the nRF Sniffer for 802.15.4?

    What do you mean by "swinging back and forth between two hosts"? Do you mean that it (an End Device, I assume) keeps changing between two routers?

    Does it ever re-enter the network, or does it disconnect completely?

    Do the nodes move around (physically)? Or are the nodes more or less stationary?

    Best regards,

    Edvin

  • kaushalyasat said:

    From what we have seen, it's not the node dropping off. The nodes (SEDs) are sending their data out to the associated parent. I think the issue is that the parent suddenly gets into some error state and stops forwarding these packets.

    But the supervision messages are only between the child and its parent, so this message is not routed to any other routers than the parent itself. Right?

    kaushalyasat said:

    What puzzles me is how long it took for the problematic router to heal. Also, this 'healing' happened when I connected to the console, so I don't know if that had any effect. As this is a rare event, it is very difficult to diagnose in depth.

    How long did it take?

    kaushalyasat said:

    This is what I saw from the log while I was connecting to router 0x9400.

    So this is the log from the router. When did it crash? (timestamp?)

    kaushalyasat said:
    I guess if I wait 240 sec timeout, and try thread stop/start, I might get connected to a new router. What do you think?

    I would assume it does. But I guess the main issue here is that this is not happening at all times, right?

  • Hi Edvin,

    Sorry about my delay, I was unwell for the past couple of days and am only now back at work.

    But the supervision messages are only between the child and its parent, so this message is not routed to any other routers than the parent itself. Right?

    No, the message is sent to another router. The parent is just routing it through.

    I think we have identified a potential pitfall in OpenThread networks. That is partitioning. We think our failure mechanism can be explained by that.

    Think of two routers and a set of sensors (SEDs). If both routers are in the same partition, there are no issues. But if, for whatever reason, one router loses its connection to the other, there will be two network partitions. The problem occurs when the sensors can still reach both routers. If a sensor that is sending data through router1 suddenly loses its connection and attaches to router2, the data path is broken, but the sensor will happily stay on that connection until another MLE attach happens to bring it back to router1. The issue is that, for the MLE protocol, every router in range is a potential parent, but this may not work for application data.

    It looks like we can't constrain the MLE process to limit the parent search to the sensor's own partition. We can implement a two-way handshake at the application level and initiate another MLE attach from the sensor, but there is no guarantee that it will not again choose a parent in a different partition.
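
    The application-level handshake idea can be sketched roughly as below. This is only a minimal illustration of the logic, not firmware code: `HandshakeWatchdog`, `request_reattach` and the threshold of 3 missed acknowledgements are all assumptions made for the example, and the hook would have to wrap whatever call the sensor firmware really uses to start a new attach.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HandshakeWatchdog:
    """Counts consecutive unacknowledged reports and asks for a re-attach.

    `request_reattach` is a hypothetical hook; on a real SED it would wrap
    the firmware call that starts a new MLE attach.
    """
    request_reattach: Callable[[], None]
    max_missed: int = 3          # assumed threshold, tune per deployment
    missed: int = 0              # consecutive reports without a host ack
    reattach_requests: int = 0   # how often a re-attach was requested

    def on_report_result(self, acked: bool) -> None:
        """Call once per sensor report with the outcome of the handshake."""
        if acked:
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= self.max_missed:
            # Data path looks broken end to end: ask the stack to re-attach.
            self.missed = 0
            self.reattach_requests += 1
            self.request_reattach()
```

    As noted above, a re-attach triggered this way may still land on a parent in the wrong partition, so `reattach_requests` is worth reporting as a diagnostic rather than treating the watchdog as a fix.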

    So this is the log from the router. When did it crash? (timestamp?)

    Sorry Edvin, as this happened when we were not at the office, we don't know exactly what time it happened. So it may be very difficult to extract it from the log.

    Cheers,

    Kaushalya

  • Hello Kaushalya,

    Interesting. If the parents split into two different networks, then there will exist two different networks with the same credentials, not able to communicate with one another. And it is not possible to connect the two networks using a SED device.

    The solution is "simple". If the two partitions of the network can't reach one another, you need to add another router that can reach both partitions (so that they are no longer separate partitions).

    I believe this is where we started this conversation. I remember asking for state changes to see if a router became leader (which it will if part of the network splits out into its own partition).

    Is it an option to add another router making sure the network doesn't split into two partitions?

    Best regards,

    Edvin

  • Hi Edvin,

    The issue we see here is actually not partitioning, but that the SEDs can't seem to proactively detach from a router and do an MLE reattach to another. There is no guarantee that the SED will not attach to the same router again. From the Thread standpoint this may be a perfectly valid scenario, but for us it's a killer, a non-recoverable error.

    Does the 'Search for a better parent' feature pick up a new router?

    Now we are implementing two solutions.

    1. Make the partition ID available in our dashboard for each router. At the commissioning stage we make sure all routers form a single partition. If not, we introduce additional routers until we get a single partition.

    2. Instead of a single big network, we implement one network per main router. We have one main router in an installation, and if we need to bring in more, we commission them with the main router so that they all have the same network credentials as the main router. The SEDs will be commissioned in the same way. At the point of commissioning we will have a single contiguous network per installation. This way, meshing will be limited to that network.

    The only other issue this doesn't solve is one case a test user reported, where he says he never had more than one router in his system, yet he saw his sensors disconnecting. As this is the only such case reported, we will park this scenario until we have more conclusive data.

    What do you think?

    Cheers,

    Kaushalya

    If you don't have a stable Thread network, I am not sure whether OpenThread is really the best protocol, but I guess it is a bit late to change this now?

    1: I think that sounds like a good idea, as it would make sure that your entire network is in one partition. However, it seems like this is not 100% going to work, since most of the time, you get all messages as expected, but after some months, the network suddenly splits. Perhaps caused by some added radio noise in the area. This means that this may not be detected during installation. But it could perhaps be used to detect the partitioning. I don't know whether both partitions have internet access, and hence, whether they can report that they are on a separate partition. You are more familiar with the setup, and can say whether this would work or not.

    2: Are you saying that you have two networks that each router will be part of? One small network between itself and the sensors, and one other "big" network with all the routers and the gateway? I am not sure how this would solve it if the big network struggles with partitioning. Then you will still not be able to reach all the sensors from the small networks.

    If the issue is that there is no 100% stable route between the sensor and the gateway, then adding more routers seems like the way to go. The key is to understand when this is the case, and your proposal #1 would perhaps help with this?

    Best regards,

    Edvin

  • Are you saying that you have two networks that each router will be part of? One small network between itself and the sensors, and one other "big" network with all the routers and the gateway?

    Not exactly. We need one router per installation. An installation is something like a home. The sensors in that property are commissioned to that router. If certain sensors are out of reach, we provision a range extender (another router) to bridge the gap. All in all, the entire property has a single network.

    The neighboring property (say this is a multi-storey apartment building) will have another network of its own, as I mentioned before. So each property will have a small network of its own. This way a sensor in one property cannot mesh to a router in another property, which could lead to partitioning.

    A partition could still happen inside a property due to some radio disturbance later on. All the property owner needs to do is:

    1. Power cycle each sensor that has disconnected.

    2. Or, in the worst case, power cycle the router(s).

    Also, as we report the partition ID of each router to a dashboard, we can remotely detect partitioning inside a property, since all the routers that belong to a property are grouped together. Only the primary router will be connected to the internet, so if we can't see the other routers, that means they are partitioned.
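
    A rough sketch of that dashboard check, under the assumption that the dashboard holds, per property, the list of expected routers and the last partition ID each one reported (`None` if it has not been heard from). The function name and data model are made up for illustration.

```python
from typing import Dict, Iterable, Optional


def find_partition_problems(
    expected_routers: Iterable[str],
    reported_partition_ids: Dict[str, Optional[int]],
) -> Dict[str, object]:
    """Compare the routers expected in one property with what the dashboard sees.

    reported_partition_ids maps router name -> partition ID, or None when
    the router has not reported recently (hypothetical dashboard data model).
    """
    expected = list(expected_routers)
    # Routers we can no longer see through the primary router.
    missing = sorted(r for r in expected if reported_partition_ids.get(r) is None)
    # Distinct partition IDs among the routers that did report.
    ids = {reported_partition_ids.get(r) for r in expected}
    ids.discard(None)
    return {
        "missing_routers": missing,
        "partition_ids": sorted(ids),
        "partitioned": bool(missing) or len(ids) > 1,
    }
```

    For example, `find_partition_problems(["main", "ext1"], {"main": 1})` flags `ext1` as missing. A missing router could of course also simply be powered off, so this is a hint for follow-up rather than proof of partitioning.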

    I have had a discussion with the OpenThread Git group, and their opinion is that the partitioning and the SED behavior are in line with the OpenThread spec. But I think this behavior is not correct, as it can lead to non-recoverable communication failures in the network. Even if we detect partitioning in the SEDs, there is nothing an application-level program can do to connect to a different parent.

    The only way I can think of is for the routers to detect partitioning and do an individual network restart until a single partition is formed. If it is still not resolved after a couple of cycles, the primary router should report this to the user and the dashboard. Human intervention is then needed to commission another router to resolve it.
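
    The restart-then-escalate idea could look something like the sketch below. It only shows the control flow; `in_single_partition`, `restart_thread` and `report_failure` are hypothetical hooks the router firmware would have to provide, and the limit of 3 cycles is an assumed value.

```python
from typing import Callable


def heal_partition(
    in_single_partition: Callable[[], bool],
    restart_thread: Callable[[], None],
    report_failure: Callable[[int], None],
    max_cycles: int = 3,
) -> bool:
    """Restart the Thread interface up to max_cycles times, then escalate.

    All three callables are hypothetical hooks supplied by the router firmware.
    Returns True once a single partition is observed, False after escalating.
    """
    for _ in range(max_cycles):
        if in_single_partition():
            return True
        restart_thread()
    # Check once more after the last restart before giving up.
    if in_single_partition():
        return True
    # Still split: tell the user and the dashboard that a human is needed.
    report_failure(max_cycles)
    return False
```

    A real implementation would also need to wait between a restart and the next check so the network has time to settle; that delay is left out here.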

    Cheers,

    Kaushalya
