Mesh back and forth seems to break connection

Hi,

We have one customer having two CoAP hosts and some CoAP clients in the form of wireless sensors. The sensors are paired to a single host. The pairing is actually in the app level, where the sensor discovers the network IP of the host in pairing host. All the devices have the same PANID and network key.

Recently we have seen a scenario where some sensors seemingly stopped communication with the paired host. By looking at the RSSI graphs, we thought is this caused by a sensor constantly swinging back and forth between two hosts (one host acting as a router). We dont have access to the CLI interface of the hosts as this is a remote site. We see the Sensor RSSI reported back. This is its RSSI with the router/leader immediately connected to at the time.  

Any ideas?

Cheers,

Kaushalya

Parents
  • Hello,

    Can you please try to capture a sniffer trace using the nRF Sniffer for 802154?

    What do you mean by "swinging back and forth between two hosts"? Do you mean that it (an End Device, I assume) keeps changing between two routers?

    Does it ever re-enter the network, or does it disconnect completely?

    Do the nodes move around (physically)? Or are the nodes more or less stationary?

    Best regards,

    Edvin

  • Hi Edvin,

    I am trying to implement Child supervision in hope that will help. Unfortunately the documentation is not very clear to me. It is mentioned to enable 'OPENTHREAD_CONFIG_CHILD_SUPERVISION_ENABLE', which is defined in child_supervision.h in SDK. I dont think changing SDK content is a good idea as if the SDK is updated, we will loose all that. 

    1. Is there a way to implement child supervision from application side?

    2. What is to be done from FTD side?

    3. What is to be done from SED side?

    Thanks,

    Kaushalya

  • Hi Edvin,

    We have WDT implemented in the client, so unlikely the sensor crashes. The sensors as they are battery powered devices, have the console disabled. I think we can have special build with consoles enabled just for this.

    Thanks,

    Kaushalya

  • kaushalyasat said:
    I dont think changing SDK content is a good idea as if the SDK is updated, we will loose all that. 

    Most of these are only defined if they are not defined elsewhere:

    /**
     * @def OPENTHREAD_CONFIG_CHILD_SUPERVISION_ENABLE
     *
     * Define to 1 to enable Child Supervision support.
     *
     */
    #ifndef OPENTHREAD_CONFIG_CHILD_SUPERVISION_ENABLE
    #define OPENTHREAD_CONFIG_CHILD_SUPERVISION_ENABLE 0
    #endif
    

    So if you define it somewhere in your project, outside the SDK, that should take precedence. 

    kaushalyasat said:

    2. What is to be done from FTD side?

    3. What is to be done from SED side?

    Do you have logging enabled on any of them?

    kaushalyasat said:
    I think we can have special build with consoles enabled just for this.

    I think so. We need to investigate what's going on. The logs could reveal something.At this point, all we know is that the devices "disappear". Perhaps they are stuck in some weird state / state machine?

    Best regards,

    Edvin

  • Hi Edvin,

    Ok I got your point in Q1. I was not sure the order the #defines are evaluated.

    I have logging available in both SED and FTD builds.

    I had to enable child supervision in Kconfig first.

    I have done this in SED

    #define OPENTHREAD_CONFIG_CHILD_SUPERVISION_ENABLE 1
    #define OPENTHREAD_CONFIG_CHILD_SUPERVISION_CHECK_TIMEOUT 60
    #define OPENTHREAD_CONFIG_CHILD_SUPERVISION_MSG_NO_ACK_REQUEST 0
    and this in the FTD
    #define OPENTHREAD_CONFIG_CHILD_SUPERVISION_ENABLE 1    
    #define OPENTHREAD_CONFIG_CHILD_SUPERVISION_INTERVAL 30
    #define OPENTHREAD_CONFIG_CHILD_SUPERVISION_MSG_NO_ACK_REQUEST 0
    My idea is FTD will send supervision data every 30 sec and SED will initiate a MLE attach if it doesn't get
    this in 60 sec. Am I correct?
    Is there a way for me to drop a child from FTD's child table using CLI? This is for me to check if the MLE reattach works from SED side.
    Results:
    1. When I check "childsupervision checktimeout" from SED, I still get 190, which is default in child_supervision.h. It seems redefinition in my app has no effect. 
    2. I can see the SED in child table of FTD.
    3. I dont see "Sending supervision message to child" from FTD console. 
    4. I dont see any supervision messages from SED.
    Cheers,
    Kaushalya
  • When I set childsupervision checktimeout to 10 in SED, I see "<wrn> [W] ChildSupervsn-: Supervision timeout. No frame from parent in 10 sec". But cant see any MLE reattachment.

    At this time the childsupervision interval in FTD is 129.

    Then I set childsupervision interval in FTD to 5, hoping this would stop the previous message in SED, but it didnt.

    Even after I set it back to its original 129 value, the SED keep on sending the same message.

    I am really puzzled what is going on here.

Reply
  • When I set childsupervision checktimeout to 10 in SED, I see "<wrn> [W] ChildSupervsn-: Supervision timeout. No frame from parent in 10 sec". But cant see any MLE reattachment.

    At this time the childsupervision interval in FTD is 129.

    Then I set childsupervision interval in FTD to 5, hoping this would stop the previous message in SED, but it didnt.

    Even after I set it back to its original 129 value, the SED keep on sending the same message.

    I am really puzzled what is going on here.

Children
  • kaushalyasat said:

    Ok I got your point in Q1. I was not sure the order the #defines are evaluated.

    I agree that this is confusing, and what if it says this in several files? It is not easy to know the order that the files are included. So I do indeed see this confusion. To be sure, I suggest that you print out the value of OPENTHREAD_CONFIG_CHILD_SUPERVISION_ENABLE to your log, to ensure that it is in fact set to 1. Since this will be set during compile time, whatever is written to your log should be valid throughout the application. 

    kaushalyasat said:
    My idea is FTD will send supervision data every 30 sec and SED will initiate a MLE attach if it doesn't get
    this in 60 sec. Am I correct?

    That is correct. Please see:

    https://openthread.io/guides/build/features/child-supervision

    However, setting these alone doesn't ensure that the SED wakes up. 

    CONFIG_OPENTHREAD_POLL_PERIOD decides how often the SED will wake up to listen for packets from it's parent. I am not sure whether you changed it, but I see that the default value for this is 30 seconds in some of our samples (overlay-mtd.conf). 

    You may want to set your OPENTHREAD_CONFIG_CHILD_SUPERVISION_CHECK_TIMEOUT to a little more than 60 seconds, e.g. 70. This is to allow for a second attempt in the case where the initial child supervision interval message reach the child. Then it has a second attempt before deciding to leave the network.

    kaushalyasat said:

    When I set childsupervision checktimeout to 10 in SED, I see "<wrn> [W] ChildSupervsn-: Supervision timeout. No frame from parent in 10 sec". But cant see any MLE reattachment.

    At this time the childsupervision interval in FTD is 129.

    Then I set childsupervision interval in FTD to 5, hoping this would stop the previous message in SED, but it didnt.

    Even after I set it back to its original 129 value, the SED keep on sending the same message.

    I am really puzzled what is going on here.

    I believe this is all caused because the poll period is too long. Even if you set the child supervision interval and check timeout to 5 and 10 seconds, the poll period for the child is still probably 30 or 236 seconds unless you specified it to be something else, and in that case, the child will not wake up until these intervals have passed. 

    Best regards,

    Edvin

  • Hi Edvin,

    My poll period is 10 sec, as set in overlay_mtd.conf. The child supervision interval in FTD is 30 and checkout time in SED were #defined to 60, but when I check it from CLI, it was the default 129 and 190. Even after I print these values to the log messages, my #define values seems to get overwritten. So I used the API functions otChildSupervisionSetCheckTimeout() in SED and otChildSupervisionSetInterval() in FTD. After this, I can see the values are 30 and 60 as I expected. I think this part is solved. Now when I set the interval to like 70, I get the "ChildSupervsn-: Supervision timeout. No frame from parent in 60 sec" from SED. When interval is set to 30, I see "ChildSupervsn-: restart timer" from SED. So apparently child supervision mechanism is functioning. 

    Also I enabled CONFIG_OPENTHREAD_LOG_LEVEL_INFO=y in prj.conf of the FTD and I can see "ChildSupervsn-: sec sinse ast sup [number]" regularly from FTD. What is this message?

    Now the only thing is I still cant see the full effect of child supervision as I cant manually drop a SED from the child table. I can hide the SED in a Faraday pouch until it FTD drops it from its child table, but when I take the SED out, it connects back even without the child supervision feature, after backoff periods. Even after I increase supervision interval from the FTD, the SED says its supervision timeouts, but I cant see any console messages indicating MLE reconnect. Can this be seen from a wireshark session? What should I look for?

    Thanks,

    Kay

  • kaushalyasat said:
    Also I enabled CONFIG_OPENTHREAD_LOG_LEVEL_INFO=y in prj.conf of the FTD and I can see "ChildSupervsn-: sec sinse ast sup [number]" regularly from FTD. What is this message?

    I would assume that this is the time since the last supervision message. Does that seem to be correct, according to your supervision interval?

    kaushalyasat said:
    Now the only thing is I still cant see the full effect of child supervision as I cant manually drop a SED from the child table. I can hide the SED in a Faraday pouch until it FTD drops it from its child table,

    Or you can power off, or halt one of the devices, the parent  or the child, in case this is practically easier to do. 

    kaushalyasat said:
    Now the only thing is I still cant see the full effect of child supervision

    I am not sure I understand you correctly here. Do you expect the child to be completely removed from the network? Because I don't think that this is the case. I believe that this is intended to keep track of what devices that are active in your network. If it drops out due to child supervision timeout, then it no longer makes sense to send messages to that device. But you wouldn't typically want a device that runs out of battery to actually drop out of the network, so that you would need to re-commission it when you change the batteries. Or in case of a power outage, you don't want to need to set up the entire network from scratch when the power comes back on. 

    Best regards,

    Edvin

  • Hi Edvin,

    Now it seems correctly working based on console messages and their timing. 

    On SED console

    [00:00:36.580,535] <inf> [N] ChildSupervsn-: restart timer
    [00:01:06.582,611] <inf> [N] ChildSupervsn-: restart timer
    [00:01:36.582,122] <inf> [N] ChildSupervsn-: restart timer

    On FTD console

    [01:11:59.735,687] <inf> [I] ChildSupervsn-: sec sinse ast sup 28
    [01:12:00.731,903] <inf> [I] ChildSupervsn-: sec sinse ast sup 29
    [01:12:01.736,145] <inf> [I] ChildSupervsn-: Sending supervision message to child 0xcc01
    [01:12:02.740,173] <inf> [I] ChildSupervsn-: sec sinse ast sup 1
    [01:12:03.738,311] <inf> [I] ChildSupervsn-: sec sinse ast sup 2

    This seems FTD counts supervision interval every 1Sec and every 30 sec it sends supervision packet t to the child? (I have set the interval to 30 sec in FTD and timeout to 60 in SED.) I think in this instance the child supervision is functioning.

    "Now the only thing is I still cant see the full effect of child supervision" What I meant was I cant see the child supervision bringing back a SED from this 'fallen off' state. For me to verify the effect of child supervision, I have to see a child which has dropped off the child table, rejoining the network. 

    What I have seen when this drop off issue comes up in the field, is that the child has dropped from the Leaders child table. I don't think the child stopped sending the poll packet for longer than 240 sec. There were no other routers for the child to jump to as well. So I have the feeling this may be some Thread related issue. 

    Cheers,

    Kaushalya

  • Even without child supervision, the SEDs reattach to the network and starts sending data to the paired FTD. Like if I take the SED out of reach and back I can see the disconnection reconnection. So it doesn't seem in this rare problem, the SEDs actually disconnect from the network from FTD's side but seemingly connected from SEDs side. 

Related