This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts
This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

ZBOSS accepts join after leave with no rejoin

We originally saw (and reported) this problem with nRF5 SDK 4.1.0.  We switched to nRF Connect SDK 1.6.0 which did not exhibit this problem.  Now we just tested with nRF Connect SDK 1.7.0 and the problem is back.  

We see this in the zboss release notes and am wondering if they're just going back and forth with an incorrect fix:

  • [ZBS-98] Rejoin after Leave+Rejoin at ZR is broken in r23 - maybe, in r22 also

Commissioning was NOT enabled at the time of the leave request.

The device all successfully left the network. 

In 1.6.0 we would see the devices try to join again but they would be rejected with the device authorization signal showing status 2.

In 1.7.0 we see that they are allowed to rejoin with the device authorization signal showing status 0, even though commissioning was not enabled.

Parents
  • Hmm, I see that another router is still reporting it's "permit join" flag as set even 15 minutes after commissioning ended.  I wonder if that's how the devices are rejoining after leaving.  

  • Hi,

    I do not see any cases related to the previous bug in your case history. If it was reported in a support ticket here on DevZone, do you know the case number or have a link to the case? This will make it easier for me to look this up internally. I have tried to look for relevant bugs, but the closest I could find so far is bugfix ZOI-60 in ZBOSS 07/14/2020 (ZBOSS 07/14/2020 release notes), but I do not know if this is what you are referring to:

    • [ZOI-60] Network permits new devices after device leave

    I will ask our developers internally to clarify regarding ZBS-98.

    Larry Martin said:
    I see that another router is still reporting it's "permit join" flag as set even 15 minutes after commissioning ended

    This could explain why they are able to join. However, 15 minutes is a long time for the permit join flag to still be activated. Are nodes joining the network during this time? A joining node will broadcast a Mgmt_Permit_Joining_req command with a PermitDuration field set to the minimum commissioning time, bdbcMinCommissioningTime. This will cause nodes that receive this command to reset their timer, such that the permit joining flag period will be extended. If not, have you made any changes to bdbcMinCommissioningTime?

    Best regards,

    Marte

  • I think it unlikely we'll be able to capture this in a sniffer log, at least in the near term.  As I mentioned above, this is a rare event.  I've only seen it once in the past two days even though I've been trying to recreate it.  It's much more likely that we get a zboss assertion as described in this ticket:  https://devzone.nordicsemi.com/f/nordic-q-a/79054/coordinator-assert-during-commission-of-24-devices

    With NCS 1.7.0 some of the file,line numbers have changed, but we still see these asserts when commissioning/leaving a large number of nodes too quickly.

  •  join-after-leave-at-time-1090-ish.pcapng

    I continued experimenting and was able to recreate the problem and capture it Wireshark.  The attached pcapng capture has a couple of commission/leave cycles.  The first cycle commissioned all 22 devices, and they all left when commanded.  The second cycle commissioned 19 of 22 devices.  The other 3 were still powered up, but did not successfully join.

    The interesting stuff seems to be happening between times t=808 and t=1090 in the wireshark capture. Here's what I think I see:

    While commissioning we get the last Device Announcement message (and associated zboss signal) at t=810.25.  (We think it likely that the zboss signal was sent at about 809.5 in response to the first announcement for that device, but can't correlate our logs that precisely.)

    We see subsequent Permit Join Request broadcasts from various nodes until about t=915.

    We think commissioning has ended at t=989 because it has been 180 seconds since the last ZB_ZDO_SIGNAL_DEVICE_ANNCE from zboss, at which time we start configuring the devices.

    I waited about 1.5 additional minutes and then started asking nodes to leave.  You can see the first leave request at t=1078.9 and the second at t=1084.3.

    Just after the second leave request I see a device association and key exchange sequence for addr=0xcc20 culminating in what appears to be a successful authorization at t=1090.44.

    Note that the first log sequence I posted yesterday had a delay of 5 minutes between the end of commissioning and the first leave request.

    I'm guessing that the Permit Join Requests following the last Device Announcement message kept commissioning active in the coordinator.  We had seen the recommendation to extend commissioning time in response to ZB_ZDO_SIGNAL_DEVICE_ANNCE and we are doing that, but it seems it might also be extended by the Permit Join Request messages.  If this is correct, how can we detect this condition and correctly track the commissioning state in the coordinator/network?

    Note that we generally do not see this extended commissioning behavior.  Could it be triggered by the 3 devices that had not yet joined the network?  (All our devices are endpoint+routers.)

  • Here's another capture showing the join after leave problem.  In this case I had added code to explicitly broadcast a permit joining request with the duration set to zero in hopes of avoiding the problem.  You can see this broadcast at time t=2376.6.  Then at t=2411.6 I start sending a series of leave requests, and at t=2424.3 you can see a node joining the network.

    This capture shows a PAN Identifier Conflict at t=2424.346, which seems suspicious.  Note that all devices were previously joined to this network, then asked to leave, and then on a subsequent commissioning all but 2 devices joined back.  Those 2 devices were left powered on when the problem occurred.

    Note that this capture starts with a series of commission and leave cycles that completed without problems.  I did have to reset the coordinator once because it was not sending beacons in response to beacon requests during 2 consecutive commissioning attempts (t-1005 to t=1305).


    join-after-leave-at-time-2440.pcapng

  • Hi,

    We are unable to reproduce this issue on our side. Could you please provide the steps to reproduce it?

    Additionally, can you please clarify the packet numbers where you see this issue, as we are unable to find the behavior described in your sniffer logs.

    Best regards,

    Marte

  • In the posts above I referenced packets by their Wireshark timestamp.  For example, when I referred to time t=1090.44 I was trying to point out the packet with a Wireshark Time field of 1090.44.  As shown in the image below, timestamp t=1090.44 corresponds to packet number 10729.

    As I mentioned above, this is rare and difficult to reproduce.  I have had the most luck reproducing it when I have a large number of router-capable nodes attached.  

    I have had most luck reproducing the problem when I first join all 22 devices to the network, then ask all 22 to leave, and then enable a single commissioning window.  In this second commissioning window I often find that not all 22 devices will join the network.  In the first example above, 3 devices did not join.  Then, after the commissioning window has expired, I ask devices to leave the network.  In some cases devices will re-join after leaving, as indicated in my previous posts.


    Our coordinator is built using NCS 1.7.0.  We have enabled legacy support for compatibility with older HA devices by calling zb_bdb_set_legacy_device_support(1).

    The 22 router-capable devices were built using the nRF5 SDK 4.1.0 zigbee_light_bulb example with a custom manufacturer and model ID.

    Please review both of my posts and attachements from yesterday, as these describe where in the logs I ask the devices to leave and when they re-join.  I also indicated other timestamps of packets that I thought were relevant and/or suspicious. 

    Thanks for your help.

Reply
  • In the posts above I referenced packets by their Wireshark timestamp.  For example, when I referred to time t=1090.44 I was trying to point out the packet with a Wireshark Time field of 1090.44.  As shown in the image below, timestamp t=1090.44 corresponds to packet number 10729.

    As I mentioned above, this is rare and difficult to reproduce.  I have had the most luck reproducing it when I have a large number of router-capable nodes attached.  

    I have had most luck reproducing the problem when I first join all 22 devices to the network, then ask all 22 to leave, and then enable a single commissioning window.  In this second commissioning window I often find that not all 22 devices will join the network.  In the first example above, 3 devices did not join.  Then, after the commissioning window has expired, I ask devices to leave the network.  In some cases devices will re-join after leaving, as indicated in my previous posts.


    Our coordinator is built using NCS 1.7.0.  We have enabled legacy support for compatibility with older HA devices by calling zb_bdb_set_legacy_device_support(1).

    The 22 router-capable devices were built using the nRF5 SDK 4.1.0 zigbee_light_bulb example with a custom manufacturer and model ID.

    Please review both of my posts and attachements from yesterday, as these describe where in the logs I ask the devices to leave and when they re-join.  I also indicated other timestamps of packets that I thought were relevant and/or suspicious. 

    Thanks for your help.

Children
  • Hi,

    If your routers are still based on the nRF5 SDK v4.1.0, then that might explain why this is happening, as the fix is only in nRF Connect SDK for now. 

    Could you please try this workaround on your routers and see if you are still seeing this issue?

    Add the following:

    #ifndef ZB_ED_ROLE
    void nwk_permit_timeout(zb_uint8_t param);
    /**@brief Close network for joining locally as fast as possible.
     *
     * This function is called as a workaround for issue described in KRKNWK-5255.
     *
     * @param[in] bufid   Reference to the Zigbee stack buffer.
    */
    static void close_network_locally(zb_bufid_t bufid)
    {
        if (bufid)
        {
            zb_nlme_permit_joining_request_t *req_param = ZB_BUF_GET_PARAM(bufid, zb_nlme_permit_joining_request_t);
            req_param->permit_duration = 0;
            zb_nlme_permit_joining_request(bufid);
            NRF_LOG_INFO("Zigbee network closed");
        }
        else
        {
            /* If it was impossible to get a free buffer immediately,
             * schedule closing the network.
             */
            zb_ret_t zb_err_code = zb_buf_get_out_delayed(close_network_locally);
            if (zb_err_code != RET_OK)
            {
                NRF_LOG_ERROR("Unable to get buffer or schedule network close operation.");
            }
        }
    }
    #endif

    Call the function from ZB_ZDO_SIGNAL_LEAVE_INDICATION signal handler:

    #ifndef ZB_ED_ROLE
                    /* Close Zigbee network locally due to KRKNWK-5255. */
                    close_network_locally(zb_buf_get_out());
                    zb_time_t timeout_bi;
                    if ((zb_schedule_get_alarm_time(nwk_permit_timeout, ZB_ALARM_ANY_PARAM, &timeout_bi) != RET_OK) ||
                        (timeout_bi == 0))
                    {
                        close_network_locally(zb_buf_get_out());
                    }
                    else
                    {
                        NRF_LOG_INFO("Unable to close Zigbee network - network is opened vie permit join command (scheduled: %d, timeout: %d).",
                            zb_schedule_get_alarm_time(nwk_permit_timeout, ZB_ALARM_ANY_PARAM, &timeout_bi),
                            timeout_bi);
                    }
    #endif

    Best regards,

    Marte

Related