This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts
This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

ZBOSS accepts join after leave with no rejoin

We originally saw (and reported) this problem with nRF5 SDK 4.1.0.  We switched to nRF Connect SDK 1.6.0 which did not exhibit this problem.  Now we just tested with nRF Connect SDK 1.7.0 and the problem is back.  

We see this in the zboss release notes and am wondering if they're just going back and forth with an incorrect fix:

  • [ZBS-98] Rejoin after Leave+Rejoin at ZR is broken in r23 - maybe, in r22 also

Commissioning was NOT enabled at the time of the leave request.

The device all successfully left the network. 

In 1.6.0 we would see the devices try to join again but they would be rejected with the device authorization signal showing status 2.

In 1.7.0 we see that they are allowed to rejoin with the device authorization signal showing status 0, even though commissioning was not enabled.

  • I think it unlikely we'll be able to capture this in a sniffer log, at least in the near term.  As I mentioned above, this is a rare event.  I've only seen it once in the past two days even though I've been trying to recreate it.  It's much more likely that we get a zboss assertion as described in this ticket:  https://devzone.nordicsemi.com/f/nordic-q-a/79054/coordinator-assert-during-commission-of-24-devices

    With NCS 1.7.0 some of the file,line numbers have changed, but we still see these asserts when commissioning/leaving a large number of nodes too quickly.

  •  join-after-leave-at-time-1090-ish.pcapng

    I continued experimenting and was able to recreate the problem and capture it Wireshark.  The attached pcapng capture has a couple of commission/leave cycles.  The first cycle commissioned all 22 devices, and they all left when commanded.  The second cycle commissioned 19 of 22 devices.  The other 3 were still powered up, but did not successfully join.

    The interesting stuff seems to be happening between times t=808 and t=1090 in the wireshark capture. Here's what I think I see:

    While commissioning we get the last Device Announcement message (and associated zboss signal) at t=810.25.  (We think it likely that the zboss signal was sent at about 809.5 in response to the first announcement for that device, but can't correlate our logs that precisely.)

    We see subsequent Permit Join Request broadcasts from various nodes until about t=915.

    We think commissioning has ended at t=989 because it has been 180 seconds since the last ZB_ZDO_SIGNAL_DEVICE_ANNCE from zboss, at which time we start configuring the devices.

    I waited about 1.5 additional minutes and then started asking nodes to leave.  You can see the first leave request at t=1078.9 and the second at t=1084.3.

    Just after the second leave request I see a device association and key exchange sequence for addr=0xcc20 culminating in what appears to be a successful authorization at t=1090.44.

    Note that the first log sequence I posted yesterday had a delay of 5 minutes between the end of commissioning and the first leave request.

    I'm guessing that the Permit Join Requests following the last Device Announcement message kept commissioning active in the coordinator.  We had seen the recommendation to extend commissioning time in response to ZB_ZDO_SIGNAL_DEVICE_ANNCE and we are doing that, but it seems it might also be extended by the Permit Join Request messages.  If this is correct, how can we detect this condition and correctly track the commissioning state in the coordinator/network?

    Note that we generally do not see this extended commissioning behavior.  Could it be triggered by the 3 devices that had not yet joined the network?  (All our devices are endpoint+routers.)

  • Here's another capture showing the join after leave problem.  In this case I had added code to explicitly broadcast a permit joining request with the duration set to zero in hopes of avoiding the problem.  You can see this broadcast at time t=2376.6.  Then at t=2411.6 I start sending a series of leave requests, and at t=2424.3 you can see a node joining the network.

    This capture shows a PAN Identifier Conflict at t=2424.346, which seems suspicious.  Note that all devices were previously joined to this network, then asked to leave, and then on a subsequent commissioning all but 2 devices joined back.  Those 2 devices were left powered on when the problem occurred.

    Note that this capture starts with a series of commission and leave cycles that completed without problems.  I did have to reset the coordinator once because it was not sending beacons in response to beacon requests during 2 consecutive commissioning attempts (t-1005 to t=1305).


    join-after-leave-at-time-2440.pcapng

  • Hi,

    We are unable to reproduce this issue on our side. Could you please provide the steps to reproduce it?

    Additionally, can you please clarify the packet numbers where you see this issue, as we are unable to find the behavior described in your sniffer logs.

    Best regards,

    Marte

  • In the posts above I referenced packets by their Wireshark timestamp.  For example, when I referred to time t=1090.44 I was trying to point out the packet with a Wireshark Time field of 1090.44.  As shown in the image below, timestamp t=1090.44 corresponds to packet number 10729.

    As I mentioned above, this is rare and difficult to reproduce.  I have had the most luck reproducing it when I have a large number of router-capable nodes attached.  

    I have had most luck reproducing the problem when I first join all 22 devices to the network, then ask all 22 to leave, and then enable a single commissioning window.  In this second commissioning window I often find that not all 22 devices will join the network.  In the first example above, 3 devices did not join.  Then, after the commissioning window has expired, I ask devices to leave the network.  In some cases devices will re-join after leaving, as indicated in my previous posts.


    Our coordinator is built using NCS 1.7.0.  We have enabled legacy support for compatibility with older HA devices by calling zb_bdb_set_legacy_device_support(1).

    The 22 router-capable devices were built using the nRF5 SDK 4.1.0 zigbee_light_bulb example with a custom manufacturer and model ID.

    Please review both of my posts and attachements from yesterday, as these describe where in the logs I ask the devices to leave and when they re-join.  I also indicated other timestamps of packets that I thought were relevant and/or suspicious. 

    Thanks for your help.

Related