Zigbee stability & scalability issues: nRF5 SDK vs nRF Connect SDK migration

Good morning Nordic team,
we are currently evaluating a migration of our Zigbee firmware from the legacy nRF5 SDK for Thread & Zigbee v4.2.0 to the nRF Connect SDK (Zephyr-based) on the nRF52840, and we would like your technical feedback before proceeding.

Our current setup is:
SoC: nRF52840
SDK: nRF5_SDK_for_Thread_and_Zigbee v4.2.0 plus a Green Power add-on
Device role: Zigbee Coordinator, managing a network with a target size of about 60 devices (wired devices, battery devices and Green Power devices)


Issues observed in the field

1. Difficult device joining as network size increases. When the number of devices grows:
- New devices often fail to join on the first attempt
- Multiple retries are frequently required
- In some situations the coordinator becomes unresponsive, or it resets during heavy join activity

2. Critical issues with battery-powered devices
The most problematic scenario is when multiple battery devices attempt to join and are queried immediately after joining (e.g. reading firmware version, model, etc.)
- Responses from devices are often lost or never received
- This is especially evident when many sleepy devices join in a short time

3. Reports of battery devices leaving the network (not fully confirmed). We have received reports from field installations that:
- Some battery-powered devices may leave the network after a period of time
- This behavior is not consistently reproducible in our internal tests
- We have not yet been able to confirm whether this is due to communication failures, polling issues or external factors.

We would like your guidance to understand if migrating to nRF Connect SDK (Zephyr + newer Zigbee stack) is expected to address these issues.

1. Joining reliability and scalability
Does the Zigbee stack in nRF Connect SDK provide improved joining reliability, especially with many devices joining in parallel?
Are there known improvements in:
- handling of join procedures
- network congestion during commissioning
- internal buffering/resource management
Is this type of behavior (join failures, retries, coordinator instability) a known limitation of the nRF5 SDK for Thread & Zigbee v4.2.0 stack?

2. Handling of sleepy (battery) devices
Does nRF Connect SDK improve communication reliability with sleepy end devices, especially:
- immediately after joining
- during burst traffic (multiple attribute reads)

3. Device drop / leave behavior
Could the observed devices leaving the network over time be linked to limitations in:
- parent-child management
- polling timeouts
- buffer exhaustion
Are these aspects improved in the newer stack?

4. Stability and resets
Are coordinator resets or lockups under load something you have seen on the nRF5 SDK for Thread & Zigbee v4.2.0 Zigbee stack, and are they mitigated in nRF Connect SDK?


Based on your experience, would you expect that migrating to nRF Connect SDK would:
- significantly improve joining success rate
- improve reliability with battery devices
- reduce packet loss and missed responses
- improve overall network stability at ~50 devices

Can we consider these issues as inherent limitations of the legacy SDK, rather than application-level problems?

We are trying to determine whether migrating to nRF Connect SDK is likely to concretely solve or significantly reduce the issues we are currently experiencing in production.
Thank you in advance for your support.

Best regards,
Marco Arnaboldi

  • Hi Marco,

    I recommend migrating to nRF Connect SDK if possible. The nRF5 SDK for Thread and Zigbee is deprecated and does not receive any bug fixes or updates. Meanwhile, there have been multiple fixes and improvements directly related to stability/resource/join/rejoin issues in Zigbee in the nRF Connect SDK. I cannot guarantee that all the issues you are seeing have been fixed in the nRF Connect SDK, but you should see improvements after migrating.

    You can find a list of bug fixes and improvements in the release notes.

    I recommend migrating to R23, as that is the current version. However, I included a link to the R22 release notes as well since the R23 add-on branched out from that at some point, so part of the R22 release notes are still relevant.

    1. Joining reliability and scalability

    There are a few known issues in the nRF5 SDK that relate to the coordinator not accepting new children, asserting when commissioning multiple devices, and performance issues during heavy flash operations (such as association, rejoin, leave, Green Power commissioning and others). You can find them under "Limitations" on the "nRF5 SDK for Thread and Zigbee v4.2.0: Introduction" page.

    Here are some fixes in the ZBOSS release notes for nRF Connect SDK that relate directly to these issues:

    • [KRKNWK-19276] New children not accepted if the maximum number of children was reached
    • [ZOI-3718] ZC responds to beacon with association permit false when joining should be open
    • [ZBS-2086] - Due to IEEE hash collision parallel commissioning might override APS Key of another device
    • [ZBS-2313] - Fix key_negotiation_ctx and resources usage for all devices that join using R22 commissioning procedure to ZC
    • [ZBS-2256] - Fix allocation of ZB_CONFIG_ROLE_ZR device DLK context to allow devices join through it

    There are also multiple fixes regarding OOM, retransmission recovery, max-buffer allocation checking, insufficient APS transport-key buffer space, and APS confirm memory leakage in R23.

    2. Handling of sleepy (battery) devices

    For sleepy devices, the improvements are mostly on the sleepy end device itself, and not the coordinator. However, the resource management, rejoin, retransmission, and buffer-handling fixes could indirectly improve reliability with third-party sleepy end devices, especially during burst traffic or immediately after join.
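
    On the application side, one way to reduce burst traffic right after join is to queue newly joined devices and query only a few at a time, after a short settle period. The block below is a minimal sketch of that idea in plain C, not something taken from the release notes. It does not call any ZBOSS API; all names (`interview_queue_t`, `interview_on_join`, `interview_next`, `interview_done`) and all limits are hypothetical and would need tuning against the real network.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical limits, chosen only for illustration. */
#define INTERVIEW_QUEUE_LEN   16     /* devices that can wait to be queried */
#define INTERVIEW_MAX_ACTIVE  2      /* queries allowed in flight at once   */
#define INTERVIEW_SETTLE_MS   3000u  /* quiet time after join before a read */

typedef struct {
    uint16_t short_addr;     /* network address of the joined device */
    uint32_t not_before_ms;  /* earliest time the first read may go out */
    bool     used;
    bool     in_flight;
} interview_slot_t;

typedef struct {
    interview_slot_t slots[INTERVIEW_QUEUE_LEN];
    size_t           active; /* number of slots currently in flight */
} interview_queue_t;

void interview_init(interview_queue_t *q)
{
    memset(q, 0, sizeof(*q));
}

/* Call when a device has joined. Returns false if the queue is full. */
bool interview_on_join(interview_queue_t *q, uint16_t addr, uint32_t now_ms)
{
    for (size_t i = 0; i < INTERVIEW_QUEUE_LEN; i++) {
        if (!q->slots[i].used) {
            q->slots[i].short_addr    = addr;
            q->slots[i].not_before_ms = now_ms + INTERVIEW_SETTLE_MS;
            q->slots[i].used          = true;
            q->slots[i].in_flight     = false;
            return true;
        }
    }
    return false;
}

/* Call periodically. Returns true and sets *addr_out when a query to that
 * device may be sent now; the caller then issues the actual attribute read. */
bool interview_next(interview_queue_t *q, uint32_t now_ms, uint16_t *addr_out)
{
    if (q->active >= INTERVIEW_MAX_ACTIVE) {
        return false; /* too many queries outstanding already */
    }
    for (size_t i = 0; i < INTERVIEW_QUEUE_LEN; i++) {
        interview_slot_t *s = &q->slots[i];
        /* Signed difference keeps the comparison valid across tick wrap. */
        if (s->used && !s->in_flight &&
            (int32_t)(now_ms - s->not_before_ms) >= 0) {
            s->in_flight = true;
            q->active++;
            *addr_out = s->short_addr;
            return true;
        }
    }
    return false;
}

/* Call when the query for addr has completed or timed out. */
void interview_done(interview_queue_t *q, uint16_t addr)
{
    for (size_t i = 0; i < INTERVIEW_QUEUE_LEN; i++) {
        interview_slot_t *s = &q->slots[i];
        if (s->used && s->in_flight && s->short_addr == addr) {
            s->used      = false;
            s->in_flight = false;
            q->active--;
            return;
        }
    }
}
```

    In this sketch the coordinator application would call `interview_on_join()` from its device-joined handling, poll `interview_next()` from a timer, and call `interview_done()` from the read response or timeout path. It does not deduplicate repeated joins of the same address; that is left out to keep the example short.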

    3. Device drop / leave behavior

    Yes, the things you mention can be possible causes for devices dropping/leaving. And again, fixes related to OOM, buffering, etc., could improve this, such as:

    • [ZBS-2031] - ZC may issue RejoinResponse with address 0x0000
    • [KRKNWK-17472] Zigbee devices stops sending/receiving packets when jammed or high wireless traffic is present
    • [ZBS-2180] - Handle packets retransmission properly in case of OOM if a packet was ignored and window size two is used as it takes long time to recover
    • [ZBS-1902] - Add a parameters check for max buffer size during allocation and raise an error to prevent out of buffer case
    • [ZBS-2211] - Process return value from delayed buffer allocation in Green Power to prevent potential memory leakage
    • Improvement: Handling transmission when acknowledgment is not received

    4. Stability and resets

    This can be related to the known performance issue during heavy flash operations. It should also be improved by KRKNWK-17472, as well as several of the other fixes I have mentioned (OOM, buffer handling, etc.).

    Best regards,
    Marte
