Zigbee stability & scalability issues: nRF5 SDK vs nRF Connect SDK migration

Good morning Nordic team,
We are currently evaluating a migration of our Zigbee firmware from the legacy nRF5 SDK for Thread & Zigbee v4.2.0 to the nRF Connect SDK (Zephyr-based) on the nRF52840, and we would like your technical feedback before proceeding.

Our current setup is:
- SoC: nRF52840
- SDK: nRF5_SDK_for_Thread_and_Zigbee v4.2.0 plus a Green Power add-on
- Device role: Zigbee Coordinator, managing a network with a target size of about 60 devices (wired, battery-powered and Green Power devices)


Issues observed in the field

1. Device joining becomes unreliable as network size increases. When the number of devices grows:
- New devices often fail to join on the first attempt
- Multiple retries are frequently required
- In some situations the coordinator becomes unresponsive, or it resets during heavy join activity

2. Critical issues with battery-powered devices
The most problematic scenario is when multiple battery-powered devices attempt to join and are queried immediately after joining (e.g. reading firmware version, model, etc.):
- Responses from devices are often lost or never received
- This is especially evident when many sleepy devices join in a short time
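
To make the scenario above concrete: today the queries go out as soon as each device joins. An application-level alternative would be to queue the post-join queries and release them one at a time with a gap between them. The following is only a minimal sketch of that idea, not our production code. All names, the queue size and the spacing value are hypothetical, and it does not use any nRF5 SDK or ZBOSS API. We have not verified whether this kind of pacing resolves the lost responses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INTERVIEW_QUEUE_SIZE 16   /* hypothetical capacity */
#define INTERVIEW_SPACING_MS 2000 /* hypothetical gap between two interviews */

/* FIFO of short addresses waiting for their post-join attribute reads. */
typedef struct {
    uint16_t short_addr[INTERVIEW_QUEUE_SIZE];
    size_t   head;            /* index of the oldest entry */
    size_t   count;           /* number of queued entries */
    uint32_t next_allowed_ms; /* earliest time the next interview may start */
} interview_queue_t;

static void interview_queue_init(interview_queue_t *q)
{
    q->head = 0;
    q->count = 0;
    q->next_allowed_ms = 0;
}

/* Called from the (hypothetical) device-joined handler instead of sending
 * the reads right away. Returns false when the queue is full. */
static bool interview_queue_push(interview_queue_t *q, uint16_t short_addr)
{
    if (q->count == INTERVIEW_QUEUE_SIZE) {
        return false;
    }
    q->short_addr[(q->head + q->count) % INTERVIEW_QUEUE_SIZE] = short_addr;
    q->count++;
    return true;
}

/* Called periodically with the current uptime in milliseconds. Returns true
 * and writes *short_addr when the next queued device may be queried. */
static bool interview_queue_pop_due(interview_queue_t *q, uint32_t now_ms,
                                    uint16_t *short_addr)
{
    if (q->count == 0) {
        return false;
    }
    /* Signed difference keeps the comparison valid across timer wrap-around. */
    if ((int32_t)(now_ms - q->next_allowed_ms) < 0) {
        return false;
    }
    *short_addr = q->short_addr[q->head];
    q->head = (q->head + 1) % INTERVIEW_QUEUE_SIZE;
    q->count--;
    q->next_allowed_ms = now_ms + INTERVIEW_SPACING_MS;
    return true;
}
```

In this sketch the join handler would call interview_queue_push(), and a periodic timer would call interview_queue_pop_due() and send the attribute reads only for the address it returns.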

3. Reports of battery devices leaving the network (not fully confirmed). We have received reports from field installations that:
- Some battery-powered devices may leave the network after a period of time
- This behavior is not consistently reproducible in our internal tests
- We have not yet been able to confirm whether this is due to communication failures, polling issues or external factors.

We would like your guidance to understand if migrating to nRF Connect SDK (Zephyr + newer Zigbee stack) is expected to address these issues.

1. Joining reliability and scalability
Does the Zigbee stack in nRF Connect SDK provide improved joining reliability, especially with many devices joining in parallel?
Are there known improvements in:
handling of join procedures
network congestion during commissioning
internal buffering/resource management
 Is this type of behavior (join failures, retries, coordinator instability) a known limitation of the nRF5 SDK for Thread & Zigbee v4.2.0 stack?

2. Handling of sleepy (battery) devices
Does nRF Connect SDK improve communication reliability with sleepy end devices, especially:
- immediately after joining
- during burst traffic (multiple attribute reads)

3. Device drop / leave behavior
Could the observed devices leaving the network over time be linked to limitations in:
- parent-child management
- polling timeouts
- buffer exhaustion
Are these aspects improved in the newer stack?

4. Stability and resets
Are coordinator resets or lockups under load something you have seen with the nRF5 SDK for Thread & Zigbee v4.2.0 Zigbee stack, and are they mitigated in nRF Connect SDK?


Based on your experience, would you expect that migrating to nRF Connect SDK would:
- significantly improve joining success rate
- improve reliability with battery devices
- reduce packet loss and missed responses
- improve overall network stability at ~50 devices

Can we consider these issues as inherent limitations of the legacy SDK, rather than application-level problems?

We are trying to determine whether migrating to nRF Connect SDK is likely to solve, or at least significantly reduce, the issues we are currently experiencing in production.
Thank you in advance for your support.

Best regards,
Marco Arnaboldi
