We wanted to share our feedback after recently going through Zigbee 3.0.1 BDB/cluster certification using the ZBOSS stack with our end device product. While we were ultimately able to meet the CSA's requirements for certification, we ran into a large number of cases where the stack, used as shown in the examples, proved to be non-certifiable, despite the ZBOSS platform being described as 'certification ready' for Zigbee.
Our hope is that these issues are addressed internally within the NCS + ZBOSS Zigbee components in order to reduce friction for other developers (and us, in the future) when trying to bring a Zigbee product to market with Nordic.
Immediate Post-OTA Validation (OTA Client Cluster)
The Zigbee spec lays out (11.13.9.3):
If the image fails any integrity checks, the client SHALL send an Upgrade End Request command to the
upgrade server with a status of INVALID_IMAGE. In this case, the client MAY reinitiate the upgrade
process in order to obtain a valid OTA upgrade image. The client SHALL not upgrade to the bad image and
SHALL discard the downloaded image data.
In our application, we use OTA via the nRF Connect SDK (NCS) Zigbee FOTA library with MCUboot as the underlying bootloader. Since our images are signed for MCUboot, bad images are always rejected and discarded at boot, as expected. However, the stack does not take care of sending the INVALID_IMAGE status, which is explicitly required by the CSA test cases around OTA. Since MCUboot does not provide an effective pathway for user-space validation of the OTA image, the most straightforward approach is to download the image, reset into MCUboot to confirm the image is valid, then boot back into the application and send an INVALID_IMAGE or OK response accordingly. However, resetting disconnects the device from the Zigbee network, so this is hard to do in a timely fashion, and it is not clear whether it would actually pass the CSA test cases.
Our solution for this was to implement user-space validation for MCUboot images, where we decrypt/validate the image payload from the app space. This requires importing the signing/encryption keys into the app and duplicating many of MCUboot's responsibilities. This is not a trivial solution. Ideally, we would at least be able to call into MCUboot from the app space to re-use the existing validation logic, rather than repeating it in the app. Unless there is a better solution that we are unaware of, this flow makes it much harder to certify the OTA client cluster with ZBOSS + NCS.
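To make the shape of that app-space check concrete, here is a minimal sketch of only the first, cheapest stage: a structural sanity check of the MCUboot image header, mapped to the ZCL status the OTA client would report. It is not our production code and not an NCS or ZBOSS API. The function name ota_status_for_image is made up for illustration, the header offsets are based on our understanding of MCUboot's image header layout, and the hash/signature/decryption steps that make up the real work are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* ZCL status codes reported in the OTA Upgrade End Request. */
#define ZCL_STATUS_SUCCESS        0x00u
#define ZCL_STATUS_INVALID_IMAGE  0x96u

/* MCUboot image header magic and minimum header size (assumed from
 * MCUboot's image header definition; verify against your MCUboot version). */
#define MCUBOOT_IMAGE_MAGIC       0x96f3b83du
#define MCUBOOT_MIN_HDR_SIZE      32u

/* Read little-endian fields without relying on alignment. */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: structural check of a downloaded image.
 * Assumed header offsets: magic @0, hdr_size @8, protected TLV size @10,
 * image size @12. Returns the ZCL status to send to the OTA server. */
uint8_t ota_status_for_image(const uint8_t *img, size_t len)
{
    if (img == NULL || len < MCUBOOT_MIN_HDR_SIZE) {
        return ZCL_STATUS_INVALID_IMAGE;
    }
    if (read_le32(img) != MCUBOOT_IMAGE_MAGIC) {
        return ZCL_STATUS_INVALID_IMAGE;
    }

    uint16_t hdr_size = read_le16(img + 8);
    uint16_t protected_tlv_size = read_le16(img + 10);
    uint32_t img_size = read_le32(img + 12);

    if (hdr_size < MCUBOOT_MIN_HDR_SIZE) {
        return ZCL_STATUS_INVALID_IMAGE;
    }

    /* 64-bit sum so a corrupt size field cannot wrap around. */
    uint64_t needed = (uint64_t)hdr_size + img_size + protected_tlv_size;
    if (needed > (uint64_t)len) {
        return ZCL_STATUS_INVALID_IMAGE;
    }

    /* A real implementation would now verify the hash and signature TLVs
     * (and decrypt if needed). That part is omitted from this sketch. */
    return ZCL_STATUS_SUCCESS;
}
```

Even this trivial stage has to duplicate knowledge of the bootloader's image format inside the application, which is the maintenance burden described above.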
FOTA Endpoint Restriction
The Zigbee FOTA implementation is restricted to a single endpoint, which is reserved and may only be used for the OTA client implementation. While this isn't against any spec, it does add complications that we ran into during product introduction. Namely, the testing software + harnesses established by the CSA are primarily designed around single-endpoint tests, and switching endpoints for OTA tests versus product functional tests proved to be a difficult point of communication between our engineers + the test lab. Further, our Zigbee controller partners in the industry have given us direct feedback that the multiple-endpoint design for OTA is causing customers to have slower experiences during pairing / discovery of our Zigbee device, due to the additional endpoint discovery required.
This is a limitation explicitly called out in the documentation, but it isn't very visible to end device developers, so we thought it was important to highlight since there were noticeable caveats that arose during product certification and field trial. We will likely overcome this limitation by patching the FOTA system to enable better access to the FOTA endpoint.
Resetting Attribute State on Leave
The Zigbee BDB 3.0.1 spec defines (9.4):
Zigbee-PRO provides an Mgmt_Leave_req ZDO command which is designed to request that a remote node leaves the network by clearing all Zigbee persistent data (see sub-clause 6.9), except the outgoing NWK frame counter, and perform a reset such that the node is in much the same state as it was when it left the factory.
In our findings, a leave request does not fully return the device to its factory state, because some amount of data is cached in memory. While NVRAM writes are performed to wipe the network / attribute state as expected, and the device leaves the network immediately, attribute state is still generally preserved in memory, allowing those values to carry over to a new network if the device is re-paired without power cycling. This causes the device to explicitly fail CSA's test suite, which specifically tests configuring a reporting interval on an attribute, sending a leave request, re-pairing, and checking that the reporting interval is the default value, not the one previously configured. The sample applications that handle the ZB_ZDO_SIGNAL_LEAVE signal do not demonstrate any need to reboot the system (the default signal handler merely starts joining a new network upon leaving, without any reset), so a device that follows them fails these CSA test cases.
Our workaround for this issue was simply to trigger a full factory reset + power cycle the DUT when we receive the appropriate leave signal (Mgmt_Leave_req with rejoin = false).
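The difference between the two behaviors can be shown with a toy model. This is purely illustrative: it is not ZBOSS code, and every name in it (device_model_t, the two model_* functions, the 60 second default) is invented for the example. It only captures the idea that the running stack uses an in-RAM copy of the attribute state, and that a leave which wipes NVRAM but does not reboot leaves that copy untouched.

```c
#include <stdint.h>

/* Hypothetical default reporting interval, in seconds. */
#define DEFAULT_REPORT_INTERVAL_S 60u

/* Toy model of one configurable attribute setting. */
typedef struct {
    uint16_t nvram_interval; /* persisted copy */
    uint16_t ram_interval;   /* copy the running stack actually uses */
} device_model_t;

/* Configure Reporting updates both copies. */
void model_configure_reporting(device_model_t *d, uint16_t interval_s)
{
    d->nvram_interval = interval_s;
    d->ram_interval = interval_s;
}

/* Behavior we observed: persistent storage is wiped, RAM state survives. */
void model_leave_without_reboot(device_model_t *d)
{
    d->nvram_interval = DEFAULT_REPORT_INTERVAL_S;
}

/* Our workaround: wipe persistent storage, then reboot, so the RAM copy
 * is rebuilt from the (now default) persisted state. */
void model_leave_with_factory_reset(device_model_t *d)
{
    d->nvram_interval = DEFAULT_REPORT_INTERVAL_S;
    d->ram_interval = d->nvram_interval; /* effect of the reboot */
}
```

In the first case the device re-pairs with the previously configured interval still active, which is exactly what the CSA test checks for. In the second case it comes back with the default.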
NWK_addr_req Extended Response
When the device is configured as an end device, it seems that NWK_addr_req requests with RequestType = Extended Response are not well-handled. The specification does not clearly define the behavior for an end device handling this request, only for coordinators and routers, in section 2.4.3.1.1.2:
If the RequestType was Extended response and the Remote Device is either the ZigBee coordinator or router, a NWK_addr_resp command shall be generated and sent back ...
As such, the expected response to this command in this situation is not clear. However, we found that the CSA test cases test this behavior (an end device receiving NWK_addr_req with RequestType = Extended Response), and our DUT failed. Seemingly, the stack responds to NWK_addr_req with RequestType = Extended Response in the same way that it would respond to RequestType = Single Device Response, rather than returning an error for an unsupported argument or otherwise handling the request. This caused an error in the CSA test executor as it tried to unpack a shorter, single device response as a longer, extended response, when the test was really not expecting a response in the first place. It is not clear to us whether this is an explicit issue with the ZBOSS implementation or a short-sighted mistake in the CSA's test development, but in any case, our DUT did not pass this test and we had to obtain a special exemption from the CSA to certify with this result. Since this behavior is contained within the ZBOSS stack, we did not have much surface area for a possible resolution on our side.
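For anyone trying to reproduce the unpacking error, the length mismatch can be sketched as follows. This is our own illustration, not CSA test harness code or a ZBOSS API. The field sizes are our reading of the NWK_addr_rsp layout in the ZDO spec (Status, IEEE address, NWK address, then the optional NumAssocDev, StartIndex and address list), and classify_nwk_addr_rsp is a made-up name. The payload is assumed to start after the ZDO transaction sequence number.

```c
#include <stddef.h>
#include <stdint.h>

/* Status (1) + IEEEAddrRemoteDev (8) + NWKAddrRemoteDev (2). */
#define NWK_ADDR_RSP_SINGLE_LEN 11u

typedef enum {
    RSP_MALFORMED,
    RSP_SINGLE,   /* single device response layout */
    RSP_EXTENDED  /* extended response layout */
} rsp_kind_t;

/* Hypothetical helper: decide which layout a NWK_addr_rsp payload has,
 * based only on its length and the NumAssocDev field. */
rsp_kind_t classify_nwk_addr_rsp(const uint8_t *payload, size_t len)
{
    if (payload == NULL || len < NWK_ADDR_RSP_SINGLE_LEN) {
        return RSP_MALFORMED;
    }
    if (len == NWK_ADDR_RSP_SINGLE_LEN) {
        return RSP_SINGLE;
    }

    /* Extended layout: NumAssocDev follows the fixed fields. */
    uint8_t num_assoc = payload[NWK_ADDR_RSP_SINGLE_LEN];
    if (num_assoc == 0u) {
        /* No StartIndex or address list when there are no entries. */
        return (len == NWK_ADDR_RSP_SINGLE_LEN + 1u) ? RSP_EXTENDED
                                                     : RSP_MALFORMED;
    }

    /* NumAssocDev (1) + StartIndex (1) + 2 bytes per listed address. */
    size_t expected = NWK_ADDR_RSP_SINGLE_LEN + 2u + 2u * (size_t)num_assoc;
    return (len == expected) ? RSP_EXTENDED : RSP_MALFORMED;
}
```

A parser that assumes the extended layout unconditionally will read past the end of an 11-byte single device response, which matches the failure we saw in the test executor.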
Access to standard ZBOSS PICS descriptors
One of the requirements for filing for CSA certification is a collection of PICS files that describe the fundamental behavior of the product + its networking behavior. While some of these aspects are user-controlled (e.g. which clusters were implemented, and how much of each), some of the PICS files (especially the BDB 3.0.1 definition) rely on a lot of information that is internal to the ZBOSS stack and not user-facing. While many of the required PICS questions can be answered with testing + analysis of the Zigbee stack, we would expect the certifiable NCS + ZBOSS implementation to provide enough basic information about the stack internals to fill out the PICS requirements plainly. We have found that other chipset manufacturers provide detailed information about which PICS items need to be filled out, and how to fill them out with respect to their stack implementations. This can greatly reduce the amount of noise involved in approaching certification, since many of the nuanced questions concern ZBOSS details that are transferable across all implementations.
The issues listed above caused great friction with our test lab + with the CSA certification process, which was unexpected during final development and complicated our product timelines. Many of these issues are subtle and hard to catch until final certification is underway, which makes them especially painful during the development process with ZBOSS.