Zigbee ZBOSS Fatal Error After Changing ZB_DEV_REJOIN_TIMEOUT_MS

We are reproducing an issue where the ZBOSS stack emits a ZBOSS Fatal Error without any context when we alter ZB_DEV_REJOIN_TIMEOUT_MS. The steps to reproduce are:

1. Build the light_bulb sample with the device configured as a sleepy end device with TC_REJOIN enabled, adding the build flag ZB_DEV_REJOIN_TIMEOUT_MS=604800000:

west build -b nrf52840dk_nrf52840 -- -DCMAKE_C_FLAGS="-DZB_DEV_REJOIN_TIMEOUT_MS=604800000"

2. Connect to a hub successfully

3. Remove power from the hub so that the end device enters rejoin steering. After 30-35 minutes of steering, we observe a ZBOSS fatal error.

Under these circumstances, we reproduce this failure every time on the light_bulb sample app with the nrf52840dk. Since the fatal error is emitted within the ZBOSS stack and has no surrounding context, we are unsure what the underlying cause is. If we do not modify ZB_DEV_REJOIN_TIMEOUT_MS, no fatal error is seen. We'd like to increase ZB_DEV_REJOIN_TIMEOUT_MS to make the device more resilient to loss of power in the coordinator, but this behavior seems to prohibit that.

nRF Connect SDK v2.6.0. Attached below are the console log, the ZBOSS trace, and the built hex app that produces the issue.

/cfs-file/__key/communityserver-discussions-components-files/4/1854.merged.hex

/cfs-file/__key/communityserver-discussions-components-files/4/zboss_2D00_fatal_2D00_error.log.zip

  • I believe we have a workaround for KRKNWK-12017 in place, and I don't think it would alleviate any behavior here since ultimately it also just triggers a reset which we're trying to avoid.

    This is on NCS 2.6.1

    Thanks

    Kendall

  • Hi Kendall, 

    Just letting you know that we've now reproduced this on our side. I'll keep you updated.

    Regards,

    Elfving

  • Elfving,

    I believe we've root caused this issue. It seems that with the default values of TC_REJOIN_INTERVAL_THRESHOLD_S (120s) and ZB_DEV_REJOIN_TIMEOUT_MS (200s) in effect, we would expect to see 200s of total rejoin attempts, 120s using Secure Rejoins and the remaining 80s using TC Rejoins. Because of how the exponential backoff is performed by the SDK, this results in 7 secure rejoin calls (start_network_steering) and 1 TC rejoin call (zb_bdb_initiate_tc_rejoin) before being cancelled. This masks an error in the logic of `rejoin_the_network` where the TC rejoin decision relies upon `zb_bdb_is_factory_new`. Starting a TC rejoin makes zb_bdb_is_factory_new()=1, which then means later retries in `rejoin_the_network` will go back to calling `start_network_steering` erroneously.
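
    The default-value arithmetic above can be checked with a small model. This is only a sketch under stated assumptions: the retry delay is assumed to start at 1 s and double after every attempt up to the cap, and an attempt is counted as a TC rejoin once the elapsed time exceeds the threshold. The struct, function, and parameter names are hypothetical and are not taken from the SDK.

    ```c
    #include <stdint.h>

    /* Hypothetical model of the rejoin backoff. The names and the 1 s
     * starting delay are assumptions for illustration, not SDK code. */
    struct rejoin_counts {
        uint32_t secure_rejoins; /* attempts made at or below the TC threshold */
        uint32_t tc_rejoins;     /* attempts made after the TC threshold */
    };

    static struct rejoin_counts model_rejoin_attempts(uint32_t tc_threshold_s,
                                                      uint32_t total_timeout_s,
                                                      uint32_t max_delay_s)
    {
        struct rejoin_counts counts = {0, 0};
        uint32_t elapsed_s = 0;
        uint32_t delay_s = 1; /* assumed initial backoff delay */

        /* Keep retrying until the overall rejoin window has expired. */
        while (elapsed_s < total_timeout_s) {
            if (elapsed_s > tc_threshold_s) {
                counts.tc_rejoins++;
            } else {
                counts.secure_rejoins++;
            }

            elapsed_s += delay_s;

            /* Exponential backoff: double the delay, clamped to the cap. */
            if (delay_s < max_delay_s) {
                delay_s *= 2;
                if (delay_s > max_delay_s) {
                    delay_s = max_delay_s;
                }
            }
        }

        return counts;
    }
    ```

    Calling `model_rejoin_attempts(120, 200, 900)` under these assumptions gives 7 secure rejoins and 1 TC rejoin, which matches the counts described for the default values.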

    If we remove the 200s limit, the behavior is instead 6x start_network_steering calls, 1x zb_bdb_initiate_tc_rejoin call (after ~255s), then endless calls to start_network_steering after that. This becomes an issue because the call to `zb_bdb_initiate_tc_rejoin` unlocks `rejoin_the_network`, which can then double schedule a `start_network_steering` call back to back with the same timeout value once it reaches the 900s cap. Double scheduling this call results in a fatal error, because an `nlme_reset` process is executed before the callback from the previous nlme_reset has been cleared.

    In ZBOSS R23, it looks like `zb_bdb_is_factory_new` has been corrected to use a bdb_not_ever_joined() call instead of bdb_joined(). With this change applied, the network steering behavior performs secure rejoins 7 times, then continually performs TC rejoins after that, as expected. This prevents the double scheduling of `start_network_steering` that was previously resulting in a fault.

    for nrfxlib:

    diff --git a/zboss/production/src/commissioning/bdb/zdo_commissioning_bdb.c b/zboss/production/src/commissioning/bdb/zdo_commissioning_bdb.c
    index 25782592..80dbcf15 100644
    --- a/zboss/production/src/commissioning/bdb/zdo_commissioning_bdb.c
    +++ b/zboss/production/src/commissioning/bdb/zdo_commissioning_bdb.c
    @@ -2581,7 +2581,7 @@ void zb_set_bdb_commissioning_mode(zb_uint8_t commissioning_mode)
     
     zb_bool_t zb_bdb_is_factory_new()
     {
    -  return (zb_bool_t)!bdb_joined();
    +  return (zb_bool_t)bdb_not_ever_joined();
     }
     
     
    

    After this change, we're seeing

    - no crash after 35 min

    - proper TC rejoin behavior working OK for as long as our rejoin max timeout is configured

    - proper trusted rejoin behavior during initial 120s

    Thanks,

    Kendall

  • Hi Kendall,

    Well done finding this yourself. Our Zigbee team recently arrived at a solution as well. Our fix for this will probably be added to NCS v3.0.0. You could ask your RSM when that will be available, but as you have a solution yourself I assume you do not want to bother waiting. 

    Regarding ZB_DEV_REJOIN_TIMEOUT_MS though, I should mention that there seems to be a maximum value for it in this macro here. There is a 32-bit overflow. This means that the maximum should be around 0xFFFFFFFF/1000 ms, which is about 71 minutes. 

    In theory this limitation could be overcome by avoiding the macro and passing the timeout value in beacon intervals directly in start_network_rejoin(), but we can't guarantee that this approach would be crash-free. We haven't tested this ourselves. If you want to try this, the max possible value would be about 1180 h.
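
    The 71-minute figure can be reproduced with a short sketch. It assumes, as the 0xFFFFFFFF/1000 estimate implies, that the conversion multiplies the millisecond value by 1000 in 32-bit arithmetic. The helper names below are hypothetical and are not the ZBOSS macro itself.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Largest millisecond value whose product with 1000 still fits in 32 bits. */
    #define MAX_SAFE_TIMEOUT_MS (UINT32_MAX / 1000u)

    /* Hypothetical helper: true if timeout_ms * 1000 would not fit in 32 bits. */
    static bool ms_to_us_wraps_u32(uint32_t timeout_ms)
    {
        return timeout_ms > MAX_SAFE_TIMEOUT_MS;
    }

    /* 32-bit conversion: the product silently wraps modulo 2^32. */
    static uint32_t ms_to_us_u32(uint32_t timeout_ms)
    {
        return timeout_ms * 1000u;
    }

    /* 64-bit conversion: the product is exact for any 32-bit input. */
    static uint64_t ms_to_us_u64(uint32_t timeout_ms)
    {
        return (uint64_t)timeout_ms * 1000u;
    }
    ```

    Under this assumption, `MAX_SAFE_TIMEOUT_MS` is 4294967 ms (about 71 minutes), so the 604800000 ms value from the original build flag is far past the point where the 32-bit product wraps.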

    Regards,

    Elfving

  • Elfving,

    Thanks for flagging the macro limitation. I didn't catch that on my first read through. We'll bypass it and run some extended testing to ensure the time window is still functional.

    FWIW: we do think this default behavior is pretty anomalous and misaligned with other Zigbee implementations. On a majority of the low-power / sleepy Zigbee End Devices that we use in our lab, the device basically always successfully performs either a secure rejoin or a TC rejoin, even if offline for days at a time. Especially on battery-powered end devices, it seems to us that a simple power outage taking the hub offline for longer than the 200s rejoin window should not cause a permanent loss of connection, though I can understand some concerns with e.g. attempting legacy TC rejoins for a long period of time.

    Thanks

    Kendall
