Zigbee ZBOSS Fatal Error After Changing ZB_DEV_REJOIN_TIMEOUT_MS

We are reproducing an issue where the ZBOSS stack emits a ZBOSS Fatal Error without any context if we alter ZB_DEV_REJOIN_TIMEOUT_MS. The steps to reproduce are:

1. Build the light_bulb sample with the device configured as a sleepy end device with TC_REJOIN enabled, adding the build flag ZB_DEV_REJOIN_TIMEOUT_MS=604800000

west build -b nrf52840dk_nrf52840 -- -DCMAKE_C_FLAGS="-DZB_DEV_REJOIN_TIMEOUT_MS=604800000"

2. Connect to a hub successfully

3. Remove power from the hub so that the end device enters rejoin steering. After 30-35 minutes of steering, we observe a ZBOSS fatal error.

Under these circumstances, we reproduce this failure every time with the light_bulb sample app on the nrf52840dk. Since the fatal error is emitted within the ZBOSS stack and has no surrounding context, we are unsure what the underlying cause is. If we do not modify ZB_DEV_REJOIN_TIMEOUT_MS, no fatal error is seen. We'd like to increase ZB_DEV_REJOIN_TIMEOUT_MS so that the device is more resilient to loss of power in the coordinator, but this behavior seems to prohibit that.

nRF Connect SDK v2.6.0. Attached below are the console log, the ZBOSS trace, and the built hex app that reproduces the issue.

/cfs-file/__key/communityserver-discussions-components-files/4/1854.merged.hex

/cfs-file/__key/communityserver-discussions-components-files/4/zboss_2D00_fatal_2D00_error.log.zip

  • Hi Kendall,

    1. Build light_bulb sample with device configured as a sleepy end device with TC_REJOIN enabled, adding the build flag ZB_DEV_REJOIN_TIMEOUT_MS=604800000

    I understand that you want to save power, though a week for ZB_DEV_REJOIN_TIMEOUT_MS is very high. I'm not seeing any info in the docs about a limit for this, but I can look into that.

    Are you seeing this for any value of ZB_DEV_REJOIN_TIMEOUT_MS higher than 30-35 minutes?

    Regards,

    Elfving

  • Elfving,

    Yes, if we set ZB_DEV_REJOIN_TIMEOUT_MS to 2400000 we still see the same failure case. Further, as a workaround, we tried to leave ZB_DEV_REJOIN_TIMEOUT_MS at its default value, but instead feed user_input_indicate() regularly to restart the steering process after it times out. However, even in this case, we still see the same fatal error after 30-35 minutes.

    For our product use case, we want to be resilient to a power outage that disables the hub for some period of time, and we want the device to reconnect to the hub once it is back online without any user interaction with the product. As such, we feel that a 7-day rejoin timeout will likely ensure that the device eventually connects to the hub, while not being too expensive in power (since rejoin uses an exponential backoff capping at 15 minutes).

    Thanks

    Kendall

  • Hi again Kendall, hope you've had a nice weekend.

    I believe this is limited by REJOIN_INTERVAL_MAX_S, which is 15min by default. Could you try increasing that?

    Regards,

    Elfving

  • Hi Elfving,

    Yes, we're aware of the REJOIN_INTERVAL_MAX_S limit and OK with that. The only issue we're facing is that we cannot extend the maximum retry period beyond the 200 seconds specified by ZB_DEV_REJOIN_TIMEOUT_MS's default value, due to the crash at 35 minutes.

    Any update on this issue?

    Thanks

    Kendall

  • I am looking into whether there is some sort of maximum value for ZB_DEV_REJOIN_TIMEOUT_MS, around 35 minutes, that makes this happen.

    It could also be that you are running into KRKNWK-12017. Getting that solution in place to see if it helps might be an idea.

    What NCS version are you using btw?

    Regards,

    Elfving

  • I believe we have a workaround for KRKNWK-12017 in place, and I don't think it would alleviate any behavior here since ultimately it also just triggers a reset which we're trying to avoid.

    This is on NCS 2.6.1

    Thanks

    Kendall

  • Hi Kendall, 

    Just letting you know that we've now reproduced this on our side. I'll keep you updated.

    Regards,

    Elfving

  • Elfving,

    I believe we've root caused this issue. It seems that with the default values of TC_REJOIN_INTERVAL_THRESHOLD_S (120s) and ZB_DEV_REJOIN_TIMEOUT_MS (200s) in effect, we would expect to see 200s of total rejoin attempts, 120s using Secure Rejoins and the remaining 80s using TC Rejoins. Because of how the exponential backoff is performed by the SDK, this results in 7 secure rejoin calls (start_network_steering) and 1 TC rejoin call (zb_bdb_initiate_tc_rejoin) before being cancelled. This masks an error in the logic of `rejoin_the_network` where the TC rejoin decision relies upon `zb_bdb_is_factory_new`. Starting a TC rejoin makes zb_bdb_is_factory_new()=1, which then means later retries in `rejoin_the_network` will go back to calling `start_network_steering` erroneously.
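The attempt counts in the paragraph above can be sanity-checked with a small simulation. This is only a sketch under stated assumptions, not the SDK's code: the first attempt is taken to happen at t = 0, the delay before the next attempt starts at 1 s and doubles up to the 900 s cap, attempts are treated as instantaneous, and an attempt counts as a TC rejoin once the elapsed time has reached the 120 s threshold. The names `simulate_rejoin`, `rejoin_counts` and `check_default_case` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the thread. The timing model itself is an assumption. */
#define TC_REJOIN_THRESHOLD_S  120u
#define REJOIN_INTERVAL_MAX_S  900u

struct rejoin_counts {
	unsigned secure; /* attempts made before the TC threshold */
	unsigned tc;     /* attempts made at or after the TC threshold */
};

/*
 * Count the attempts that start before timeout_s has elapsed.
 * Assumed model: first attempt at t = 0, then a delay of 1 s that
 * doubles after every attempt and is capped at REJOIN_INTERVAL_MAX_S.
 */
static struct rejoin_counts simulate_rejoin(uint32_t timeout_s)
{
	struct rejoin_counts counts = {0u, 0u};
	uint32_t elapsed_s = 0u;
	uint32_t delay_s = 1u;

	while (elapsed_s < timeout_s) {
		if (elapsed_s < TC_REJOIN_THRESHOLD_S) {
			counts.secure++;
		} else {
			counts.tc++;
		}
		elapsed_s += delay_s;
		delay_s *= 2u;
		if (delay_s > REJOIN_INTERVAL_MAX_S) {
			delay_s = REJOIN_INTERVAL_MAX_S;
		}
	}
	return counts;
}

/* With the 200 s timeout the model gives 7 secure attempts and 1 TC attempt. */
static void check_default_case(void)
{
	struct rejoin_counts c = simulate_rejoin(200u);

	assert(c.secure == 7u);
	assert(c.tc == 1u);
}
```

Under this model the attempts land at 0, 1, 3, 7, 15, 31 and 63 s (secure) and at 127 s (TC); the next one would fall at 255 s, after the 200 s timeout. Real attempts take time to complete, so actual timestamps will drift from these values.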

    If we remove the 200s limit, the behavior instead becomes 6x start_network_steering calls, 1x zb_bdb_initiate_tc_rejoin call (after ~255s), then endless calls to start_network_steering after that. This becomes an issue because the call to `zb_bdb_initiate_tc_rejoin` unlocks `rejoin_the_network` to double schedule a `start_network_steering` call back to back, with the same timeout value since it reaches the 900s cap. Double scheduling this call results in a fatal error, as an `nlme_reset` process is executed before the callback from the previous nlme_reset has been cleared.

    In ZBOSS R23, it looks like `zb_bdb_is_factory_new` has been corrected to call bdb_not_ever_joined() instead of !bdb_joined(). When we applied this change, we saw that network steering correctly performs a secure rejoin 7 times, then continually performs TC rejoins after that, as expected. This prevents the double scheduling of `start_network_steering` that was previously resulting in a fault.

    for nrfxlib:

    diff --git a/zboss/production/src/commissioning/bdb/zdo_commissioning_bdb.c b/zboss/production/src/commissioning/bdb/zdo_commissioning_bdb.c
    index 25782592..80dbcf15 100644
    --- a/zboss/production/src/commissioning/bdb/zdo_commissioning_bdb.c
    +++ b/zboss/production/src/commissioning/bdb/zdo_commissioning_bdb.c
    @@ -2581,7 +2581,7 @@ void zb_set_bdb_commissioning_mode(zb_uint8_t commissioning_mode)
     
     zb_bool_t zb_bdb_is_factory_new()
     {
    -  return (zb_bool_t)!bdb_joined();
    +  return (zb_bool_t)bdb_not_ever_joined();
     }
     
     
    
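Why the one-line patch above changes the rejoin path can be illustrated with a toy model. Everything in it is an assumption made for illustration: the struct, its field names and the helper functions are hypothetical, and the real ZBOSS internals behind bdb_joined() and bdb_not_ever_joined() are not shown in this thread. It assumes that bdb_joined() reflects whether the device is on a network right now, and that a TC rejoin is only chosen for a device that is not factory new.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state, not a ZBOSS structure. */
struct example_dev_state {
	bool joined_now;  /* currently on a network */
	bool ever_joined; /* has joined before and still holds network data */
};

/* Before the patch: "factory new" means "not joined right now". */
static bool is_factory_new_old(const struct example_dev_state *s)
{
	return !s->joined_now;
}

/* After the patch: "factory new" means "has never joined". */
static bool is_factory_new_patched(const struct example_dev_state *s)
{
	return !s->ever_joined;
}

/* Simplified decision: a TC rejoin needs a device that is not factory new. */
static bool would_use_tc_rejoin(bool factory_new, bool past_threshold)
{
	return past_threshold && !factory_new;
}

/* A device that lost its hub: joined earlier, not joined at the moment. */
static void check_lost_hub_case(void)
{
	const struct example_dev_state lost_hub = { false, true };

	/* Old check reports factory new, so the retry falls back to steering. */
	assert(!would_use_tc_rejoin(is_factory_new_old(&lost_hub), true));
	/* Patched check keeps the device on the TC rejoin path. */
	assert(would_use_tc_rejoin(is_factory_new_patched(&lost_hub), true));
}
```

In this model the two checks only disagree for a device that has joined before but is currently off the network, which is the situation described in this thread.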

    After this change, we're seeing

    - no crash after 35 min

    - proper TC rejoin behavior working OK for as long as our rejoin max timeout is configured

    - proper trusted rejoin behavior during initial 120s

    Thanks,

    Kendall

  • Hi Kendall,

    Well done finding this yourself. Our Zigbee team recently arrived at a solution as well. Our fix for this will probably be added to NCS v3.0.0. You could ask your RSM when that will be available, but as you have a solution yourself I assume you do not want to bother waiting.

    Regarding ZB_DEV_REJOIN_TIMEOUT_MS though, I should mention that there seems to be a maximum value for it because of this macro here. There is a 32-bit overflow. This means that the maximum should be around 0xFFFFFFFF/1000 ms, which is about 71 minutes.

    In theory this limitation could be overcome by avoiding the usage of the macro and by passing the timeout value in beacon intervals directly in start_network_rejoin(), but we can’t assure this approach would be crash-free. We haven't tested this ourselves. If you want to try this, the max possible value would be about 1180 h.
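The 71-minute figure above follows from plain 32-bit arithmetic, which the sketch below works through. It assumes the overflow comes from multiplying a millisecond value by 1000 in an unsigned 32-bit type; the names used here (`EXAMPLE_USEC_PER_MSEC`, `ms_to_us_32bit`, `timeout_wraps` and so on) are hypothetical stand-ins and not the ZBOSS macro itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scale factor: microseconds per millisecond. */
#define EXAMPLE_USEC_PER_MSEC ((uint32_t)1000)

/* Largest millisecond value whose scaled result still fits in 32 bits. */
static uint32_t max_safe_timeout_ms(void)
{
	return UINT32_MAX / EXAMPLE_USEC_PER_MSEC; /* 4294967 ms, about 71.6 min */
}

/* Scale in 32 bits only: the product wraps modulo 2^32 when it is too big. */
static uint32_t ms_to_us_32bit(uint32_t ms)
{
	return (uint32_t)(ms * EXAMPLE_USEC_PER_MSEC);
}

/* Same scaling in 64 bits, used as the reference value. */
static uint64_t ms_to_us_64bit(uint32_t ms)
{
	return (uint64_t)ms * EXAMPLE_USEC_PER_MSEC;
}

/* Returns 1 when the 32-bit result no longer matches the true value. */
static int timeout_wraps(uint32_t ms)
{
	return (uint64_t)ms_to_us_32bit(ms) != ms_to_us_64bit(ms);
}

/* Timeout values mentioned in this thread. */
static void check_thread_values(void)
{
	assert(!timeout_wraps(200000u));   /* 200 s */
	assert(!timeout_wraps(2400000u));  /* 40 min */
	assert(timeout_wraps(604800000u)); /* 7 days */
}
```

Under this assumption the 7-day value wraps while the 40-minute value does not, so the overflow limit is a separate concern from the crash at 30-35 minutes.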

    Regards,

    Elfving
