Zigbee: issue similar to KRKNWK-12017

Hi,

SDK: nRF5 SDK for Thread and Zigbee v4.2.0

Chip: nRF52840

We're seeing a very rare failure in an end device implementation.

We haven't been able to reproduce it with logs because it occurs so rarely, but we have some live instrumentation that lets us retrieve the history of ZBOSS signals.

When the issue happens we get a ZB_NWK_COMMAND_STATUS_PARENT_LINK_FAILURE and the device doesn't try to rejoin the coordinator anymore.

It looks very similar to the KRKNWK-12017 issue, but there is no broken rejoin procedure before the ZB_NWK_COMMAND_STATUS_PARENT_LINK_FAILURE event.

Here is the history of ZBOSS signals for a device that was up for several days and then stopped interacting with the coordinator:

  1. ZB_ZDO_SIGNAL_PRODUCTION_CONFIG_READY
  2. ZB_ZDO_SIGNAL_SKIP_STARTUP
  3. ZB_BDB_SIGNAL_DEVICE_REBOOT
  4. ZB_NLME_STATUS_INDICATION  => ZB_NWK_COMMAND_STATUS_PARENT_LINK_FAILURE

Could you recommend a procedure to fix this? Maybe trigger a rejoin when ZB_NWK_COMMAND_STATUS_PARENT_LINK_FAILURE occurs?
FYI, resetting the device makes it rejoin the network immediately.

Thanks,

Sebastien

  • Hi Sebastien,

    Could you recommend a procedure to fix this? Maybe trigger a rejoin when ZB_NWK_COMMAND_STATUS_PARENT_LINK_FAILURE occurs?

    Yes, since you are able to trace it to a stack signal, the best workaround would be to trigger a rejoin when the signal occurs. You can add a case for ZB_NLME_STATUS_INDICATION in zboss_signal_handler that checks if the status is ZB_NWK_COMMAND_STATUS_PARENT_LINK_FAILURE. To rejoin, you can call bdb_start_top_level_commissioning(ZB_BDB_NETWORK_STEERING).
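
    A minimal sketch of that workaround, written so it compiles on a host PC without the SDK. The stand-in definitions at the top (placeholder values and a counting stub for bdb_start_top_level_commissioning) and the helper name rejoin_on_parent_link_failure are assumptions for illustration only. On the device you would drop the stand-ins, include zboss_api.h, and call the helper from the ZB_NLME_STATUS_INDICATION case of zboss_signal_handler with nlme_status.status.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side stand-ins (assumptions): they mimic only the ZBOSS names used in
 * this thread so the sketch compiles off-target. The numeric values are
 * placeholders. On the device, delete this section and use zboss_api.h. */
enum {
    ZB_NWK_COMMAND_STATUS_PARENT_LINK_FAILURE = 0x09,
    ZB_BDB_NETWORK_STEERING = 0x02
};

/* Test hook: counts how often the stubbed steering call was made. */
int steering_requests = 0;

/* Stub standing in for the real BDB call, which starts commissioning. */
bool bdb_start_top_level_commissioning(uint8_t mode_mask)
{
    (void)mode_mask;
    steering_requests++;
    return true;
}

/* Hypothetical helper: requests a rejoin through network steering when the
 * NLME status reports a parent link failure. Returns true if a rejoin was
 * requested, false if the status was something else. */
bool rejoin_on_parent_link_failure(uint8_t nlme_status)
{
    if (nlme_status != ZB_NWK_COMMAND_STATUS_PARENT_LINK_FAILURE) {
        return false;
    }
    return bdb_start_top_level_commissioning(ZB_BDB_NETWORK_STEERING);
}
```

    Keeping the check in a small helper means it can be exercised on a PC before it goes into firmware, which helps when the failure itself is too rare to reproduce.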

    If you are able to reproduce the issue and collect sniffer logs, please let me know.

    Best regards,
    Marte

  • Thanks for your input.

    How can we be sure we're not running into KRKNWK-12017, in which the ZBOSS stack requires a reset?

  • Hi,

    We have a workaround for KRKNWK-12017:

    Complete the following steps to detect when the rejoin procedure breaks and reset the device:

    1. Introduce a helper variable joining_signal_received.

    2. Extend zigbee_default_signal_handler() by completing the following steps:

      1. Set joining_signal_received to true in the following signals: ZB_BDB_SIGNAL_DEVICE_FIRST_START, ZB_BDB_SIGNAL_DEVICE_REBOOT, ZB_BDB_SIGNAL_STEERING.

      2. If leave_type is set to ZB_NWK_LEAVE_TYPE_REJOIN, set joining_signal_received to false in the ZB_ZDO_SIGNAL_LEAVE signal.

      3. Handle the ZB_NLME_STATUS_INDICATION signal to detect when the End Device failed to transmit a packet to its parent, as reported by the signal’s status ZB_NWK_COMMAND_STATUS_PARENT_LINK_FAILURE.

    See the following code snippet for an example:

    /* Add helper variable that will be used for detecting broken rejoin procedure. */
    /* Flag indicating if joining signal has been received since restart or leave with rejoin. */
    bool joining_signal_received = false;
    /* Extend the zigbee_default_signal_handler() function. */
    case ZB_BDB_SIGNAL_DEVICE_FIRST_START:
        ...
        joining_signal_received = true;
        break;
    case ZB_BDB_SIGNAL_DEVICE_REBOOT:
        ...
        joining_signal_received = true;
        break;
    case ZB_BDB_SIGNAL_STEERING:
        ...
        joining_signal_received = true;
        break;
    case ZB_ZDO_SIGNAL_LEAVE:
        if (status == RET_OK) {
            zb_zdo_signal_leave_params_t *leave_params = ZB_ZDO_SIGNAL_GET_PARAMS(sig_hndler, zb_zdo_signal_leave_params_t);
            LOG_INF("Network left (leave type: %d)", leave_params->leave_type);
    
            /* Set joining_signal_received to false so broken rejoin procedure can be detected correctly. */
            if (leave_params->leave_type == ZB_NWK_LEAVE_TYPE_REJOIN) {
                joining_signal_received = false;
            }
        ...
        break;
    case ZB_NLME_STATUS_INDICATION: {
        zb_zdo_signal_nlme_status_indication_params_t *nlme_status_ind =
            ZB_ZDO_SIGNAL_GET_PARAMS(sig_hndler, zb_zdo_signal_nlme_status_indication_params_t);
        if (nlme_status_ind->nlme_status.status == ZB_NWK_COMMAND_STATUS_PARENT_LINK_FAILURE) {
    
            /* Check for broken rejoin procedure and restart the device to recover. */
            if (stack_initialised && !joining_signal_received) {
                zb_reset(0);
            }
        }
        break;
    }

    Please note that this code snippet is from the nRF Connect SDK, but it should be very similar in nRF5 SDK for Thread and Zigbee.

    Best regards,
    Marte

  • So to handle both cases, would you recommend this?

    case ZB_NLME_STATUS_INDICATION: {
        zb_zdo_signal_nlme_status_indication_params_t *nlme_status_ind =
            ZB_ZDO_SIGNAL_GET_PARAMS(sig_hndler, zb_zdo_signal_nlme_status_indication_params_t);
        if (nlme_status_ind->nlme_status.status == ZB_NWK_COMMAND_STATUS_PARENT_LINK_FAILURE) {
    
            /* Check for broken rejoin procedure and restart the device to recover. */
            if (stack_initialised && !joining_signal_received) {
                zb_reset(0);
            } else {
                bdb_start_top_level_commissioning(ZB_BDB_NETWORK_STEERING);
            }
        }
        break;
    }

  • Hi,

    Yes, that should take care of both types of failures. However, I recommend testing it to verify that it works as expected.
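
    One way to test this off-target is to pull the decision out of the handler into a pure function and assert on every combination of inputs. The sketch below assumes exactly the logic from the snippet above; choose_recovery and recovery_action_t are hypothetical names, not SDK symbols.

```c
#include <stdbool.h>

/* Hypothetical names throughout: none of these come from the SDK. */
typedef enum {
    RECOVERY_NONE,   /* status is not a parent link failure: do nothing */
    RECOVERY_RESET,  /* broken rejoin (KRKNWK-12017 pattern): reset the stack */
    RECOVERY_REJOIN  /* any other parent link failure: start network steering */
} recovery_action_t;

/* Mirrors the if/else from the handler case: a parent link failure with the
 * stack initialised and no joining signal seen means the rejoin procedure is
 * broken, so reset; every other parent link failure triggers a rejoin. */
recovery_action_t choose_recovery(bool parent_link_failure,
                                  bool stack_initialised,
                                  bool joining_signal_received)
{
    if (!parent_link_failure) {
        return RECOVERY_NONE;
    }
    if (stack_initialised && !joining_signal_received) {
        return RECOVERY_RESET;
    }
    return RECOVERY_REJOIN;
}
```

    In the handler, RECOVERY_RESET would map to zb_reset(0) and RECOVERY_REJOIN to bdb_start_top_level_commissioning(ZB_BDB_NETWORK_STEERING).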

    Best regards,
    Marte

  • Yes, that should take care of both types of failures. However, I recommend testing it to verify that it works as expected.

    Let me be clear here. It's the second major issue we've had with the nRF52840 and Zigbee SDK 4.2.0 (released recently, and supposed to be a stable production release).

    We had to recall products because of stability issues in the SDK / ZBOSS stack.
    With the closed-source ZBOSS stack and the binary trace, it's impossible for us to debug on our own, and debugging with support is time- and resource-consuming.

    KRKNWK-12017 is clearly a production-stop issue. Why hasn't it been fixed in the ZBOSS stack? Why isn't the workaround officially integrated?

    I really like the software ecosystem, the SDK and such, but quality should be a high priority at Nordic.

    So in conclusion, yes, don't worry, we will test.

  • FYI:
    In the current product we need 2 watchdogs, 1 periodic reset every 12 hours, and the KRKNWK-12017 workaround to keep the product stable.

  • Hi,

    Cheb said:
    KRKNWK-12017 is clearly a production-stop issue. Why hasn't it been fixed in the ZBOSS stack? Why isn't the workaround officially integrated?

    The issue has been reported to DSR. However, the nRF5 SDK for Thread and Zigbee is deprecated and will not be upgraded.

    Best regards,
    Marte
